Last week in Stream Processing & Analytics 10/31/2016

This is the 38th installment of my blog series around Stream Processing and Analytics.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Comparison

Apache Kafka

Apache Storm / Heron

Spark Streaming

Apache Flink

Apache NiFi / Hortonworks Data Flow (HDF)

New Presentations

New Videos

New Podcasts

New Releases

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!


Last week in Stream Processing & Analytics 10/25/2016

This is the 37th installment of my blog series around Stream Processing and Analytics.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Apache Kafka

Apache Storm / Heron

Apache Samza

Apache Flink

Apache Beam

StreamSets

Apache NiFi / Hortonworks Data Flow (HDF)

New Presentations

New Videos

New Releases

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!

Additional Stage Libraries with dockerized StreamSets

In my previous article New Package Manager in Action I showed three possible methods for installing additional stage libraries onto a StreamSets Data Collector core distribution.

In this article I will demonstrate how this can be used with the Docker distribution of StreamSets Data Collector (SDC). Starting with SDC v2.1.0.1, the public Docker image on Docker Hub is no longer based on the full version but on the smaller, customizable core version.

So how can we use additional stage libraries with a dockerized SDC? There are basically two ways to deal with it.

  1. Manual Installation: using one of the three alternative ways shown in my previous article New Package Manager in Action
  2. Automatic Installation: using a derived Docker image that installs the needed stage libraries automatically, either at build time or at run (start) time

1. Manual Installation

For the manual approach we can use the three alternative ways shown in the previous blog article:

Using the UI

Start a docker container:

docker run -p 18630:18630 -d --name sdc streamsets/datacollector:2.1.0.1

Open StreamSets Data Collector at http://localhost:18630 and navigate to the Package Manager at http://localhost:18630/collector/packageManager. Install the required stage libraries and then restart the Docker container:

docker restart sdc

Using the CLI

Start a container the same way as above:

docker run -p 18630:18630 -d --name sdc streamsets/datacollector:2.1.0.1

To use the CLI, create a new bash session in the running sdc container:

docker exec -ti sdc bash

Inside the bash shell you can install all the necessary stage libraries using the streamsets stagelibs command. Unfortunately, there is a problem with the stagelibs script when running on Alpine Linux, which the StreamSets Docker image is based on (GitHub Ticket). The following command changes the problematic -status argument to -s:

sed -i -e 's/run sha1sum --status/run sha1sum -s/g' $SDC_DIST/libexec/_stagelibs

After that fix, the stagelibs command should work:

$SDC_DIST/bin/streamsets stagelibs -install=streamsets-datacollector-apache-kafka_0_10-lib

To activate the libraries, exit the bash session and restart the container:

docker restart sdc

Using the RESTful API

Again start a docker container:

docker run -p 18630:18630 -d --name sdc streamsets/datacollector:2.1.0.1

Navigate to http://localhost:18630/collector/restapi to open the RESTful API page. Here you can add the needed stage libraries as documented in my previous blog article under “3. Installing using the RESTful API”.
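If you prefer scripting over the web page, the same operation can be invoked with curl. Below is a minimal sketch, assuming the default admin:admin credentials, the /rest base path, and the X-Requested-By header that SDC expects on POST requests:

curl -X POST http://localhost:18630/rest/v1/stageLibraries/install \
     -u admin:admin \
     -H "Content-Type: application/json" \
     -H "X-Requested-By: sdc" \
     -d '["streamsets-datacollector-apache-kafka_0_10-lib"]'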

In order to activate the libraries, restart the docker container:

docker restart sdc

2. Automatic Installation

For the automatic installation, I have created a new Docker image which extends the StreamSets Docker image and adds the necessary commands to install additional stage libraries. It uses a variable that can be specified either at build time or at run time.

The source of this extended Docker image can be found in the TrivadisBDS GitHub repository; the built image is available on Docker Hub.

The Dockerfile below shows the extensions made to the standard StreamSets Docker image:

FROM streamsets/datacollector:2.1.0.1
MAINTAINER Guido Schmutz <guido.schmutz@trivadis.com>

ARG ADD_LIBS

RUN sed -i -e 's/run sha1sum --status/run sha1sum -s/g'  ${SDC_DIST}/libexec/_stagelibs

RUN if [ "$ADD_LIBS" != "" ]; then ${SDC_DIST}/bin/streamsets stagelibs -install=${ADD_LIBS}; fi

COPY docker-entrypoint.sh /
ENTRYPOINT ["/docker-entrypoint.sh"]
CMD ["dc", "-exec"]


  • The fix for running the stagelibs command under Alpine Linux (GitHub Ticket) is applied.
  • If the ADD_LIBS variable is used at build time, the stagelibs command is run.
  • An extended version of the docker-entrypoint.sh file is invoked.


The new version of docker-entrypoint.sh includes one additional line (the if statement shown below) to run the stagelibs command in case the ADD_LIBS variable is set at run time:

...
if [ "$ADD_LIBS" != "" ]; then ${SDC_DIST}/bin/streamsets stagelibs -install="${ADD_LIBS}"; fi

exec "${SDC_DIST}/bin/streamsets" "$@"

So how can it be used?

Installing stage libraries at build time

First, let’s see how to use the automatic install at build time to create an SDC Docker image containing the Kafka 0.10 and Kudu 0.9 stage libraries.

Get the source from the GitHub repository and navigate into the streamsets folder:

git clone https://github.com/TrivadisBDS/dockerfiles
cd dockerfiles/streamsets

You have to pass the full name of the stage libraries through the build-time variable ADD_LIBS:

docker build \
     --build-arg ADD_LIBS="streamsets-datacollector-apache-kafka_0_10-lib,streamsets-datacollector-apache-kudu-0_9-lib" \
     -t trivadisbds/streamsets-datacollector-kafka-kudu:latest .

Now the new Docker image can be used just like the standard StreamSets Docker image:

docker run -p 18630:18630 -d --name sdc trivadisbds/streamsets-datacollector-kafka-kudu:latest

This results in a reusable StreamSets Data Collector image that includes the two optional stage libraries, Kafka and Kudu. It can be started as often as needed.
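To verify which stage libraries actually ended up in the image, you can list the contents of the streamsets-libs directory inside the running container (a quick check, assuming the stage libraries are installed below $SDC_DIST/streamsets-libs):

docker exec sdc bash -c 'ls $SDC_DIST/streamsets-libs'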

Installing stage libraries at run time

The enhanced Docker image can also be used to install additional libraries at run time, before actually starting SDC itself. Run the Docker image and specify the stage libraries through the ADD_LIBS environment variable:

docker run -p 18630:18630 -d --name sdc -e ADD_LIBS="streamsets-datacollector-apache-kafka_0_10-lib,streamsets-datacollector-apache-kudu-0_9-lib" trivadisbds/streamsets-datacollector:latest

Because the stage libraries are now installed before SDC itself starts, it will take considerably longer until SDC is available. You can easily check the progress with the logs command:

docker logs -f sdc
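If you start the container from a script, a simple polling loop can wait until the web UI responds before continuing (a minimal sketch; adjust host, port, and sleep interval to your environment):

until curl -sf http://localhost:18630 > /dev/null; do
  echo "waiting for SDC to come up ..."
  sleep 5
done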

This allows us to flexibly run any configuration of a containerized StreamSets Data Collector, installing the necessary stage libraries on demand.

Summary

We have seen both the manual and the automatic way of using additional stage libraries with a dockerized StreamSets Data Collector deployment.

To produce reproducible instances of SDC, you will of course want to use the automatic approach.

  • The build-time approach allows you to create different configurations of SDC, for example because you want to run it at different places in your architecture (gateway, data center, …).
  • The run-time approach allows you to configure the shape of your SDC instance at start time, offering the highest flexibility at the cost of additional startup time.
  • The build-time approach, on the other hand, provides quick startup and increased stability, because everything is “baked” into dedicated Docker images.

I will ask StreamSets whether they could support the automatic install in the “official” Docker image, so that my version would no longer be necessary. But in the meantime it’s a nice thing to have 🙂

StreamSets Data Collector – New Package Manager in Action

Update 25.10.2016: added link to new blog article showing how to install additional stage libraries when working with a dockerized instance of StreamSets.

A few days ago, a new version of StreamSets Data Collector, v2.1.0.0, was announced. It contains a couple of interesting new features.

This blog article shows how to use the new Package Manager to install only the StreamSets Data Collector stage libraries (origins, processors, and destinations) you actually need.

The new StreamSets version provides two distributions:

  • a full version, as with previous releases, containing everything, and
  • a smaller, customizable download with only the core version of Data Collector (currently only available as a tarball and as the Docker image; package options will be added later)

After installation of the core version, only a few of the stage libraries are available:

[Screenshot: stage libraries available after installing the core distribution]

The customization of the core Data Collector package can be done in three ways:

  1. Using the User Interface
  2. Using the CLI
  3. Using the RESTful API

Let’s see them in action…

1. Installing using the User Interface

The user interface contains a new icon in the menu bar for opening the Package Manager:

[Screenshot: Package Manager icon in the menu bar]

A new screen shows the available stage libraries that can be installed, and also indicates which of the libraries are already (pre-)installed.

[Screenshot: list of available stage libraries in the Package Manager]

Let’s add the latest Kafka library by selecting the Apache Kafka 0.10.0.0 item and then clicking the + icon to install it.

[Screenshot: installing the Apache Kafka 0.10.0.0 stage library]

After the install finishes, you have to restart StreamSets Data Collector; then Kafka will be available.

2. Installing using the CLI

The second option is to use the stagelibs command to install the additional libraries.

Note: The stagelibs command requires that curl (version 7.18.1 or later) and the sha1sum utility are installed on the machine. Verify that these utilities are available before running the command.

The available libraries can be found in the StreamSets documentation or by using the following command from the command line:

$SDC_DIST/bin/streamsets stagelibs -list

This provides a list of all available stage libraries and also whether they are already installed.

[Screenshot: output of the stagelibs -list command]
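As the list is quite long, it can be handy to filter the output, for example to show only the Kafka-related stage libraries:

$SDC_DIST/bin/streamsets stagelibs -list | grep -i kafka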

To install one or more stage libraries, use the following command from the command line, here again installing the Kafka 0.10.x library:

$SDC_DIST/bin/streamsets stagelibs \
                   -install=streamsets-datacollector-apache-kafka_0_10-lib

Use the full name of the libraries that you want to install, as shown in the list above. You can install multiple libraries, separating them with commas. Do not include spaces in the command.
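For example, to install both the Kafka 0.10 and the Kudu 0.9 stage libraries in a single command:

$SDC_DIST/bin/streamsets stagelibs \
                   -install=streamsets-datacollector-apache-kafka_0_10-lib,streamsets-datacollector-apache-kudu-0_9-lib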

You can also use the stagelibs command to generate a script reflecting the necessary install commands to replicate an existing StreamSets instance.

$SDC_DIST/bin/streamsets stagelibs -installScript
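To replicate the setup on another machine, you could, for instance, redirect the generated script into a file and run it on the target instance (a sketch; I have not verified the exact output format of -installScript):

$SDC_DIST/bin/streamsets stagelibs -installScript > install-stagelibs.sh
# copy install-stagelibs.sh to the target machine, then run:
sh install-stagelibs.sh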

3. Installing using the RESTful API

The third and last option I discovered by accident: the StreamSets RESTful API can be used to manage the stage libraries as well.

You can reach the documentation page of the API by selecting the RESTful API item available in the help menu.

[Screenshot: RESTful API item in the help menu]

On the Data Collector RESTful API overview page, navigate to the definitions group of operations.

[Screenshot: Data Collector RESTful API overview page]

There are several operations, one for listing available stage libraries and one for installing/uninstalling stage libraries.

To install the Kafka 0.10.x library, click on the /v1/stageLibraries/install operation and add the full name of the stage library to the body field:

[Screenshot: installing a stage library via the RESTful API]
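Written out, the body for installing the Kafka 0.10.x library looks like this:

["streamsets-datacollector-apache-kafka_0_10-lib"]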

The body has to be a JSON array, so you have to provide the name of the stage library as a string inside brackets. Of course you can also specify multiple libraries. Click on try it out to install the additional library. The service should answer with response code 200. To activate the library, StreamSets Data Collector has to be restarted.

Summary

In this article I have presented three options for installing additional stage libraries on the StreamSets core distribution.

I really like the idea of a StreamSets core and being able to customize the additional libraries as needed. This simplifies things from a usability perspective, as a user now only sees what they should use. In the case of Kafka, there might be only one or two versions installed and visible, instead of four (as in the current full version). I hope this also reduces the runtime footprint of StreamSets, which is of course important when running it on IoT gateway type hardware, such as a Raspberry Pi.

So which one of the 3 options should be used?

That of course depends on your requirements. The first option is a purely manual approach, whereas the second and third can be used to automate the deployment.

This blog article shows how to handle additional stage libraries when provisioning StreamSets through Docker; with StreamSets 2.1.0.x, the Docker image is no longer a full distribution but is based on the core tarball only.


Last week in Stream Processing & Analytics 10/17/2016

This is the 36th installment of my blog series around Stream Processing and Analytics.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Comparison

Apache Kafka

Apache Storm / Heron

Apache Spark Streaming

Apache Flink

StreamSets

Apache NiFi / Hortonworks Data Flow (HDF)

Azure Stream Analytics

New Presentations

New Videos

New Podcasts

New Releases

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!

Last week in Stream Processing & Analytics 10/11/2016

This is the 35th installment of my blog series around Stream Processing and Analytics.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Apache Kafka

Apache Flink

StreamSets

Apache NiFi / Hortonworks Data Flow (HDF)

New Presentations

New Videos

New Podcasts

New Releases

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!

Last week in Stream Processing & Analytics 10/3/2016

This is the 34th installment of my blog series around Stream Processing and Analytics.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Comparison

Spark Streaming

Apache Kafka

Apache Flink

Concord

Oracle Stream Analytics

StreamSets

Apache NiFi / Hortonworks Data Flow (HDF)

New Presentations

New Videos

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!