Additional Stage Libraries with dockerized StreamSets


In my previous article New Package Manager in Action I showed three methods for installing additional stage libraries onto a StreamSets Data Collector core distribution.

In this article I will demonstrate how these methods can be applied to the Docker distribution of StreamSets Data Collector (SDC). Starting with SDC v2.1.0.1, the public Docker image on Docker Hub is no longer based on the full version but on the smaller, customizable core version.

So how can we use additional stage libraries with a dockerized SDC? There are basically two approaches:

  1. Manual installation: using one of the three alternatives shown in my previous article New Package Manager in Action.
  2. Automatic installation: using a derived Docker image that installs the needed stage libraries automatically, either at build time or at run (start) time.

1. Manual Installation

For the manual approach we can use the three alternative ways shown in the previous blog article:

Using the UI

Start a docker container:

docker run -p 18630:18630 -d --name sdc streamsets/datacollector:2.1.0.1

Open StreamSets Data Collector on http://localhost:18630 and navigate to Package Manager http://localhost:18630/collector/packageManager. Install the required stage libraries and then restart the docker container:

docker restart sdc

Using the CLI

Start a container the same way as above:

docker run -p 18630:18630 -d --name sdc streamsets/datacollector:2.1.0.1

To use the CLI, create a new bash session in the running sdc container:

docker exec -ti sdc bash

Inside the bash shell you can install all the necessary stage libraries using the streamsets stagelibs command. Unfortunately there is a problem with the stagelibs script when running on Alpine Linux, which the StreamSets Docker image is based on (GitHub Ticket). The following command changes the problematic --status flag to -s:

sed -i -e 's/run sha1sum --status/run sha1sum -s/g'  $SDC_DIST/libexec/_stagelibs

After that fix, the stagelibs command should work:

$SDC_DIST/bin/streamsets stagelibs -install=streamsets-datacollector-apache-kafka_0_10-lib

In order to activate the libraries, exit the bash session and restart the container:

docker restart sdc

Using the RESTful API

Again start a docker container:

docker run -p 18630:18630 -d --name sdc streamsets/datacollector:2.1.0.1

Navigate to http://localhost:18630/collector/restapi to open the RESTful API page. Here you can add the needed stage libraries as documented in my previous blog “3. Installing using the RESTful API”.
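Such a REST call can also be scripted. The sketch below is only an illustration: the endpoint path, the default admin:admin credentials, and the X-Requested-By header value are assumptions on my part, not taken from the restapi page; check the endpoint list of your SDC version for the exact install call.

```shell
#!/bin/sh
# Hypothetical sketch of installing a stage library via the SDC REST API.
# Endpoint path, credentials and header value are assumptions -- verify
# them against the restapi page of your SDC version.
SDC_URL="${SDC_URL:-http://localhost:18630}"
LIBS='["streamsets-datacollector-apache-kafka_0_10-lib"]'

install_libs() {
  # SDC rejects state-changing requests that lack an X-Requested-By header
  curl -s -u admin:admin -X POST \
       -H "X-Requested-By: sdc" \
       -H "Content-Type: application/json" \
       -d "$LIBS" \
       "${SDC_URL}/rest/v1/stageLibraries/install"
}
```

After sourcing the script, run install_libs against the running container, then restart it as shown below.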

In order to activate the libraries, restart the docker container:

docker restart sdc

2. Automatic Installation

For the automatic installation, I have created a new Docker image, which extends the StreamSets Docker image and adds the necessary commands to install additional stage libraries. It uses an ADD_LIBS variable which can be specified either at build time or at run time.

The source of this extended Docker image can be found in the TrivadisBDS GitHub repository. The built image is available on Docker Hub.

The Dockerfile below shows the extension made to the standard StreamSets Docker image:

FROM streamsets/datacollector:2.1.0.1
MAINTAINER Guido Schmutz <guido.schmutz@trivadis.com>

ARG ADD_LIBS

RUN sed -i -e 's/run sha1sum --status/run sha1sum -s/g'  ${SDC_DIST}/libexec/_stagelibs

RUN if [ "$ADD_LIBS" != "" ]; then ${SDC_DIST}/bin/streamsets stagelibs -install=${ADD_LIBS}; fi

COPY docker-entrypoint.sh /
ENTRYPOINT ["/docker-entrypoint.sh"]
CMD ["dc", "-exec"]

 

  • The fix for running the stagelibs command under Alpine Linux (GitHub Ticket) is applied.
  • If the ADD_LIBS variable is set at build time, the stagelibs command is run to install the specified libraries.
  • An extended version of the docker-entrypoint.sh file is invoked.

 

The new version of docker-entrypoint.sh includes an additional line (the if statement below) that runs the stagelibs command in case the ADD_LIBS variable is set at run time.

...
if [ "$ADD_LIBS" != "" ]; then ${SDC_DIST}/bin/streamsets stagelibs -install="${ADD_LIBS}"; fi

exec "${SDC_DIST}/bin/streamsets" "$@"

So how can it be used?

Installing stage libraries at build time

First let’s see how to use the automatic install at build time and create an SDC Docker image containing the Kafka 0.10 and Kudu 0.9 stage libraries.

Get the source from the GitHub repository and navigate into the streamsets folder:

git clone https://github.com/TrivadisBDS/dockerfiles
cd dockerfiles/streamsets

You have to pass the full names of the stage libraries, comma-separated, through the build-time variable ADD_LIBS:

docker build \
     --build-arg ADD_LIBS="streamsets-datacollector-apache-kafka_0_10-lib,streamsets-datacollector-apache-kudu-0_9-lib" \
     -t trivadisbds/streamsets-datacollector-kafka-kudu:latest .

Now the new Docker image can be used just like the standard StreamSets Docker image:

docker run -p 18630:18630 -d --name sdc trivadisbds/streamsets-datacollector-kafka-kudu:latest

This results in a reusable StreamSets Data Collector image that includes the two optional stage libraries for Kafka and Kudu. It can be started as many times as needed.
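To double-check what ended up in the image, you can list the stage libraries inside a running container. The helper below is a sketch: it relies on the stagelibs -list sub-command of the streamsets script and takes the container name as its first argument.

```shell
#!/bin/sh
# Sketch: list the stage libraries of a running SDC container.
# Single quotes ensure ${SDC_DIST} is expanded inside the container,
# not on the host.
list_stagelibs() {
  docker exec "$1" sh -c '${SDC_DIST}/bin/streamsets stagelibs -list'
}
```

For the container started above, run: list_stagelibs sdc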

Installing stage libraries at run time

The enhanced docker image can also be used to install additional libraries at run-time, before actually starting SDC itself. Run the docker image and specify the stage libraries through the ADD_LIBS environment variable.

docker run -p 18630:18630 -d --name sdc -e ADD_LIBS="streamsets-datacollector-apache-kafka_0_10-lib,streamsets-datacollector-apache-kudu-0_9-lib" trivadisbds/streamsets-datacollector:latest

As these stage libraries are now installed before SDC itself starts, it will take considerably longer until SDC is available. You can easily check the progress with the logs command:

docker logs -f sdc
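Instead of tailing the logs, a small wait loop can poll the web UI until SDC answers. The URL and timeout defaults below are illustrative assumptions for a local container, not part of the image:

```shell
#!/bin/sh
# Sketch: block until SDC answers on its web port, or give up.
SDC_URL="${SDC_URL:-http://localhost:18630}"

wait_for_sdc() {
  timeout="${1:-120}"              # seconds before giving up
  while [ "$timeout" -gt 0 ]; do
    # -f makes curl fail on HTTP errors, so we only succeed once the UI is up
    if curl -sf -o /dev/null "$SDC_URL"; then
      echo "SDC is up"
      return 0
    fi
    sleep 2
    timeout=$((timeout - 2))
  done
  echo "timed out waiting for SDC" >&2
  return 1
}
```

Calling wait_for_sdc (optionally with a timeout in seconds) then returns as soon as the container is ready to serve requests.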

This allows us to flexibly run any configuration of a containerized StreamSets Data Collector, installing the necessary stage libraries on demand.

Summary

We have seen both the manual and the automatic way of using additional stage libraries with a dockerized StreamSets Data Collector deployment.

To produce reproducible instances of SDC, you will of course want to use the automatic approach.

  • The build-time approach allows you to create different configurations of SDC, for example because you want to run it at different places in your architecture (gateway, datacenter, …).
  • The run-time approach allows you to configure the shape of your SDC instance at start time, offering the highest flexibility at the cost of additional start-up time.
  • The build-time approach, on the other hand, provides quick start-up times and increased stability, because everything is “baked” into the image.

I will ask StreamSets if they could support the automatic install in the “official” Docker image, so that my version would no longer be necessary. But in the meantime it’s a nice thing to have 🙂