Additional Stage Libraries with dockerized StreamSets

In my previous article, New Package Manager in Action, I showed three possible methods for installing additional stage libraries onto a StreamSets Data Collector core distribution.

In this article I will demonstrate how this can be used with the Docker distribution of StreamSets Data Collector (SDC). Starting with SDC v2.1.0.1, the public Docker image on Docker Hub is no longer based on the full version but on the smaller, customizable core version.

So how can we use additional stage libraries with a dockerized SDC? There are basically two ways to deal with it.

  1. Manual installation: using one of the three alternative ways shown in my previous article New Package Manager in Action.
  2. Automatic installation: using a derived Docker image that installs the needed stage libraries automatically, either at build time or at run (start) time.

1. Manual Installation

For the manual approach we can use the three alternative ways shown in the previous blog article:

Using the UI

Start a docker container:

docker run -p 18630:18630 -d --name sdc streamsets/datacollector:2.1.0.1

Open StreamSets Data Collector on http://localhost:18630 and navigate to Package Manager http://localhost:18630/collector/packageManager. Install the required stage libraries and then restart the docker container:

docker restart sdc

Using the CLI

Start a container the same way as above:

docker run -p 18630:18630 -d --name sdc streamsets/datacollector:2.1.0.1

To use the CLI, create a new bash session in the running sdc container:

docker exec -ti sdc bash

Inside the bash shell you can install all the necessary stage libraries using the streamsets stagelibs command. Unfortunately there is a problem with the stagelibs script when running on Alpine Linux, which the StreamSets Docker image is based on (GitHub Ticket). The following command changes the problematic --status argument to -s:

sed -i -e 's/run sha1sum --status/run sha1sum -s/g'  $SDC_DIST/libexec/_stagelibs
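To see what the substitution does without touching a container, you can run the same sed expression over a sample line:

```shell
# Demonstrate the substitution on a sample line (no container needed).
line='run sha1sum --status "${filename}"'
fixed=$(echo "$line" | sed -e 's/run sha1sum --status/run sha1sum -s/g')
echo "$fixed"   # run sha1sum -s "${filename}"
```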

After that fix, the stagelibs command should work:

$SDC_DIST/bin/streamsets stagelibs -install=streamsets-datacollector-apache-kafka_0_10-lib

In order to activate the libraries, exit the bash session and restart the container:

docker restart sdc

Using the RESTful API

Again start a docker container:

docker run -p 18630:18630 -d --name sdc streamsets/datacollector:2.1.0.1

Navigate to http://localhost:18630/collector/restapi to open the RESTful API page. Here you can add the needed stage libraries as documented in section “3. Installing using the RESTful API” of my previous blog post.
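The same installation can also be scripted with curl. The following is a sketch only: the endpoint path, the X-Requested-By header, and the default admin:admin credentials are assumptions that you should verify against the REST API page of your SDC version.

```shell
# Hypothetical sketch: install a stage library through the SDC REST API.
# Endpoint path and credentials are assumptions -- verify them on the
# /collector/restapi page of your SDC version.
curl -u admin:admin \
     -X POST \
     -H "Content-Type: application/json" \
     -H "X-Requested-By: sdc" \
     -d '["streamsets-datacollector-apache-kafka_0_10-lib"]' \
     "http://localhost:18630/rest/v1/stageLibraries/install"
```

As with the other two methods, a restart of the container is needed afterwards to activate the library.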

In order to activate the libraries, restart the docker container:

docker restart sdc

2. Automatic Installation

For the automatic installation, I have created a new Docker image which extends the StreamSets Docker image and adds the necessary commands to install additional stage libraries. It uses a variable that can be specified either at build time or at run time.

The source of this extended Docker image can be found in the TrivadisBDS GitHub repository. The built image is available on Docker Hub.

The Dockerfile below shows the extension done to the standard StreamSets docker image:

FROM streamsets/datacollector:2.1.0.1
MAINTAINER Guido Schmutz <guido.schmutz@trivadis.com>

ARG ADD_LIBS

RUN sed -i -e 's/run sha1sum --status/run sha1sum -s/g'  ${SDC_DIST}/libexec/_stagelibs

RUN if [ "$ADD_LIBS" != "" ]; then ${SDC_DIST}/bin/streamsets stagelibs -install=${ADD_LIBS}; fi

COPY docker-entrypoint.sh /
ENTRYPOINT ["/docker-entrypoint.sh"]
CMD ["dc", "-exec"]

 

  • The fix for running the stagelibs command under Alpine Linux (GitHub Ticket) is applied.
  • If the ADD_LIBS variable is set at build time, the stagelibs command is run.
  • An extended version of the docker-entrypoint.sh file is invoked.

 

The new version of docker-entrypoint.sh includes an additional line to run the stagelibs command in case the ADD_LIBS variable is set at run time:

...
if [ "$ADD_LIBS" != "" ]; then ${SDC_DIST}/bin/streamsets stagelibs -install="${ADD_LIBS}"; fi

exec "${SDC_DIST}/bin/streamsets" "$@"
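Putting it together, the extended entrypoint might look roughly like this. This is a sketch only, assuming the original entrypoint simply execs the streamsets binary with the container's CMD; the actual script in the GitHub repository may contain additional setup.

```shell
#!/bin/sh
# Sketch of the extended docker-entrypoint.sh (assumption: the original
# entrypoint just execs the streamsets binary with the container CMD).
set -e

# If ADD_LIBS is set, install the requested stage libraries before
# starting the Data Collector itself.
if [ "$ADD_LIBS" != "" ]; then
    "${SDC_DIST}/bin/streamsets" stagelibs -install="${ADD_LIBS}"
fi

# Replace the shell with the SDC process ("dc" "-exec" by default),
# so SDC runs as PID 1 and receives container signals directly.
exec "${SDC_DIST}/bin/streamsets" "$@"
```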

So how can it be used?

Installing stage libraries at build time

First let’s see how to use the automatic install at build time and create an SDC Docker image containing the Kafka 0.10 and Kudu 0.9 stage libraries.

Get the source from the GitHub repository and navigate into the streamsets folder:

git clone https://github.com/TrivadisBDS/dockerfiles
cd dockerfiles/streamsets

You have to pass the full name of the stage libraries through the build-time variable ADD_LIBS:

docker build \
     --build-arg ADD_LIBS="streamsets-datacollector-apache-kafka_0_10-lib,streamsets-datacollector-apache-kudu-0_9-lib" \
     -t trivadisbds/streamsets-datacollector-kafka-kudu:latest .

Now the new Docker image can be used just like the standard StreamSets Docker image:

docker run -p 18630:18630 -d --name sdc trivadisbds/streamsets-datacollector-kafka-kudu:latest

This results in a reusable StreamSets Data Collector image including the two optional stage libraries for Kafka and Kudu. It can be started as often as needed.
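To verify that the libraries really ended up in the image, you can list the stage library directory inside the running container. The streamsets-libs location is an assumption based on the default SDC directory layout:

```shell
# List the installed stage libraries inside the running container.
# $SDC_DIST must be resolved inside the container, hence the sh -c.
docker exec sdc sh -c 'ls "$SDC_DIST/streamsets-libs"'
```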

Installing stage libraries at run time

The enhanced docker image can also be used to install additional libraries at run-time, before actually starting SDC itself. Run the docker image and specify the stage libraries through the ADD_LIBS environment variable.

docker run -p 18630:18630 -d --name sdc -e ADD_LIBS="streamsets-datacollector-apache-kafka_0_10-lib,streamsets-datacollector-apache-kudu-0_9-lib" trivadisbds/streamsets-datacollector:latest

As these stage libraries are now installed before SDC itself starts, it takes considerably longer until SDC is available. You can easily check the progress with the logs command:

docker logs -f sdc
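If a follow-up script needs to wait for SDC, a simple polling loop against the web port does the job. This is a sketch; host, port and timeout are assumptions matching the docker run command above:

```shell
# Poll the SDC web UI until it responds, or give up after ~5 minutes.
for i in $(seq 1 60); do
    if curl -sf http://localhost:18630 > /dev/null; then
        echo "SDC is up"
        break
    fi
    sleep 5
done
```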

This allows us to flexibly run any configuration of a containerized StreamSets Data Collector, installing the necessary stage libraries on demand.

Summary

We have seen both the manual and the automatic way of using additional stage libraries with a dockerized StreamSets Data Collector deployment.

For reproducible instances of SDC you will of course want to use the automatic approach.

  • The build-time approach allows you to create different configurations of SDC, for example because you want to run it at different places in your architecture (gateway, datacenter, …).
  • The run-time approach allows you to configure the shape of your SDC instance at start time, offering the highest flexibility at the cost of additional start-up time.
  • The build-time approach, on the other hand, provides quick start-up time and increased stability, because everything is “baked” into the image.

I will ask StreamSets whether they could support the automatic install in the “official” Docker image, so that my version would no longer be necessary. But in the meantime it’s a nice thing to have 🙂

Providing Oracle Stream Analytics 12c environment using Docker

Over the past two days I spent some time upgrading the Docker support I had created for Oracle Stream Explorer to work with Oracle Stream Analytics (which is the new name for Oracle Stream Explorer).

I guess I don’t have to introduce Docker anymore, it’s so common today!

Preparation

You can find the corresponding docker project on my GitHub: https://github.com/gschmutz/dockerfiles

Due to the Oracle licensing agreement, the Oracle software itself cannot be provided in the GitHub project. Therefore it’s also not possible to upload a built image to Docker Hub.

So you first have to download the Java 8 SDK as well as the Stream Analytics runtime using your own OTN login. Download these two artifacts into the oracle-stream-analytics/dockerfiles/12.2.1/downloads folder.

Building the Oracle Stream Analytics Docker Install image

Navigate to the dockerfiles folder and run the buildDockerImage.sh script as root:

$ sh buildDockerImage.sh -v 12.2.1 -A

This will take a while if run for the first time, as it downloads the oracle-linux base image first. At the end you should see a message similar to the one below:

  WebLogic Docker Image for 'standalone' version 12.2.1 is ready to be extended: 
    
    --> gschmutz/oracle-osa:12.2.1-standalone

  Build completed in 171 seconds.

It indicates that the OSA base docker image has been built successfully.

Be aware: this image is not yet executable, it only contains the software without any domain.

Building an Oracle Stream Analytics Standalone domain

In order to use Oracle Stream Analytics, we have to build a domain. This can be done using Docker as well, extending the Oracle Stream Analytics image created above and creating an OSA domain. Currently there is one sample Dockerfile available in the samples folder which creates an Oracle Stream Analytics Standalone domain. In the future this will be enhanced with a domain connecting to Spark.

To build the 12.2.1 standalone domain, navigate to folder samples/1221-domain and run the following command (use the OSA_PASSWORD parameter to specify the OSA user password):

$ docker build -t 1221-domain --build-arg OSA_PASSWORD=<define> .

There are other build arguments you can use to overwrite the default values of the Oracle Stream Analytics Standalone domain. They are documented in the GitHub project here.

Verify you now have this image in place with:

$ docker images

Running Oracle Stream Analytics server

To start the Oracle Stream Analytics server, you can simply run the docker run -d 1221-domain command. The sample Dockerfile defines startwlevs.sh as the default CMD:

$ docker run -d --name=osa -p 9002:9002 1221-domain

Check the logs by entering:

$ docker logs -f osa

After a couple of seconds, the OSA server should be up and running and you can access the Oracle Stream Analytics Web Console at http://localhost:9002/sx.

Connect with user osaadmin and the password you specified above.

Providing Oracle Stream Explorer environment using Docker

In the past week I have been experimenting with installing Oracle Stream Explorer into a Docker container, in order to simplify provisioning development/show case environments with a single docker run command. 

Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications. Consisting of Docker Engine, a portable, lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and automating workflows, Docker enables apps to be quickly assembled from components and eliminates the friction between development, QA, and production environments. As a result, IT can ship faster and run the same app, unchanged, on laptops, data center VMs, and any cloud.

You can find the corresponding docker project on my GitHub: https://github.com/gschmutz/docker-oracle-sx

Due to the Oracle licensing agreement, the Oracle software itself cannot be provided in the GitHub project. So you first have to download the Java 7 SDK as well as the Stream Explorer runtime and the Stream Explorer User Experience using your own OTN login. Download the three artifacts into the downloads subfolder.

After downloading these files into the downloads folder, you are ready to build the Docker image:

cd docker-oracle-sx
docker build -t gschmutz/docker-oracle-sx:12.1.3 .

This will take a while if run for the first time, as it downloads the oracle-linux base image first. At the end you should see a “Successfully built xxxxxxxx” message, indicating that the Docker image has been built successfully.

Unfortunately the domain creation wizard cannot be run automatically, therefore the domain has been pre-created and is provided by the docker-oracle-sx project. This domain is named sx_domain and is copied into the Docker image at build time.

Now let’s run the container:

docker run -d -p 9002:9002 gschmutz/docker-oracle-sx:12.1.3

With the -p option we expose port 9002 from the Docker container to the host machine. With that, Oracle Stream Explorer is available under http://docker-host-ip:9002/sx. Connect with user wlevs and password welcome1.

The Oracle Event Processing Console is available under http://docker-host-ip:9002/wlevs.  

Happy stream exploring 🙂