StreamSets Data Collector – New Package Manager in Action


Update 25.10.2016: added link to new blog article showing how to install additional stage libraries when working with a dockerized instance of StreamSets.

A few days ago, a new version of StreamSets Data Collector, v2.1.0.0, was announced. It contains a couple of interesting new features.

This blog article shows how to use the new Package Manager to install only the stage libraries (origins, processors and destinations) you actually need in StreamSets Data Collector.

The new StreamSets version provides two distributions:

  • a full version, as with previous releases, containing all stage libraries, and
  • a smaller, customizable download with only the core version of Data Collector (currently available only as a tarball and as the Docker image; package options will be added later)

After installation of the core version, only a few of the stage libraries are available:

streamsets-core

The customization of the core Data Collector package can be done in 3 ways:

  1. Using the User Interface
  2. Using the CLI
  3. Using the RESTful API

Let’s see them in action.

1. Installing using the User Interface

The user interface contains a new icon in the menu bar for opening the Package Manager:

streamsets-package-manager

A new screen lists the available stage libraries and shows which of them are already installed.

streamsets-packages

Let’s add the latest Kafka library by selecting the Apache Kafka 0.10.0.0 item and then clicking the + icon to install it.

streamsets-packages-install

After the installation finishes, restart StreamSets Data Collector and the Kafka stages will be available.

2. Installing using the CLI

The second option is the stagelibs command, which installs additional libraries from the command line.

Note: The stagelibs command requires that curl version 7.18.1 or later and sha1sum utilities are installed on the machine. Verify that these utilities are installed before running the command.

The available libraries can be found in the StreamSets documentation or by using the following command from the command line:

$SDC_DIST/bin/streamsets stagelibs -list

This provides a list of all available stage libraries and also whether they are already installed.

streamsets-libraries-list.jpg

To install one or more stage libraries, use the following command from the command line, here again installing the Kafka 0.10.x library:

$SDC_DIST/bin/streamsets stagelibs \
                   -install=streamsets-datacollector-apache-kafka_0_10-lib

Use the full name of the libraries that you want to install, as shown in the list above. You can install multiple libraries, separating them with commas. Do not include spaces in the command.
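To illustrate the comma-separated form, here is a hypothetical sketch installing two libraries at once. The JDBC library name is an assumption on my side — confirm the exact names on your installation with stagelibs -list first:

```shell
# Comma-separated library list -- no spaces are allowed between the names.
# streamsets-datacollector-jdbc-lib is an assumed name; verify it via `stagelibs -list`.
LIBS=streamsets-datacollector-apache-kafka_0_10-lib,streamsets-datacollector-jdbc-lib

# Run the install only where a Data Collector installation is actually present.
if [ -x "$SDC_DIST/bin/streamsets" ]; then
    "$SDC_DIST/bin/streamsets" stagelibs -install="$LIBS"
fi
```

As with the UI option, a restart of Data Collector is needed afterwards.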

You can also use the stagelibs command to generate a script reflecting the necessary install commands to replicate an existing StreamSets instance.

$SDC_DIST/bin/streamsets stagelibs -installScript
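A sketch of how this could be used for replication — redirect the generated commands into a file on the existing instance, then replay it on a fresh core installation (the file name install-libs.sh is my own choice):

```shell
# On the existing instance: capture the installed stage libraries as a
# replayable script of install commands.
"$SDC_DIST/bin/streamsets" stagelibs -installScript > install-libs.sh \
  || echo "streamsets binary not found under \$SDC_DIST"

# Then copy install-libs.sh to the fresh core installation and run it there:
#   sh install-libs.sh
```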

3. Installing using the RESTful API

The third and last option I discovered by accident: the StreamSets RESTful API can be used to manage the stage libraries as well.

You can reach the documentation page of the API by selecting the RESTful API item available in the help menu.

streamsets-open-restful

On the Data Collector RESTful API overview page, navigate to the definitions operations.

streamsets-overview-restful.jpg

There are several operations, one for listing available stage libraries and one for installing/uninstalling stage libraries.

To install the Kafka 0.10.x library, click on the /v1/stageLibraries/install operation and add the full name of the stage library to the body field:

streamsets-install-lib-restful

The body has to be a JSON array, so provide the name of the stage library as a string inside brackets. You can also specify multiple libraries this way. Click on try it out to install the additional library; the service should answer with response code 200. To activate the library, StreamSets Data Collector has to be restarted.
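For scripting, the same operation can be called with curl instead of the try-it-out button. A minimal sketch, with several assumptions on my side: the default port 18630, the default admin/admin credentials, the operation being served under the /rest prefix, and the X-Requested-By header for Data Collector's CSRF protection — check the RESTful API page of your instance for the exact path:

```shell
# Assumed defaults: port 18630, admin/admin credentials.
SDC_URL="${SDC_URL:-http://localhost:18630}"
# JSON array of full stage library names, as shown on the RESTful API page.
PAYLOAD='["streamsets-datacollector-apache-kafka_0_10-lib"]'

# X-Requested-By is assumed to be required by Data Collector's CSRF protection.
curl -s -u admin:admin \
     -X POST "$SDC_URL/rest/v1/stageLibraries/install" \
     -H "Content-Type: application/json" \
     -H "X-Requested-By: sdc" \
     -d "$PAYLOAD" \
  || echo "Data Collector not reachable at $SDC_URL"
```

As with the other two options, restart Data Collector after a successful install.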

Summary

In this article I have presented 3 options for installing additional stage libraries on the StreamSets core distribution.

I really like the idea of a StreamSets core distribution with the ability to add libraries as needed. This simplifies things from a usability perspective, because users now only see the stages they should use. In the case of Kafka, there might be only one or two versions installed and visible, instead of the four that currently ship with the full version. I hope this also reduces the runtime footprint of StreamSets, which of course matters when running it on IoT-gateway-class hardware such as a Raspberry Pi.

So which of the 3 options should be used?

This of course depends on your requirements. The first option is a purely manual approach, whereas the second and third can be used to automate the deployment.

This blog article shows how to handle additional stage libraries when provisioning StreamSets through Docker. With StreamSets 2.1.0.x, the Docker image is no longer a full distribution but is based on the core tarball only.