Update 25.10.2016: added link to new blog article showing how to install additional stage libraries when working with a dockerized instance of StreamSets.
A few days ago, a new version of StreamSets Data Collector, v2.1.0.0, was announced. It contains a couple of interesting new features.
This blog article shows how to use the new Package Manager to install only the stage libraries (origins, processors and destinations) you actually need in StreamSets Data Collector.
The new StreamSets version provides two distributions:
- a full version as with the previous versions, containing everything and
- a smaller customizable download, with only the core version of Data Collector (currently only available as a tarball and with the Docker image, package options will be added later)
After installation of the core version, only a few of the stage libraries are available:
The customization of the core Data Collector package can be done in 3 ways:
- Using the User Interface
- Using the CLI
- Using the RESTful API
Let’s see them in action.
1. Installing using the User Interface
The user interface contains a new icon in the menu bar for opening the Package Manager:
A new screen shows the available stage libraries, which can be installed. You can also see which of the libraries are (pre-)installed.
Let’s add the latest Kafka library by selecting the Apache Kafka 0.10.0.0 item and then clicking on the + icon to install it.
After the install finishes, you have to restart StreamSets Data Collector and Kafka will be available.
2. Installing using the CLI
The second option is to use the stagelibs command to install the additional libraries.
Note: The stagelibs command requires that curl version 7.18.1 or later and sha1sum utilities are installed on the machine. Verify that these utilities are installed before running the command.
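The prerequisite check from the note above could be scripted roughly as follows. This is only a sketch: the function names version_ge and check_stagelibs_prereqs are made up for this example, and it assumes a sort that supports the -V (version sort) option, as GNU coreutils does.

```shell
#!/bin/sh
# Sketch: verify the prerequisites of the stagelibs command.
# Function names are hypothetical; `sort -V` (GNU coreutils) is assumed.

# Succeeds if dotted version $1 is greater than or equal to $2.
version_ge() {
  # After a version sort, the smaller version comes first;
  # if that is $2, then $1 >= $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

check_stagelibs_prereqs() {
  required="7.18.1"
  command -v sha1sum >/dev/null 2>&1 || { echo "sha1sum not found" >&2; return 1; }
  command -v curl >/dev/null 2>&1 || { echo "curl not found" >&2; return 1; }
  # The first line of `curl --version` starts with "curl <version> ...".
  current="$(curl --version | head -n 1 | awk '{print $2}')"
  version_ge "$current" "$required" || {
    echo "curl $current is older than $required" >&2
    return 1
  }
  echo "curl $current and sha1sum found"
}
```

Calling check_stagelibs_prereqs before running the stagelibs command prints a short confirmation or an error message describing what is missing.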
The available libraries can be found in the StreamSets documentation or by using the following command from the command line:
$SDC_DIST/bin/streamsets stagelibs -list
This provides a list of all available stage libraries and also whether they are already installed.
To install one or more stage libraries, use the following command from the command line, here again to install the Kafka 0.10.x library:
$SDC_DIST/bin/streamsets stagelibs -install=streamsets-datacollector-apache-kafka_0_10-lib
Use the full name of the libraries that you want to install, as shown in the list above. You can install multiple libraries, separating them with commas. Do not include spaces in the command.
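To avoid typing the comma-separated list by hand, the install command can be wrapped in a small helper. The following is a sketch only: join_with_commas and install_stagelibs are hypothetical names, and the second library name in the usage comment is purely illustrative; take the real names from the -list output.

```shell
#!/bin/sh
# Sketch: build the comma-separated library list for `stagelibs -install`.
# Helper names are hypothetical, not part of StreamSets.

# Join all arguments with commas and no spaces.
join_with_commas() {
  # "$*" joins the arguments using the first character of IFS.
  local IFS=,
  printf '%s' "$*"
}

# Install every library passed as an argument with a single command.
install_stagelibs() {
  "$SDC_DIST/bin/streamsets" stagelibs -install="$(join_with_commas "$@")"
}

# Usage (second name is illustrative, check the -list output first):
#   install_stagelibs streamsets-datacollector-apache-kafka_0_10-lib \
#                     streamsets-datacollector-jdbc-lib
```

Because the list is assembled by the helper, there is no risk of accidentally adding a space between the library names.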
You can also use the stagelibs command to generate a script reflecting the necessary install commands to replicate an existing StreamSets instance.
$SDC_DIST/bin/streamsets stagelibs -installScript
3. Installing using the RESTful API
The third and last option I discovered by accident: the StreamSets RESTful API can be used to manage the stage libraries as well.
You can reach the documentation page of the API by selecting the RESTful API item available in the help menu.
On the Data Collector RESTful API overview page, navigate to the definitions operations.
There are several operations, including one for listing the available stage libraries and others for installing and uninstalling stage libraries.
To install the Kafka 0.10.x library, click on the /v1/stageLibraries/install operation and add the full name of the stage library into the body field:
The body has to be a JSON array, so you have to provide the name of the stage library as a string inside square brackets. Of course, you can also specify multiple libraries. Click on Try it out to install the additional library. The service should answer with response code 200. To activate the library, StreamSets Data Collector has to be restarted.
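The same request can also be sent from the command line with curl, which is what makes this option scriptable. The sketch below makes several assumptions that are not stated in this article: that Data Collector listens on http://localhost:18630, that the API is served under the /rest prefix, that the default admin/admin credentials are in place, and that POST requests need an X-Requested-By header. The function names are hypothetical. Adjust all of this to your installation.

```shell
#!/bin/sh
# Sketch: install stage libraries through the RESTful API.
# URL, /rest prefix, credentials and X-Requested-By header are assumptions.

# Build a JSON array of strings from the arguments.
# No escaping is done, which is fine for plain library names.
json_array() {
  out=""
  for lib in "$@"; do
    out="${out:+$out,}\"$lib\""
  done
  printf '[%s]' "$out"
}

# POST the library names and print only the HTTP response code.
install_via_rest() {
  curl -sS -o /dev/null -w '%{http_code}\n' \
    -u "${SDC_USER:-admin}:${SDC_PASSWORD:-admin}" \
    -X POST \
    -H 'Content-Type: application/json' \
    -H 'X-Requested-By: sdc' \
    -d "$(json_array "$@")" \
    "${SDC_URL:-http://localhost:18630}/rest/v1/stageLibraries/install"
}

# Usage:
#   install_via_rest streamsets-datacollector-apache-kafka_0_10-lib
```

As with the other options, Data Collector still has to be restarted afterwards before the new library becomes available.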
In this article I have presented 3 options for installing additional stage libraries on the StreamSets core distribution.
I really like the idea of a StreamSets core and being able to add libraries as needed. It improves usability, because users now see only what they are supposed to use. In the case of Kafka, there might be only one or two versions installed and visible, instead of 4 (currently with the full version). I hope that this also reduces the runtime footprint of StreamSets, which of course is important when running it on IoT gateway type hardware, such as a Raspberry Pi.
So which one of the 3 options should be used?
This of course depends on your requirements. The first option is a purely manual approach, whereas the second and third can be used to automate the deployment.
A follow-up blog article shows how to handle additional stage libraries when provisioning StreamSets through Docker. With StreamSets 2.1.0.x, the Docker image is no longer a full distribution, but is based on the core tarball only.