Last week in Stream Processing & Analytics 6/28/2016

This is the 20th installment of my blog series around Stream Processing and Analytics.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Apache Flink

Apache Spark Streaming

Apache Kafka

Apache Beam / Google Dataflow

Apache NiFi

StreamSets

New Presentations

New Videos

New Releases

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!

Last week in Stream Processing & Analytics 6/21/2016

This is the 19th installment of my blog series around Stream Processing and Analytics.

Last week all the presentations from the Spark Summit were made available on YouTube. I have included the stream-processing-related talks in the links section below.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Comparison

Apache Storm

Heron

Apache Flink

Apache Spark Streaming

Apache Kafka

Apache Beam / Google Dataflow

StreamSets

Concord

Microsoft Stream Analytics

New Presentations

New Videos

New Releases

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!

Last week in Stream Processing & Analytics 6/13/2016

This is the 18th installment of my blog series around Stream Processing and Analytics.

There were two conferences last week with quite a lot of talks around stream processing: the Spark Summit in San Francisco and Berlin Buzzwords.
Berlin Buzzwords did a good job recording the sessions: all of them are already available, and the ones covering stream processing are listed below.

Last week I also did some work on Oracle Stream Analytics and made Docker support for it available.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Comparison

Apache Storm

Apache Flink

Apache Spark Streaming

Apache Kafka

Apache Beam / Google Dataflow

Apache NiFi / Hortonworks HDF

StreamSets

Concord

Oracle Stream Analytics

Microsoft Stream Analytics

New Presentations

New Videos

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!

Providing Oracle Stream Analytics 12c environment using Docker

Over the past two days I spent some time upgrading the Docker support I had created for Oracle Stream Explorer so that it works with Oracle Stream Analytics (the new name for Oracle Stream Explorer).

I guess I don’t have to introduce Docker anymore; it’s so common today!

Preparation

You can find the corresponding docker project on my GitHub: https://github.com/gschmutz/dockerfiles

Due to the Oracle licensing agreement, the Oracle software itself cannot be provided in the GitHub project. Therefore it’s also not possible to upload a built image to Docker Hub.

So you first have to download the Java 8 SDK as well as the Stream Analytics runtime using your own OTN login. Place these two artifacts in the oracle-stream-analytics/dockerfiles/12.2.1/downloads folder.

Building the Oracle Stream Analytics Docker Install image

Navigate to the dockerfiles folder and run the buildDockerImage.sh script as root:

$ sh buildDockerImage.sh -v 12.2.1 -A

This will take a while if run for the first time, as it downloads the oracle-linux base image first. At the end you should see a message similar to the one below:

  WebLogic Docker Image for 'standalone' version 12.2.1 is ready to be extended: 
    
    --> gschmutz/oracle-osa:12.2.1-standalone

  Build completed in 171 seconds.

This indicates that the OSA base Docker image has been built successfully.

Be aware: this image is not yet executable; it only contains the software, without any domain.

Building an Oracle Stream Analytics Standalone domain

In order to use Oracle Stream Analytics, we have to build a domain. This can be done using Docker as well, by extending the Oracle Stream Analytics image created above and creating an OSA domain in it. Currently there is one sample Dockerfile available in the samples folder, which creates an Oracle Stream Analytics Standalone domain. In the future this will be enhanced with a domain connecting to Spark.
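To show the pattern, here is a rough sketch of what such a domain Dockerfile looks like. This is illustrative only — the domain-creation step is elided, and the real Dockerfile lives in the samples/1221-domain folder of the GitHub project:

```dockerfile
# Illustrative sketch only -- the real Dockerfile is in samples/1221-domain.
# Extend the OSA install image built by buildDockerImage.sh
FROM gschmutz/oracle-osa:12.2.1-standalone

# Password for the OSA user, passed in via --build-arg OSA_PASSWORD=...
ARG OSA_PASSWORD

# ... domain-creation steps go here (see the sample Dockerfile) ...

# Start the standalone server by default
CMD ["startwlevs.sh"]
```

The key idea is simply image layering: the install image stays generic, and each domain is a thin image on top of it.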

To build the 12.2.1 standalone domain, navigate to folder samples/1221-domain and run the following command (use the OSA_PASSWORD parameter to specify the OSA user password):

$ docker build -t 1221-domain --build-arg OSA_PASSWORD=<define> .

There are other build arguments you can use to override the default values of the Oracle Stream Analytics Standalone domain; they are documented in the GitHub project.

Verify you now have this image in place with:

$ docker images

Running Oracle Stream Analytics server

To start the Oracle Stream Analytics server, you can simply run the docker run -d 1221-domain command. The sample Dockerfile defines startwlevs.sh as the default CMD.

$ docker run -d --name=osa -p 9002:9002 1221-domain
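If you prefer Docker Compose, the same run command can be expressed as a docker-compose.yml — this is my own sketch (the GitHub project only documents plain docker run), assuming Compose file format v2:

```yaml
# docker-compose.yml -- equivalent of the docker run command above
version: "2"
services:
  osa:
    image: 1221-domain
    container_name: osa
    ports:
      - "9002:9002"
```

With this in place, docker-compose up -d starts the same container.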

Check the log by entering:

$ docker logs -f osa

After a couple of seconds, the OSA server should be up and running and you can access the Oracle Stream Analytics Web Console at http://localhost:9002/sx.

Connect with user osaadmin and the password you specified above.

Last week in Stream Processing & Analytics 6/6/2016

This is the 17th installment of my blog series around Stream Processing and Analytics.

I really liked Darryl Taft’s article 10 Best Practices for Managing Modern Data in Motion. I think all ten tips are important; here are my five favorites:

  1. Replace Specifying schema with capturing intent: An intent-driven focus on big data helps decrease the effort and time needed to develop and implement pipelines.
  2. Sanitize before Storing: Sanitizing data as close to the source as possible makes data scientists more productive.
  3. Expect and deal with Data Drift: Implementing the right kinds of tools and processes can help mitigate the effects of data drift.
  4. Don’t just count packages, inspect the contents: Analyzing the value of your data can be more important than just measuring throughput and latency.
  5. Decouple for Continual Modernization: Decoupling the stages of data movement allows you to upgrade each as you see fit.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Comparison

Apache Storm

Heron

Apache Spark Streaming

Apache Kafka

Apache NiFi / Hortonworks HDF

StreamSets

Apache Quarks

New Presentations

New Videos

New Releases

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!