Last week in Stream Processing & Analytics 6/28/2016

This is the 20th installment of my blog series around Stream Processing and Analytics.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Apache Flink

Apache Spark Streaming

Apache Kafka

Apache Beam / Google Dataflow

Apache NiFi

StreamSets

New Presentations

New Videos

New Releases

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!

Last week in Stream Processing & Analytics 6/21/2016

This is the 19th installment of my blog series around Stream Processing and Analytics.

Last week all the presentations from the Spark Summit were made available on YouTube. I have included the stream-processing-related talks in the links section below.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Comparison

Apache Storm

Heron

Apache Flink

Apache Spark Streaming

Apache Kafka

Apache Beam / Google Dataflow

StreamSets

Concord

Microsoft Stream Analytics

New Presentations

New Videos

New Releases

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!

Last week in Stream Processing & Analytics 6/13/2016

This is the 18th installment of my blog series around Stream Processing and Analytics.

There were two conferences last week with quite a lot of talks around stream processing: the Spark Summit in San Francisco and Berlin Buzzwords.
Berlin Buzzwords did a good job recording the sessions: all of them are already available, and the ones covering stream processing are listed below.

Last week I also did some work on Oracle Stream Analytics and made Docker support for it available.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Comparison

Apache Storm

Apache Flink

Apache Spark Streaming

Apache Kafka

Apache Beam / Google Dataflow

Apache NiFi / Hortonworks HDF

StreamSets

Concord

Oracle Stream Analytics

Microsoft Stream Analytics

New Presentations

New Videos

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!

Providing Oracle Stream Analytics 12c environment using Docker

Over the past two days I spent some time upgrading the Docker support I had created for Oracle Stream Explorer so that it works with Oracle Stream Analytics (the new name for Oracle Stream Explorer).

I guess I don’t have to introduce Docker anymore; it’s so common today!

Preparation

You can find the corresponding docker project on my GitHub: https://github.com/gschmutz/dockerfiles

Due to the Oracle licensing agreement, the Oracle software itself cannot be provided in the GitHub project. Therefore it’s also not possible to upload a built image to Docker Hub.

So you first have to download the Java 8 SDK as well as the Stream Analytics runtime using your own OTN login. Place these two artifacts in the oracle-stream-analytics/dockerfiles/12.2.1/downloads folder.

Building the Oracle Stream Analytics Docker Install image

Navigate to the dockerfiles folder and run the buildDockerImage.sh script as root:

$ sh buildDockerImage.sh -v 12.2.1 -A

This will take a while if run for the first time, as it downloads the oracle-linux base image first. At the end you should see a message similar to the one below:

  WebLogic Docker Image for 'standalone' version 12.2.1 is ready to be extended: 
    
    --> gschmutz/oracle-osa:12.2.1-standalone

  Build completed in 171 seconds.

This indicates that the OSA base Docker image has been built successfully.

Be aware: this image is not yet executable; it only contains the software, without any domain.

Building an Oracle Stream Analytics Standalone domain

In order to use Oracle Stream Analytics, we have to build a domain. This can be done using Docker as well, by extending the Oracle Stream Analytics image created above and creating an OSA domain in it. Currently there is one sample Dockerfile available in the samples folder, which creates an Oracle Stream Analytics Standalone domain. In the future this will be enhanced with a domain connecting to Spark.
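To show the pattern, here is a rough sketch of what such a domain Dockerfile looks like. This is illustrative only — the domain-creation step is elided, and the real Dockerfile lives in the samples/1221-domain folder of the GitHub project:

```dockerfile
# Illustrative sketch only -- the real Dockerfile is in samples/1221-domain.
# Extend the OSA install image built by buildDockerImage.sh
FROM gschmutz/oracle-osa:12.2.1-standalone

# Password for the OSA user, passed in via --build-arg OSA_PASSWORD=...
ARG OSA_PASSWORD

# ... domain-creation steps go here (see the sample Dockerfile) ...

# Start the standalone server by default
CMD ["startwlevs.sh"]
```

The key idea is simply image layering: the install image stays generic, and each domain is a thin image on top of it.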

To build the 12.2.1 standalone domain, navigate to folder samples/1221-domain and run the following command (use the OSA_PASSWORD parameter to specify the OSA user password):

$ docker build -t 1221-domain --build-arg OSA_PASSWORD=<define> .

There are other build arguments you can use to override the default values of the Oracle Stream Analytics Standalone domain; they are documented in the GitHub project.

Verify you now have this image in place with:

$ docker images

Running Oracle Stream Analytics server

To start the Oracle Stream Analytics server, you can simply run the docker run -d 1221-domain command. The sample Dockerfile defines startwlevs.sh as the default CMD.

$ docker run -d --name=osa -p 9002:9002 1221-domain
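If you prefer Docker Compose, the same run command can be expressed as a docker-compose.yml — this is my own sketch (the GitHub project only documents plain docker run), assuming Compose file format v2:

```yaml
# docker-compose.yml -- equivalent of the docker run command above
version: "2"
services:
  osa:
    image: 1221-domain
    container_name: osa
    ports:
      - "9002:9002"
```

With this in place, docker-compose up -d starts the same container.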

Check the log by entering:

$ docker logs -f osa

After a couple of seconds, the OSA server should be up and running and you can access the Oracle Stream Analytics Web Console at http://localhost:9002/sx.

Connect with user osaadmin and the password you specified above.

Last week in Stream Processing & Analytics 6/6/2016

This is the 17th installment of my blog series around Stream Processing and Analytics.

I really liked Darryl Taft’s article 10 Best Practices for Managing Modern Data in Motion. I think all ten tips are important; here are my five favorites:

  1. Replace Specifying schema with capturing intent: An intent-driven focus on big data helps decrease the effort and time needed to develop and implement pipelines.
  2. Sanitize before Storing: Sanitizing data as close to the source as possible makes data scientists more productive.
  3. Expect and deal with Data Drift: Implementing the right kinds of tools and processes can help mitigate the effects of data drift.
  4. Don’t just count packages, inspect the contents: Analyzing the value of your data can be more important than just measuring throughput and latency.
  5. Decouple for Continual Modernization: Decoupling the stages of data movement allows you to upgrade each as you see fit.

As usual, find below the new blog articles, presentations, videos and software releases from last week:

News and Blog Posts

General

Comparison

Apache Storm

Heron

Apache Spark Streaming

Apache Kafka

Apache NiFi / Hortonworks HDF

StreamSets

Apache Quarks

New Presentations

New Videos

New Releases

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!