Last week in Stream Processing & Analytics 2/21/2016


This is the second installment of my blog series on Stream Processing and Analytics. From now on I will publish at the end of the week (on Sunday or Monday evening), which is why the title “Last week in …” is more appropriate 😉

This might not surprise you, but I have realized that it’s simply not feasible to cover every single news item or blog article published in a given week. I have also decided not to link to anything older than the last 7 days (with one or two exceptions where I thought the content was too valuable to leave out). Nor am I covering every product or framework out there. The idea is to concentrate mainly on innovation in the open source space around “Streaming Analytics”, but also to cover some products from commercial vendors such as Oracle and IBM.

Spark Summit East took place last week, and there will be some interesting new features for Spark Streaming in Spark 2.0, notably Structured Streaming, which unifies streaming, interactive and batch queries and supports SQL queries on streaming data. It was possible to follow the presentations remotely via live stream, which I enjoyed on the 2nd day, but unfortunately the presentations and videos from the event are not yet available. I will include them in next week’s post.

Kafka Connect was officially announced last week. It is a new feature in Kafka 0.9+ that makes building and managing stream data pipelines easier, especially in the area of data capture, and it supports the data integration part of the Kafka Stream Data Platform. There is now quite a variety of tools for handling the data ingestion part of a data processing system, such as Flume, Apache NiFi, StreamSets and Kafka Connect; they are complementary in some areas and overlapping in others. It’s definitely an area worth investigating, and controlled data ingestion becomes even more important in the world of the Internet of Things (IoT).

Last week Cloudera announced support for Kafka 0.9 with release 2.0 of their Kafka distribution. At the same time they published an interesting article on the Cloudera VISION blog, highlighting the maturity and importance of Kafka in modern data processing infrastructure: “While Kafka remains a young technology in the now 10-year-old Hadoop ecosystem, it has unequivocally reached the point of being enterprise-grade software, suitable for mission critical deployments”.

Last but not least, two new projects were announced:

  • Apache Arrow, a new open source project for columnar in-memory data. According to Ted Dunning, “an industry-standard columnar in-memory data layer enables users to combine multiple systems, applications and programming languages in a single workload without the usual overhead”.
  • IBM Quarks, an open source development tool that makes it easier for developers to create Internet of Things (IoT) applications to analyze data on the edge of their networks.

News and Blog Posts

General

Comparisons

Apache Storm

Apache Spark Streaming

Apache Samza

Apache Flink

Apache NiFi / Hortonworks DataFlow

Apache Kafka / Confluent Platform

StreamSets

Microsoft Azure Stream Analytics

IBM Bluemix

IBM Quarks

Apache Arrow

New Presentations

New Videos

New Releases / Components

Upcoming Events

Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!