This is the 17th installment of my blog series around Stream Processing and Analytics.
I really liked Darryl Taft’s article on 10 Best Practices for Managing Modern Data in Motion, which lists 10 tips for managing data in motion. I think all of them are important, but here are my 5 favorites:
- Replace Specifying schema with capturing intent: An intent-driven focus on big data helps decrease the effort and time needed to develop and implement pipelines.
- Sanitize before Storing: Sanitizing data as close to the source as possible makes data scientists more productive.
- Expect and deal with Data Drift: Implementing the right kinds of tools and processes can help mitigate the effects of data drift.
- Don’t just count packages, inspect the contents: Analyzing the value of your data can be more important than just measuring throughput and latency.
- Decouple for Continual Modernization: Decoupling the stages of data movement allows you to upgrade each as you see fit.
As usual, find below the new blog articles, presentations, videos and software releases from last week:
News and Blog Posts
- 10 Best Practices for Managing Modern Data in Motion by Darryl K. Taft
- Merging Batch and Stream Processing in a Post Lambda World by Alex Woodie
- Make Your Data Strategy Work Through Streaming by Morgan Friberg
- Event Stream Processing or Complex Event Processing? by Jules Oudmans
- Lambda Complexity, Fast Data, New Thinking by John Hugg
- Comparison of Event Sourcing with Stream Processing by Jan Senberg
- Streaming Transactional Data into MapR Streams using Oracle GoldenGate for Big Data by Issam Hijazi
- Thoughts on Stream Processing Engines by Praveen Seluka
- Debugging an Apache Storm topology by Taylor Goetz
- What’s New in Apache Storm 1.0 – Part 1 – Enhanced Debugging by Taylor Goetz
- Had it with Apache Storm? Heron swoops to the rescue by Ian Pointer
Apache Spark Streaming
- Microsoft announces major commitment to Apache Spark by Tiffany Wissner
- Spark Streaming: Exploration of Dynamic Batch Sizing by dtsparkblog
- Spark Streaming: Deepgoing Dynamic Batch Sizing and Analyzation on RateController by dbsparkblog
- Kafka Streams – how does it fit the stream processing landscape? by Adam Warski
- Log Compaction | Highlights in the Apache Kafka and Stream Processing Community | June 2016 by Guozhang Wang
- Confluent Platform 3.0 Supports Kafka Streams for Real-Time Data Processing by
- Write An Apache Kafka Custom Partitioner by howtoprogram
- Kafka streaming gets a new twist by Jack Vaughan
Apache NiFi / Hortonworks HDF
- Apache NiFi Not From Scratch by Gary Stiehr
- Analyzing Salesforce Data with StreamSets, Elasticsearch, and Kibana by Pat Patterson
- Announcing Data Collector ver 188.8.131.52 by Kirit Basu
- Apache Quarks, Watson, and Streaming Analytics: Saving the world, one smart sprinkler at a time by Samantha Chan
- Apache nifi – better analytics demands better data flow by Abhishek Solanki
- Stream All Things – Real-time Data Integration at Scale with Apache Kafka by Gwen Shapira
- Watermarks – Measuring Time and Progress in Streaming Pipelines by Slava Chernyak
- The Evolution of Massive-Scale Data Processing by Tyler Akidau
- Spark streaming: Best Practices by Prakash Cockalingam
- Stream Processing – Key Driver for Enabling Instant Insights on Big Data by Mohit Jotwani
- Data Economy – Where’s Big Data Technology Headed by Data Economy
- Evolving Your Big Data Use Cases from Batch to Real-Time by Steve Abraham
- So You Think You Can Stream – Use Cases and Design Patterns for Spark Streaming by Prakash Cockalingam & Vida Ha
- Real-time fraud detection using process mining with Spark streaming by Bolke de Bruin and Hylke Hendriksen
- 6/6/2016 (San Francisco, US) – Spark Meetup At Spark Summit (Meetup)
- 6/7/2016 (online) – Confluent Control Center – Webinar (Confluent Webinar)
- 6/8/2016 (online) – dW Open Tech Talk: Apache Quarks (Webcast)
- 6/9/2016 (Paris, FR) – Building a Real-time Streaming Platform Using Kafka Streams and Kafka Connect (Meetup)
- 6/13/2016 (Amsterdam, NL) – GOTO Night: Stream Processing with Apache Flink and Mining Github (Meetup)
- 6/13/2016 (New York, US) – Apache Beam (Stream Processing @ Scale Track at QCon New York)
- 6/14/2016 (Lausanne, CH) – Apache Kafka A high-throughput distributed messaging system (Meetup)
- 6/15/2016 (Mountain View, US) – Stream Processing Meetup @ LinkedIn (Meetup)
- 6/15/2016 (London, UK) – Hortonworks Dataflow (HDF) Meetup London (Meetup)
- 6/16/2016 (Garden City, US) – Moving data gracefully with Apache NiFi (Meetup)
- 6/27/2016 (San Jose, US) – Building Big Data applications with Apache Beam and Apache Apex (Meetup)
- 7/5/2016 (San Francisco, US) – Building (and running) Netflix’s Data Pipeline using Apache Kafka (Meetup)
- 8/18/2016 (New York, US) – Apache NiFi – MiNiFi: Taking Dataflow Management to the Edge (Meetup)
Please let me know if that is of interest. Please tweet your projects, blog posts, and meetups to @gschmutz to get them listed in next week’s edition!