Introducing Spark Streaming

Apache Spark Streaming + Kafka 0.10 (2/4)


by Joan Viladrosa, Samuel Lissner.

This story is part of a series of articles on the integration of Apache Spark Streaming and Apache Kafka. Feel free to check out the others here.

What is Apache Spark Streaming?

Apache Spark is the trending big data technology right now. It has been disruptive to the established Big Data and Data Science ecosystem, and it certainly comes with impressive capabilities.

Spark Streaming brings stream processing to Apache Spark’s language-integrated API; that is, it lets you write streaming jobs the same way you write batch jobs. This is its main advantage: Spark Streaming relies on the same API, integrations, guarantees and semantics as Spark batch processing. As for programming languages, it supports Java, Scala and Python.
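As a quick taste, here is a minimal sketch of what such a job can look like in Scala. The socket source, host, port, local master and the 10-second batch interval are placeholder choices for illustration, not part of any real setup:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MinimalStreamingJob {
  def main(args: Array[String]): Unit = {
    // Local master and a 10-second batch interval, just for a quick local test
    val conf = new SparkConf().setAppName("minimal-streaming-job").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // A socket text stream is used here only as a simple example source
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same transformations you would write in a batch job
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Apart from the StreamingContext and the batch interval, the transformations are exactly the ones you would apply to a static dataset.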

How does it work?

Spark Streaming is built upon a micro-batch approach: it discretizes the input data stream into batches of data. The Spark engine then processes the well-known basic data structure of Spark, Resilient Distributed Datasets (RDDs for short), which are just distributed chunks of the original data. There are also some other great features provided out of the box, like windowing functions, but those are out of the scope of this post.

So now we have those discretized streams, and we can apply transformations to them just like we would conventionally, because under the hood the Spark Streaming engine applies those transformations to each RDD of the stream.
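To make that concrete, here is a small sketch; the lines DStream is assumed to come from a snippet like the one above. The high-level DStream operation and the explicit per-RDD version via transform do the same thing for every micro-batch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object PerBatchView {
  // 'lines' is a hypothetical DStream[String], e.g. the one from the previous snippet
  def describeBatches(lines: DStream[String]): Unit = {
    // A DStream transformation...
    val upper: DStream[String] = lines.map(_.toUpperCase)

    // ...is the same RDD transformation applied to every micro-batch,
    // which transform() makes explicit:
    val upperExplicit: DStream[String] =
      lines.transform((rdd: RDD[String]) => rdd.map(_.toUpperCase))

    // foreachRDD exposes each micro-batch as a plain RDD for output actions
    upper.foreachRDD { rdd =>
      println(s"micro-batch with ${rdd.count()} records")
    }
    upperExplicit.foreachRDD { rdd =>
      println(s"same count, computed per RDD: ${rdd.count()}")
    }
  }
}
```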

What really makes Spark Streaming cool compared to other streaming architectures is task balancing (distributing tasks among nodes when the stream’s partitions are unbalanced) and recovery in case of node failure. The latter, although really useful, has a side effect: we cannot guarantee exactly-once semantics for output actions. You have probably already faced this kind of problem in batch applications; in streaming applications it becomes much more noticeable, since your application will probably be running 24/7 non-stop.
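A common way to live with that limitation is to make the output idempotent, so that a replayed task overwrites its previous results instead of duplicating them. Here is a rough sketch of the idea; the Event type, the KeyValueClient trait and the newClient factory are made up for illustration and are not a real API:

```scala
import org.apache.spark.streaming.dstream.DStream

object IdempotentOutput {
  // Hypothetical event type and key-value client, used only for illustration
  case class Event(id: String, payload: String)

  trait KeyValueClient {
    def put(key: String, value: String): Unit
    def close(): Unit
  }

  // Trivial stand-in: a real sink would be a key-value store, a database upsert, etc.
  def newClient(): KeyValueClient = new KeyValueClient {
    def put(key: String, value: String): Unit = println(s"$key -> $value")
    def close(): Unit = ()
  }

  def saveIdempotently(events: DStream[Event]): Unit = {
    events.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val client = newClient() // one connection per partition, created on the executor
        partition.foreach { event =>
          // Writing under a deterministic key: a replayed task overwrites
          // its previous output instead of producing a duplicate
          client.put(event.id, event.payload)
        }
        client.close()
      }
    }
  }
}
```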

How does Spark Streaming’s Kafka Integration work?

As we said, a lot has changed over time, on both the Spark and the Kafka side, to make this integration more fault-tolerant and reliable. Here is a quick overview of the changelog:

Original Approach with Receivers

The original approach to integrating Kafka with Spark was through receivers: basically, an Apache Kafka high-level API consumer instantiated on each Spark executor. The receivers consumed messages asynchronously and stored them in executor memory, while updating the offsets in ZooKeeper. Later on, the Spark Streaming driver launched jobs on the data already received. Although this approach worked, it had very poor fault-tolerance guarantees: messages could be lost on restarts, node failures, and so on.
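In code, the receiver-based integration looked roughly like the sketch below (using the old spark-streaming-kafka 0.8 artifact); the ZooKeeper quorum, consumer group and topic are placeholder values:

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverBasedExample {
  // ZooKeeper quorum, consumer group and topic are placeholder values
  def receiverBasedStream(ssc: StreamingContext) = {
    val zkQuorum = "zookeeper-host:2181"
    val groupId  = "my-consumer-group"
    val topics   = Map("my-topic" -> 1) // topic -> number of receiver threads

    // Returns a DStream of (key, value) pairs; the received data lives in
    // executor memory and the offsets are tracked in ZooKeeper
    KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
  }
}
```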

Original approach with receivers

Fault-Tolerant Write-Ahead Log

To solve these issues, the next version (released just a few months later) included a WAL (Write-Ahead Log). The main idea here is to write all the information that the receivers receive on each executor into a durable filesystem such as HDFS, and not only into memory. In case of failure or loss of in-memory data, we can recover the state by consulting this WAL.
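Enabling the WAL is mostly a matter of configuration, as in the following sketch; the checkpoint path, ZooKeeper quorum, group, topic and batch interval are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverWithWal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("receiver-with-wal")
      // Ask the receivers to also persist incoming data to the write-ahead log
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // The WAL and the checkpoint metadata live in a durable filesystem such as HDFS
    ssc.checkpoint("hdfs:///checkpoints/receiver-with-wal")

    // With the WAL enabled, in-memory replication becomes redundant, so a
    // serialized, non-replicated storage level is usually enough
    val stream = KafkaUtils.createStream(
      ssc, "zookeeper-host:2181", "my-consumer-group", Map("my-topic" -> 1),
      StorageLevel.MEMORY_AND_DISK_SER)

    stream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```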

Fault tolerant WAL

If we take a closer look, we see that not only the actual data is written into the log, but also the metadata of the receivers that the driver’s Streaming Context handles. This approach, combined with the already existing checkpointing (which is almost the same idea applied to the streaming context), leads us to a fault-tolerant solution that guarantees no data loss.
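On the driver side, checkpointing typically follows the pattern below: on a clean start the context is built from scratch, and on a restart it is recovered from the checkpoint directory. This is only a sketch; the checkpoint path and the placeholder socket source are illustrative values:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  val checkpointDir = "hdfs:///checkpoints/my-streaming-app"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpointed-app")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    // Placeholder source and output; a real application defines its streams here
    ssc.socketTextStream("localhost", 9999).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a restart, the driver rebuilds the StreamingContext (and the metadata of
    // pending batches) from the checkpoint instead of calling createContext again
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```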

But considering that Kafka already has its own retention-based log to recover from, this solution seems a bit of an overkill: we could just re-read the lost data from Kafka itself, so it makes little sense to store and process all the information again in another durable filesystem. It also increases complexity by inflating the number of dependencies (the more you have, the more can fail). All this led to a totally new approach that both simplifies the usage and improves the performance and fault-tolerance.

Storing metadata and data to WAL
Recovering metadata and data from WAL

Direct Streams or Kafka Direct Stream Integration

Apache Spark 1.3 introduced a new approach to the streaming integration that does not need receivers and offers a simpler way to interact with Kafka from Spark Streaming. The main shift was to use Kafka’s “simple” consumer API instead of the high-level API, taking advantage of the core feature of Kafka: it is a durable log with configurable retention. It is no longer necessary to store Kafka data in memory or add an additional Write-Ahead Log, because we can simply request the data directly from Kafka.
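With the spark-streaming-kafka-0-10 artifact (the integration this series focuses on), creating a direct stream looks roughly like the sketch below; the broker address, group id and topic name are placeholder values:

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object DirectStreamExample {
  def directStream(ssc: StreamingContext): InputDStream[ConsumerRecord[String, String]] = {
    // Placeholder connection settings
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka-broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "my-consumer-group",
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // No receivers involved: each partition of the resulting stream maps 1:1 to a
    // Kafka (topic, partition) and its data is read on demand from the brokers
    KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
    )
  }
}
```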

The first step runs on the Spark Streaming driver: it asks Kafka for the latest offsets of each partition of the topic. With this information (topic, partition, start offset and end offset), the driver launches deterministic and repeatable jobs on the executors, in the right order. Each executor then processes its tasks while retrieving the data from Kafka. If a task fails due to a node failure, it does not matter: we have not lost anything. We just re-schedule the task on another executor and retrieve the data again from Kafka.

You may notice that the ZooKeeper dependency is gone: since we are no longer using the high-level API, there is no need for a coordinator of the consumer group. You are still in charge of storing those offsets and keeping them safe in case of driver failure or application restart, so you will probably keep storing them in ZooKeeper or some other place, but at least we have reduced the dependencies of the basic use case.
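Here is a sketch of what that offset bookkeeping can look like with the 0.10 integration; saveOffsets is a hypothetical function standing in for ZooKeeper, a database, or wherever you decide to keep the offsets:

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, OffsetRange}

object OffsetTracking {
  def processAndTrackOffsets(
      stream: InputDStream[ConsumerRecord[String, String]],
      saveOffsets: Array[OffsetRange] => Unit): Unit = {

    stream.foreachRDD { rdd =>
      // Each RDD produced by the direct stream knows exactly which
      // (topic, partition, fromOffset, untilOffset) ranges it covers
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ... process rdd and write the results somewhere ...

      // Persist the offsets only once the output has succeeded,
      // in whatever store you trust (ZooKeeper, a database, ...)
      saveOffsets(offsetRanges)

      // Or let the integration commit them back to Kafka asynchronously
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
  }
}
```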

Kafka Direct Stream using Simple API

Other Streaming Improvements

Over the versions, the Spark Streaming UI has received some improvements: 1.4 added graphs and histograms showing how our application is handling the batches, and 1.5 included metadata about the direct streams.

We won’t discuss all those capabilities in detail, but it’s always good to know what your application is doing. So 👍 to the improvements in monitoring Spark Streaming applications.

Timelines and Histograms from Spark 1.4+
Details of each Spark Streaming micro batch

Enough theory! In our next article we will present some code examples and a small demo on how to tie together Spark Streaming and Apache Kafka in practice.
