What Spark's Structured Streaming really means

Thanks to an impressive grab bag of improvements in version 2.0, Spark's quasi-streaming solution has become more powerful and easier to manage

Last year was a banner year for Spark. Big names like Cloudera and IBM jumped on the bandwagon, companies like Uber and Netflix rolled out major deployments, and Databricks' aggressive release schedule brought a brace of improvements and new features. Yet real competition for Spark also emerged, led by Apache Flink and Google Cloud Dataflow (aka Apache Beam).

Flink and Dataflow bring new innovations and target some of Spark's weaker aspects, particularly memory management and streaming support. Spark has not been standing still in the face of this competition, however; big efforts were made last year to improve Spark's memory management and query optimizer.

Moreover, this year will usher in Spark 2.0 -- and with it a new twist for streaming applications, which Databricks calls "Structured Streaming."

Structured Streaming is a collection of additions to Spark Streaming rather than a huge change to Spark itself. In other words, for all of you Spark jockeys out there: The fundamental concept of microbatching at the core of Spark's streaming architecture endures.

Infinite DataFrames

If you've read one of my previous Spark articles or attended any of my talks over the past year or so, you'll have noticed that I make the same point again and again: Use DataFrames whenever you can, not Spark's RDD primitive.

DataFrames get the benefit of the Catalyst query optimizer, and as of 1.6, typed DataFrames (Datasets) can take advantage of dedicated encoders that allow significantly faster serialization and deserialization (an order of magnitude faster than the default Java serializer). Furthermore, in Spark 2.0, DataFrames come to Spark Streaming through the concept of an infinite DataFrame. Creating such a DataFrame from a stream is simple:

val records = sqlContext.read.format("json").stream("kafka://KAFKA_HOST")

(Note: The Structured Streaming APIs are in constant flux, so while the code snippets in this article provide a general idea of how the code will look, they may change between now and the release of Spark 2.0.)

This results in a streaming DataFrame that can be manipulated in exactly the same way as the more familiar batch DataFrame -- using custom user-defined functions (UDFs), for example. Behind the scenes, those results will be updated as new data flows from the stream source. Instead of a disparate set of avenues into data, you'll have one unified API for both batch and streaming sources. And of course, all your queries on the DataFrames will call on the Catalyst Optimizer to produce efficient operations across the cluster.
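For example, an ordinary UDF can be applied to the streaming DataFrame exactly as it would be to a batch one. Here is a minimal sketch that assumes the records created above contain a hypothetical "endpoint" field; as the note above says, the exact API may shift before release:

import org.apache.spark.sql.functions.udf

// Define a UDF exactly as you would for a batch DataFrame.
val normalizePath = udf((path: String) => path.toLowerCase.stripSuffix("/"))

// Apply it to the streaming DataFrame; behind the scenes the results are
// kept up to date as new records arrive from the stream source.
val cleaned = records.withColumn("endpoint", normalizePath(records("endpoint")))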

Repeated queries

That's all well and good, as far as it goes. It makes moving between batch and streaming applications easier for developers. But the real importance of Structured Streaming lies in Spark's abstraction of repeated queries (RQ). In essence, the RQ model holds that the majority of streaming applications can be seen as asking the same question over and over again (for example, "How many people visited my website in the past five minutes?").

It works like this: Users specify a query against the DataFrame in exactly the manner they do now. They also specify a "trigger," which stipulates how often the query should run (the system will make a best effort to ensure this happens). Finally, the user specifies an "output mode" for the repeated queries, together with a data sink for output.

There are four different types of output mode:

  • Append: The simplest type of output, which appends new records to the data sink.
  • Delta: Using a supplied primary key, this mode will write a delta log, indicating whether records have been updated, added, or removed.
  • Update-in-place: Again using a supplied key, this mode will update records in the data sink, overwriting existing data.
  • Complete: For each trigger, a complete snapshot of the query result is returned.

To see what that looks like in code, consider an example that reads JSON messages from a Kafka stream, creates counts based on a user field every five seconds, then updates a MySQL table with those counts using the same user field as a primary key:

val records = sqlContext.read.format("json").stream("kafka://[KAFKA_HOST]")

val counts = records.groupBy("user").count()

counts.write
  .trigger(ProcessingTime("5 sec"))
  .outputMode(UpdateInPlace("user"))
  .format("jdbc")
  .startStream("mysql://...")

Those few lines of code alone do an awful lot of work. You can imagine building a change-data-capture solution by simply changing outputMode to Delta in the above example. Spark 2.x will also provide more powerful options for windowing data in conjunction with these output modes, which should make handling out-of-order event data much easier than in existing Spark Streaming applications.
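For instance, the change-data-capture variant mentioned above might look something like the following sketch. It reuses the same pre-release API as the example above, so names like Delta, ProcessingTime, and startStream should be treated as provisional:

// Same counts DataFrame as in the example above, but emitting a keyed
// delta log of added, updated, and removed records instead of
// overwriting rows in place.
counts.write
  .trigger(ProcessingTime("5 sec"))
  .outputMode(Delta("user"))
  .format("jdbc")
  .startStream("mysql://...")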

Querying off the cuff

Structured Streaming has two other features that, while not yet fully fleshed out, have the potential to be very useful. The first of these is support for ad-hoc queries.

Let's say you want to know how many people logged into your website in 15-minute intervals over the past two days. Currently, you'd have to set up a new process to talk to the Kafka stream and build up the new query. With Structured Streaming, you'll be able to connect to your already running Spark application and issue an ad-hoc query. This has the potential to bring data scientists much closer to the real-time furnace of data-gathering, which can only be a good thing.
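What that will look like in practice is still speculative, but a sketch helps. Assuming the running application has exposed its streaming DataFrame as a table named "logins" with an "eventTime" column (both illustrative assumptions, not part of any announced API), the ad-hoc question above could be phrased in ordinary Spark SQL:

// Hypothetical ad-hoc query issued against a running streaming application.
// The "logins" table and "eventTime" column are assumptions for illustration.
val recentLogins = sqlContext.sql("""
  SELECT window(eventTime, '15 minutes') AS interval, count(*) AS logins
  FROM logins
  WHERE eventTime > current_timestamp() - INTERVAL 2 DAYS
  GROUP BY window(eventTime, '15 minutes')
""")

recentLogins.show()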

For operators who have to keep all these items running 24/7, the same feature that enables ad-hoc querying will also let the Spark application update its runtime operations dynamically. This can alleviate some of the pain around updating a Spark Streaming application in production today.

When a stream is not a stream

Structured Streaming is an impressive grab bag of improvements to Spark Streaming that will assist in common application patterns like ETL and CDC processes. The current plan is that Spark 2.0 will have the initial API implementation, marked as "experimental" and with support for Kafka, while other sources and sinks (plus the ad-hoc/dynamically updating queries) will be rolled out across the 2.x series. We’re likely to see Spark 2.0 released in May.

Many of you may point out that while all this is great, it doesn't resolve the key failing of Spark Streaming in comparison to Apache Beam or Apache Flink -- that it's based on microbatching rather than a pure streaming solution. This is true! But Structured Streaming decouples the developer from the underlying microbatching architecture, which allows Spark maintainers to potentially swap out that architecture for a pure streaming option in a future, major revision.

We'll have to see what happens on the road to Spark 3.0. But the additions to 2.0 will likely help Spark remain in a commanding position in 2016.
