Why Spark 1.6 is a big deal for big data

Already the hottest thing in big data, Spark 1.6 turns up the heat. Here are the high points, including improved streaming and memory management

Big data got off to a roaring start in 2016 with the release of Spark 1.6 last week. You can rely on the Spark team to deliver useful features in a point release, but Spark 1.6 goes a step further, offering a mini-milestone on the way to Spark 2.0.

The new features and improvements in Spark 1.6 will make both developers and operators very happy. Let's take a look at some of the highlights.

Automatic memory management

If you've talked to people who've used Spark in production, you'll often hear them complain about the hand-tuning required to optimize Spark's memory management. In particular, you can spend days poring over garbage collection traces to tune the static split between execution memory (for shuffles, sorts, and joins) and the storage memory used for caching hot data.

In fairness, the Spark team has recognized the difficulty of memory management in Spark and has improved it immensely with their Project Tungsten work over the past few releases. Spark 1.6, I'm happy to report, adds a new memory manager that automatically adjusts memory usage depending on the current execution plan. This won't solve all your memory-tuning issues, but it will make life easier for many a harried data systems administrator.
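
If you do still need to tune by hand, the unified model reduces to a couple of settings rather than the old static split. Here's a quick sketch (these are normally set in spark-defaults.conf or passed to spark-submit; the values shown are simply the 1.6 defaults, not recommendations):

import org.apache.spark.SparkConf

// Execution and storage now share a single memory region. spark.memory.fraction
// sets that region's overall size, and spark.memory.storageFraction is only a
// floor below which cached blocks won't be evicted by execution.
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.75")
  .set("spark.memory.storageFraction", "0.5")
// To fall back to the pre-1.6 static split: .set("spark.memory.useLegacyMode", "true")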

Streaming improvements

Spark Streaming offers great promise, but it has suffered in comparison to competing frameworks such as Storm. We've seen large streaming aggregation pipelines that simply can't cope with big state changes, because Spark's updateStateByKey API has to hold the entire working set in memory.

Spark 1.6 adds a new API, mapWithState, that works with deltas instead of the entire working set, immediately offering speed and memory improvements over the previous approach. If your experiments with Spark Streaming failed in the past due to issues with updateStateByKey, I encourage you to re-examine the issue. Databricks reports up to 10-fold speed improvements over Spark 1.5.
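
Here's a minimal sketch of what that looks like for a running word count (the socket source, batch interval, and checkpoint directory are made up for the example):

import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
ssc.checkpoint("/tmp/state-checkpoint")  // stateful streaming requires a checkpoint directory

val wordPairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// The mapping function only sees the keys present in the current batch, so each
// update is proportional to the batch size rather than the total state size.
def updateCount(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
  val newCount = state.getOption.getOrElse(0) + one.getOrElse(0)
  state.update(newCount)
  (word, newCount)
}

val runningCounts = wordPairs.mapWithState(StateSpec.function(updateCount _))
runningCounts.print()

ssc.start()
ssc.awaitTermination()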

ML persistence

MLlib is becoming an amazing compilation of machine learning tools and algorithms. Spark 1.6 adds to the collection, with highlights that include improvements to k-means clustering, model summary statistics, and more R support (have a look at the release notes for the whole list).

The best new feature added to MLlib in this release is Pipeline persistence. While you've been able to persist models in previous versions, Spark 1.6 allows you to export and import workflows of Estimators, Transformers, Models, and Pipelines to and from external storage. This should reduce the amount of code required in a production environment to set up an ML Pipeline, as well as greatly ease experimentation (because you can swap out the entire Pipeline by simply loading a different workflow). I'm excited about using this feature in the near future.
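
Here's a rough sketch of how that could look with a simple text-classification pipeline (the stages and the storage path are made up for illustration, and not every algorithm gained save/load support in this release):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A small, unfitted workflow: tokenize text, hash it into features, train a classifier.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Export the workflow to external storage ...
pipeline.write.save("/models/text-classifier-pipeline")

// ... and load it back later (or on another cluster) to rerun or tweak the experiment.
val restored = Pipeline.load("/models/text-classifier-pipeline")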

Datasets and the road to Spark 2.0

Perhaps the most important new addition to Spark 1.6 is Datasets, which are typed extensions that bring RDD-like operations to Dataframes. In many ways, you can think of them as a bridge across the remaining gap between your old-style, RDD-focused code and the new Dataframe approach.

Let's take a look at the canonical word-count approach. In the RDD world, we'd do something like this:

val lines = sc.textFile("shakespeare.txt")
val words = lines.flatMap(_.split(" "))
val counts = words.groupBy(_.toLowerCase).map(w => (w._1, w._2.size))

But in the Dataframes/Dataset approach, we'd first use the as() method to convert our Dataframe into a Dataset of type String (we can use case classes or Java Beans to type more complex data):

val lines = sqlContext.read.text("shakespeare.txt").as[String]
val words = lines.flatMap(_.split(" "))
val counts = words.groupBy(_.toLowerCase).count()

You can see that we still have access to traditional RDD-like functions such as flatMap(), as well as the standard Dataframe count() aggregation.

This may not seem all that exciting, but under the hood, quite a bit is going on. Given that this is running on top of a Dataframe, the execution plan is created via the Catalyst query planner, meaning it will produce a more optimized plan than you'd get with the old RDD code. Work is continuing on a dynamic optimization process that will allow Catalyst to alter the execution plan if it decides it can process incoming data in a more efficient fashion.
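
If you're curious what Catalyst comes up with, you can convert a Dataset back to a Dataframe and print its plan; applied to the counts Dataset from the word-count example above:

// Prints the logical and physical plans Catalyst generated for the query.
counts.toDF().explain(true)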

Also, the type information provided when a Dataset is created is used to generate an Encoder for serialization. If you've used Spark in production, you've likely run into memory issues caused by the overhead of Java serialization when sending data across the cluster (and you've likely switched to Kryo serialization as a result). Encoders are optimized code generators that produce custom bytecode for serialization and deserialization; Spark 1.6 includes Encoders for primitives, case classes, and JavaBeans, and the API will be opened up in future releases. According to Databricks, they provide more than 10 times the performance of Kryo while using less memory.
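
As a quick illustration, here's how a typed Dataset might be built around a case class (the Person schema and people.json file are made up for the example):

import sqlContext.implicits._  // brings Encoders for primitives and case classes into scope

case class Person(name: String, age: Long)

// The Encoder for Person is generated from the case class itself and stores rows in
// Tungsten's compact binary format instead of Java- or Kryo-serialized objects.
val people = sqlContext.read.json("people.json").as[Person]
val adults = people.filter(_.age >= 18)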

Datasets are going to be an important part of Spark 2.0. Although the Dataset API is marked as "experimental" in Spark 1.6, I'd suggest diving in right now to get a head start on the new approach to manipulating your data.

And the rest ...

There's more to Spark 1.6 than these highlights. Again, check the release notes for everything that's been added (I'm happy to see better JSON support and multitenant SQLContexts, for example). With Spark 2.0 on the horizon, perhaps this will be the year that Spark becomes the default big data processing platform.
