Spark 2.0 prepares to catch fire

Today, Databricks subscribers can get a technical preview of Spark 2.0. Improved performance, SparkSessions, and streaming lead a parade of enhancements

Apache Spark 2.0 is almost upon us. If you have an account on Databricks’ cloud offering, you can get access to a technical preview today; for the rest of us, it may be a week or two, but by Spark Summit next month, I expect Apache Spark 2.0 to be out in the wild. [Editor's note: The preview release can now be downloaded from the Apache Spark site.] What should you look forward to?

During the 1.x series, development of Apache Spark often proceeded at a breakneck pace, with all sorts of features (ML pipelines, Tungsten, the Catalyst query planner) added along the way in minor version bumps. Given that history, and that Apache Spark follows semantic versioning rules, you can expect 2.0 to make breaking changes and add major new features.

Unify DataFrames and Datasets

One of the main reasons for the new version number won’t be noticed by many users: In Spark 1.6, DataFrames and Datasets are separate classes; in Spark 2.0, a DataFrame is simply an alias for a Dataset of type Row.

This may mean little to most of us, but such a big change in the class hierarchy is why we're looking at Spark 2.0 instead of Spark 1.7. Java and Scala applications now get compile-time type safety through the unified Dataset API, and both the typed methods (map, filter) and the untyped methods (select, groupBy) are available across DataFrames and Datasets.
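To make that concrete, here's a minimal sketch; the Person case class, the people.json file, and the spark variable are illustrative assumptions of mine, not anything from the release notes:

import spark.implicits._

// A DataFrame is now just an alias for Dataset[Row]
case class Person(name: String, age: Long)

// Untyped methods on the DataFrame (a Dataset[Row])
val df = spark.read.json("people.json")
df.select("name").show()
df.groupBy("age").count().show()

// Typed methods on a Dataset[Person], checked at compile time
val people = df.as[Person]
people.filter(_.age > 21).map(_.name).show()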

The all-new and improved SparkSession

A common question when working with Spark: "So, we have a SparkContext, a SQLContext, and a HiveContext. When should I use one and not the others?" Spark 2.0 introduces a new SparkSession object that reduces confusion and provides a consistent entry point for computation with Spark. Here’s what creating a SparkSession looks like:

val sparkSession = SparkSession.builder
  .master("local")
  .appName("my-spark-app")
  .config("spark.some.config.option", "config-value")
  .getOrCreate()

If you use the REPL, a SparkSession is automatically set up for you as spark. Want to read data into a DataFrame? Well, it should look somewhat familiar:

spark.read.json("JSON URL")

In another sign that operations using Spark's initial abstraction of Resilient Distributed Dataset (RDD) are being de-emphasized, you'll need to get at the underlying SparkContext using spark.sparkContext to create RDDs. Once again, RDDs aren't going away, but the preferred DataFrame paradigm is becoming more and more prevalent, so if you haven’t worked with them yet, you will soon.
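For example, a quick sketch (the variable names are mine):

// Drop down from the SparkSession to the SparkContext when you really need an RDD
val sc = spark.sparkContext
val numbers = sc.parallelize(1 to 100)
println(numbers.map(_ * 2).sum())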

For those of you who have jumped into SparkSQL with both feet and discovered that sometimes you had to fight the query engine, Spark 2.0 has some extra goodies for you as well. There's a new SQL parsing engine that includes support for subqueries and many SQL 2003 features (though it doesn't claim full compliance yet), which should make porting legacy SQL applications to Spark a much more pleasant affair.
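As a rough illustration of the kind of statement the new parser handles, here's an uncorrelated subquery in a WHERE clause; the employees view and its columns are hypothetical:

// Assumes a registered view named "employees" with a numeric salary column
val aboveAverage = spark.sql("""
  SELECT name, salary
  FROM employees
  WHERE salary > (SELECT avg(salary) FROM employees)
""")
aboveAverage.show()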

Structured Streaming

Structured Streaming is likely to be the new feature that everybody is excited about in the weeks and months to come. With good reason! I went into a lot of detail about what Structured Streaming is a few weeks ago, but as a quick recap, Apache Spark 2.0 brings a new paradigm for processing streaming data, moving away from the batched processing of RDDs to a concept of a DataFrame without bounds.

This will make certain types of streaming scenarios, like change-data-capture and update-in-place, much easier to implement, and it allows windowing on time columns in the DataFrame itself rather than on the moment new events enter the streaming pipeline. This has been a long-running thorn in Spark Streaming's side, especially in comparison with competitors like Apache Flink and Apache Beam, so this addition alone will make many happy to upgrade to 2.0.
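To give a flavor of the API, here's a hedged sketch of a windowed count over a socket source, in the spirit of the windowed word-count examples bundled with the preview; the host, port, window sizes, and column names are illustrative assumptions:

import org.apache.spark.sql.functions.window
import spark.implicits._

// Read lines from a socket as an unbounded DataFrame; includeTimestamp adds
// a time column alongside each line
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

// Window on the timestamp column in the DataFrame itself, not on arrival time
val windowedCounts = lines
  .groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"value")
  .count()

val query = windowedCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()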

Performance improvements

Much effort has been spent on making Spark run faster and smarter in 2.0. The Tungsten engine has been augmented with bytecode optimizers that borrow techniques from compilers to reduce function calls and keep the CPU occupied efficiently during processing.
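One rough way to see this at work, assuming the optimizations in question surface as whole-stage code generation in the query plan, is to ask Spark to explain itself; operators that have been collapsed into generated code show up prefixed with an asterisk:

// A tiny sketch: aggregate a synthetic range and print the physical plan
spark.range(1000L * 1000)
  .selectExpr("sum(id)")
  .explain()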

Parquet support has been improved, resulting in a 10-fold speed-up in some cases, and the use of Encoders over Java or Kryo serialization, first seen in Spark 1.6, continues to reduce memory usage and increase throughput in your cluster.
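If you haven't used Encoders yet, here's a minimal sketch (the Point case class and its values are hypothetical); converting a collection to a Dataset routes it through Spark's Encoder machinery instead of Java or Kryo serialization:

import spark.implicits._

case class Point(x: Double, y: Double)

// toDS() uses an implicitly derived Encoder to store the data in a compact
// binary format that the engine can operate on directly
val points = Seq(Point(1.0, 2.0), Point(3.0, 4.0)).toDS()
points.map(p => p.x + p.y).show()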

ML/GraphX

If you’re expecting big changes in the machine learning and graph processing side of Spark, you might be a touch disappointed. The important change to Spark's machine learning offerings is that development in the spark.mllib library is frozen. You should instead use the DataFrame-based API in spark.ml, which is where development will be concentrated going forward.

Spark 2.0 brings full support for model and ML pipeline persistence across all of its supported languages and makes more of the MLlib API available to Python and R for all of your data scientists who recoil in terror from Java or Scala.
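Here's a hedged sketch of what that persistence looks like in the spark.ml API; the pipeline stages and the save path are illustrative assumptions:

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A simple text-classification pipeline
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array[PipelineStage](tokenizer, hashingTF, lr))

// New in 2.0: pipelines (and, after fit(), their models) can be saved and reloaded
pipeline.write.overwrite().save("/tmp/text-pipeline")
val restored = Pipeline.load("/tmp/text-pipeline")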

As for GraphX, it seems to be a bit unloved in Spark 2.0. Instead, I'd urge you to keep an eye on GraphFrames. Currently released separately from the main distribution, GraphFrames builds a graph processing framework on top of DataFrames that is accessible from Java, Scala, Python, and R. I wouldn't be surprised if this UC Berkeley/MIT/Databricks collaboration finds its way into Spark 3.0.
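For the curious, here's a hedged taste of GraphFrames as it stands today; it requires the separate graphframes package, and the vertex and edge data are hypothetical:

import org.graphframes.GraphFrame

// Vertices need an "id" column; edges need "src" and "dst" columns
val vertices = spark.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol")
)).toDF("id", "name")

val edges = spark.createDataFrame(Seq(
  ("a", "b", "follows"), ("b", "c", "follows")
)).toDF("src", "dst", "relationship")

val graph = GraphFrame(vertices, edges)
graph.inDegrees.show()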

Say hello, wave good-bye

Of course, a new major version number is a great time to make breaking changes. Here are a couple of changes that may cause issues:

  • Dropping support for versions of Hadoop prior to 2.2
  • Removing the Bagel graph library (the precursor to GraphX)

An important deprecation that you will almost certainly run across is the renaming of registerTempTable in SparkSQL. You should use createTempView instead, which makes it clearer that you're not actually materializing any data with the API call. Expect a gaggle of deprecation notices in your logs from this change.
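A small before-and-after sketch (the DataFrame and view name are hypothetical):

val df = spark.read.json("people.json")

// Before 2.0 you would have written df.registerTempTable("people");
// that still works, but it's deprecated and will log a warning.

// The new name makes clear that nothing is materialized; it only registers a view
df.createTempView("people")
spark.sql("SELECT count(*) FROM people").show()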

Should I rush to upgrade?

With promised large gains in performance and long-awaited new features in Spark Streaming, it's tempting to hit Upgrade as soon as Apache Spark 2.0 becomes generally available in the next few weeks.

I would temper that impulse with a note or two of caution. A lot has changed under the covers for this release, so expect some bugs to crawl out as people start running their existing code on test clusters.

Nonetheless, with a brace of new features and performance improvements, it's clear that Apache Spark 2.0 deserves its full version bump. Look for it in the next few weeks!

Copyright © 2016 IDG Communications, Inc.