
Review: Databricks makes big data dreams come true

Cloud-based Spark machine learning and analytics platform is an excellent, full-featured product for data scientists.


For those of you just tuning in: Spark is an open source cluster computing framework originally developed by Matei Zaharia at UC Berkeley's AMPLab in 2009 and later donated to the Apache Software Foundation. Part of the motivation for creating Spark was that MapReduce allows only a single pass through the data, while machine learning (ML) and graph algorithms generally need to perform multiple passes.

Spark is billed as a “fast and general engine for large-scale data processing,” with a tagline of “Lightning-fast cluster computing.” In the world of big data, Spark has been attracting attention and investment because it provides a powerful in-memory data-processing component within Hadoop that deals with both real-time and batch events. In addition to Databricks, Spark has been embraced by the likes of IBM, Microsoft, Amazon, Huawei, and Yahoo.

Spark includes MLlib for distributed machine learning and GraphX for distributed graph computation.

[Figure: The Spark ecosystem]

The Spark core supports APIs in R, SQL, Python, Scala, and Java. Additional Spark modules include Spark SQL and DataFrames; Streaming; MLlib for machine learning; and GraphX for graph computation.

MLlib is of particular interest in this review. It includes a wide range of ML and statistical algorithms, all tailored for the distributed memory-based Spark architecture. MLlib implements, among other items, summary statistics, correlations, sampling, hypothesis testing, classification and regression, collaborative filtering, cluster analysis, dimensionality reduction, feature extraction and transformation functions, and optimization algorithms. In other words, it’s a fairly complete package for data scientists.
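To give a flavor of what calling MLlib from a notebook looks like, here is a minimal sketch of the library's summary statistics and correlation utilities. It assumes a Databricks notebook, where the SparkContext is predefined as sc, and the handful of observations is invented purely for illustration.

```python
# A minimal sketch of spark.mllib's summary statistics and correlation
# utilities. Assumes a Databricks notebook, where the SparkContext is
# predefined as `sc`; the observations are invented for illustration.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

observations = sc.parallelize([
    Vectors.dense([1.0, 10.0, 100.0]),
    Vectors.dense([2.0, 20.0, 200.0]),
    Vectors.dense([3.0, 30.0, 300.0]),
])

summary = Statistics.colStats(observations)   # column-wise summary statistics
print(summary.mean())                         # per-column means
print(summary.variance())                     # per-column variances

# Pearson correlation matrix across the three columns
print(Statistics.corr(observations, method="pearson"))
```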

Zaharia and others from the UC Berkeley AMPLab founded Databricks in 2013. Databricks is still a major contributor to the Spark project.

Databricks offers a superset of Spark as a cloud service. There are three plans, tiered by the number of user accounts, type of support, SLAs, and so on.

The recently announced free Databricks Community Edition, which is what I used for this review, provides access to a 6GB microcluster, a cluster manager, and the notebook environment, so you can prototype simple applications. It’s much easier to try out something on Databricks Community Edition than it would be to set up a Spark cluster for development in your shop.

[Figure: Databricks architecture]

Databricks provides Spark as a cloud service, with some extras. It adds a cluster manager, notebooks, dashboards, jobs, and integration with third-party apps to the free open source Spark distribution.

Databricks provides several sample notebooks for ML problems. Databricks notebooks are not only similar to IPython/Jupyter notebooks, but are compatible with them for import and export purposes. I had no problem applying my knowledge of Jupyter notebooks to Databricks.

Spark MLlib vs. Spark ML

Before I go through my experience with a sample notebook, I should explain that there are two major packages in Spark MLlib. The spark.mllib package contains the original API, built on top of Resilient Distributed Datasets (RDDs), Spark's basic distributed memory abstraction; the spark.ml package provides a higher-level API, built on top of DataFrames, for constructing ML pipelines. In general, Databricks recommends that you use spark.ml and DataFrames when you can, and mix in spark.mllib and RDDs only for functionality (such as dimensionality reduction) that is not yet implemented in spark.ml.
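To make the distinction concrete, here is a minimal sketch of the DataFrame-based spark.ml style, assuming a Databricks notebook where sqlContext is predefined (as it was in Spark 1.6); the tiny labeled data set is invented for illustration.

```python
# A minimal sketch of the DataFrame-based spark.ml API: assemble feature
# columns into a vector, then fit an estimator. Assumes a Databricks
# notebook where `sqlContext` is predefined; the data is invented.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

df = sqlContext.createDataFrame(
    [(0.0, 1.0, 0.1), (1.0, 0.0, 2.3), (0.0, 1.2, 0.4), (1.0, 0.1, 1.9)],
    ["label", "x1", "x2"])

# spark.ml estimators expect a single vector column of features
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")

model = lr.fit(assembler.transform(df))
model.transform(assembler.transform(df)).select("label", "prediction").show()
```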

While Spark was new to me, the algorithms in Spark MLlib were very familiar. Like Microsoft Azure Machine Learning and IBM SPSS Modeler, Databricks gives you a wide assortment of methods that you can use as you please. Amazon Machine Learning, on the other hand, gives you one algorithm each for binary classification, multiclass classification, and regression. If you know what you’re doing around statistical model building, then having many methods to choose from is a good thing. If you’re a business analyst trying to get good predictions without knowing a lot about ML, then what you need is something that just works.

[Figure: Machine learning algorithms in Spark MLlib]

The newer spark.ml package, which uses DataFrames, and the older spark.mllib, which uses RDDs, implement an excellent selection of machine learning algorithms.

I worked through the MLPipeline Bike Dataset example. This notebook uses Spark ML pipelines to piece together the parts of a workflow, such as feature processing and model training, and performs model selection (aka hyperparameter tuning) with cross-validation to fine-tune a Gradient-Boosted Tree regression model. The figure below shows the training step running on 70 percent of the data.
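The notebook's own code is more elaborate, but the general shape of such a pipeline looks roughly like the sketch that follows. The column names (hr, temp, cnt) follow the public bike-sharing data set; the toy rows and the parameter grid are my own stand-ins, not the notebook's exact values.

```python
# A sketch of a spark.ml pipeline: feature assembly plus a Gradient-Boosted
# Tree regressor, tuned with cross-validation. The toy rows stand in for
# the bike data (hour of day, temperature, rental count); `sqlContext` is
# predefined in a Databricks notebook.
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

toy = sqlContext.createDataFrame(
    [(0, 9.8, 16.0), (2, 9.0, 8.0), (4, 8.2, 5.0), (6, 9.5, 110.0),
     (8, 13.6, 321.0), (9, 14.1, 250.0), (11, 16.8, 175.0), (12, 18.4, 190.0),
     (14, 19.0, 205.0), (17, 15.2, 414.0), (20, 12.9, 144.0), (23, 11.0, 28.0)],
    ["hr", "temp", "cnt"])
train_df, test_df = toy.randomSplit([0.7, 0.3], seed=42)

assembler = VectorAssembler(inputCols=["hr", "temp"], outputCol="features")
gbt = GBTRegressor(labelCol="cnt", featuresCol="features")
pipeline = Pipeline(stages=[assembler, gbt])

# Hyperparameter grid; the cross-validator fits one model per fold per combination
grid = (ParamGridBuilder()
        .addGrid(gbt.maxDepth, [2, 5])
        .addGrid(gbt.maxIter, [10, 50])
        .build())

evaluator = RegressionEvaluator(labelCol="cnt", metricName="rmse")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)

cv_model = cv.fit(train_df)   # the slow step: 3 folds x 4 parameter combinations
```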

[Figure: Training the pipeline in a Databricks notebook]

This live Databricks notebook, with code in Python, demonstrates one way to analyze a well-known public bike rental data set. In this section of the notebook, we are training the pipeline, using a cross-validator to run many Gradient-Boosted Tree regressions.

You’ll note that the fitting step was predicted to (and did) take 10 minutes to run using a Community Edition 6GB, single-node microcluster. You can scale paid versions of Databricks to unlimited numbers of nodes and hundreds of gigabytes of RAM, although there are complications to consider if you want to exceed 200GB of RAM for a single cluster. Scaling out allows you to analyze much more data, much faster. Of course, using larger clusters costs more: You pay 40 cents per hour per node in addition to your monthly subscription fee.

After training, the notebook runs predictions and evaluations on the remaining 30 percent of the data set.
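Continuing the sketch above, scoring the held-out split and computing an error metric takes only a couple of lines:

```python
# Continuing the sketch above: score the held-out split with the best
# cross-validated model and report the root mean squared error (RMSE).
predictions = cv_model.transform(test_df)
predictions.select("cnt", "prediction").show()

print("RMSE on held-out data: %g" % evaluator.evaluate(predictions))
```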

[Figure: Predicted vs. actual bike rentals]

After training on 70 percent of the bike rental data set, we run predictions from the best regression model and compare them to the actual values of the remainder of the data set. I have switched the display at the top of the image from the default table to a scatter chart of predicted versus actual rentals, with a local regression (LOESS) line that brings out the trend. If the correlation were nearly perfect, the LOESS line would be nearly straight.

This is the point where a statistician or data scientist would dive in and start plotting residuals, in preparation for tweaking the features, removing outliers, and refining the model.

Take a quick look at the time-of-day graph at the bottom of the figure above. The analysis in this notebook was oversimplified right from the beginning: the weekday and weekend/holiday data were lumped together. As you might expect, weekday bike rentals peak strongly at the morning and evening rush hours, while weekend/holiday rentals are more evenly spread throughout the day; the graph you see shows them mixed together. You can extract separate data sets, train them separately, and get much lower mean squared errors for each set. Of course, that’s real work, and I wouldn’t expect to see it in a demo notebook.
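For illustration only, assuming the feature DataFrame carries the bike data set's workingday flag (1 for weekdays, 0 for weekends and holidays), the split-and-retrain step would look roughly like this; full_df is a hypothetical stand-in for the complete feature DataFrame.

```python
# Hypothetical sketch: split on the bike data set's "workingday" flag and
# train a separate model for each regime. `full_df` stands in for the
# complete feature DataFrame; `cv` is the cross-validator defined earlier.
weekday_df = full_df.filter(full_df.workingday == 1)
weekend_df = full_df.filter(full_df.workingday == 0)

weekday_model = cv.fit(weekday_df)
weekend_model = cv.fit(weekend_df)
```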

Easy as data science

My colleague Andy Oliver has suggested that Databricks is trying to compete with Tableau. I disagree. Databricks knows that its version of Jupyter notebooks is not in the same league as Tableau for ease of use, and the company has integrated with Tableau (and Qlik, for that matter) through the Databricks API.

Tableau is designed to be an exploratory business intelligence product that is simple enough for everyone at a company. Databricks, on the other hand, is designed to be a scalable, relatively easy-to-use data science platform for people who already know statistics and can do at least a little programming. I simply can’t imagine putting a business analyst in front of a Databricks notebook and asking her to build a prediction model from a terabyte of data held in Amazon S3 buckets. I’d have to train her in SQL and either Scala, R, or Python, then teach her about the Spark data formats and libraries.

No, I see Databricks as competing with IBM Watson and SPSS Modeler, Microsoft Azure Machine Learning, and Amazon Machine Learning. Meanwhile, IBM, Microsoft, and Amazon have all adopted Spark in their clouds and are contributing to the Apache Spark product. The relationship is probably coopetition, not pure competition.

Andy Oliver noticed some security flaws in Databricks notebooks when they were introduced in the spring of 2015. I haven’t seen similar issues in my brief hands-on review, but I can’t prove a negative.

I bumped into a few glitches, but I’m working with a beta product, so I expected some. The worst bug I saw: One of Databricks’ demo notebooks failed to run to completion on an autostarted cluster that happened to be running an older version of Spark. Once I deleted that cluster and started a new cluster with Spark 1.6, a matter of less than a minute, the notebook ran without errors.

Overall, I see Databricks as an excellent product for data scientists, with a full assortment of ingestion, feature selection, model building, and evaluation functions. It has great integration with data sources and excellent scalability. Understood as a product that assumes its users can program, it has very good ease of development. Certainly the introduction of the free Community Edition takes most of the pain and risk out of trying the platform.

InfoWorld Scorecard: Databricks with Spark 1.6
  • Variety of models (25%): 10
  • Ease of development (25%): 9
  • Integrations (15%): 9
  • Performance (15%): 9
  • Additional services (10%): 8
  • Value (10%): 9
  • Overall Score (100%): 9.2
At a Glance
  Pros
    • Makes it almost effortless to spin up and scale out Spark clusters
    • Provides a wide range of ML methods for data scientists
    • Offers a collaborative notebook interface using R, Python, Scala, and SQL
    • Free to start and inexpensive to use
    • Easy to schedule jobs for production

  Cons
    • Not as easy to use as a BI product, although it integrates with several BI products
    • Assumes that the user is familiar with programming, statistics, and ML methods

Copyright © 2016 IDG Communications, Inc.