Yahoo schools Spark on deep learning

CaffeOnSpark works with the big data platform to create applications like speech or image recognition

Yahoo has built a deep learning system for creating predictive applications such as speech or image recognition. While open source projects like Google TensorFlow and Microsoft CNTK already deliver such systems, Yahoo stands apart by leveraging a major force in big data processing: Spark.

CaffeOnSpark, introduced yesterday in a blog post, builds on the Caffe deep-learning framework developed by the Berkeley Vision and Learning Center (Yahoo is one of its sponsors).

Spark features an array of machine learning algorithms, introducing new ones in each successive revision. But deep learning -- training a neural net with a mass of data and using it to make decisions -- isn't part of its portfolio.

CaffeOnSpark addresses that by accepting data prepared by a Spark application and allowing Spark to extract the resulting predictions through SQL queries or its other machine learning libraries.

CaffeOnSpark melds Spark's in-memory processing with the Caffe deep learning framework, allowing Spark users to train models on their datasets and derive insights from those models via a system they're already comfortable with.

The Spark and Caffe nodes can sit side by side on the same hardware, meaning the data doesn't have to be moved around as much and thus can be processed faster. Training jobs can also have their state periodically checkpointed, so a long-running job can be paused and resumed, or recovered in the event of a crash.
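The checkpointing idea can be illustrated with a small, generic sketch. This is not CaffeOnSpark's own code or API; the function names (train, save_checkpoint), the JSON file format, and the toy "weight" update are all assumptions made purely for illustration. It shows only the general pattern: save state every so often, and on startup pick up from the last saved state if one exists.

```python
import json
import os


def save_checkpoint(state, path):
    """Write the state to a temp file, then rename it into place.

    The rename keeps a crash in mid-write from corrupting the last good
    checkpoint. (Hypothetical helper, not part of CaffeOnSpark.)
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(total_steps, checkpoint_path, checkpoint_every=10, stop_after=None):
    """Toy training loop that checkpoints periodically and can resume.

    stop_after simulates a pause or crash at a given step.
    """
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    state = {"step": 0, "weight": 0.0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    while state["step"] < total_steps:
        # Stand-in for one real training iteration.
        state["weight"] += 0.5
        state["step"] += 1

        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, checkpoint_path)

        if stop_after is not None and state["step"] >= stop_after:
            break  # simulated interruption

    return state
```

In this sketch, a job interrupted at step 25 with a checkpoint interval of 10 would restart from step 20 rather than from zero, so only the work done since the last checkpoint is repeated.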

Launching applications and running processing in CaffeOnSpark are done by way of the existing Spark command set, for the sake of familiarity. But CaffeOnSpark instances running on different nodes don't communicate with each other through Spark. Instead, they use their own system, MPI, which can be routed over Ethernet or RDMA/InfiniBand, to avoid bottlenecks.

The biggest advantage to CaffeOnSpark is its use of an existing big data processing tool that's already achieved a great deal of user and developer momentum. Google and Microsoft tout ease of use as chief advantages of their solutions, but familiar tools always help the transition to a new workflow or data paradigm, especially given Spark's reputation for accessibility and simplicity.

Serdar Yegulalp is a senior writer at InfoWorld, focused on machine learning, containerization, devops, the Python ecosystem, and periodic reviews.