By Andrew C. Oliver, Columnist, InfoWorld

Forgot about Mahout? It’s back, and worth your attention

Mahout is a vibrant machine learning project that is now riding Spark instead of MapReduce for the algorithmically inclined

My tough life required me to fly to Miami and attend ApacheCon. I happened across a talk by Trevor Grant, an open source technical evangelist for the financial services sector, on Mahout. I thought, “Wait, isn’t Mahout dead?” Apparently not. In fact, Mahout is very much alive, nothing like what you once knew of it, and now running on GPUs.

Mahout was the original machine learning framework for Hadoop. When MapReduce was the thing, Mahout was the vaunted elephant rider. But then, as Grant recalls, “Mahout 0.9 released and all the Hadoop vendors froze at 0.9+. It was 0.9 with some bug patches. No one ever bumped up to 0.10.”

Nonetheless, the Mahout project is still active. “A lot of the projects have people paid to work on them, but Mahout doesn’t. We’re like a bunch of gypsies that wander around in companies like the MapRs of the world,” Grant says. “All the Mahout and former Mahout people are in very, very high places in Fortune 500 companies or CTOs of startups, but we don’t have a company of our own. Lucidworks is the closest thing. I didn’t realize but there are a lot of Mahout committers and PMCs [project management committees] kind of lurking about at Lucidworks.” (Full disclosure: I didn’t realize this either, even though I work for Lucidworks. —AO.)

The advantages of the Mahout you don’t know

Under the guidance of those “gypsies,” Mahout developed some unique advantages. First, it was made engine-neutral. Although Spark is the recommended engine, Mahout supports other engines and lets you write bindings to your own favored engine.

“GPU integration is the other big huzzah, the big sexy thing that we’ve got going on right now,” Grant says. “You can accelerate Spark, Flink, or any JVM-distributed engine; you get GPU acceleration for free. This is a big win.”

Unlike other tools, “Mahout is primarily about writing your own algorithms quickly and efficiently and mathematically expressively so you can read and other people can see what you’ve done—and the code makes sense,” Grant says. If MLlib has exactly what you’re looking for, great. But if it doesn’t, you’ll find it difficult to extend. On the other hand, while working on your algorithms in Python or R, you may find that Python isn’t so great in production. Mahout gives you Scala that you write with paradigms more familiar to R or Python.

Grant says Mahout’s “quintessential use case is that you read an academic journal article on Monday morning. You spend Monday afternoon grokking it and how it works. On Tuesday, you open up Mahout and start implementing the algorithm. By Tuesday afternoon, the algorithm is working, and you’re testing and making sure it works the way you think it is going to. On Wednesday morning, you’re writing docs and unit tests. On Wednesday afternoon, you have an algorithm in production.”

Another advantage of Mahout is its integration with Zeppelin, which also lets you use R and Python visualization tools like ggplot2 or pyplot rather than rolling your own visualization. If you’re playing with your data and algorithms, having visualization tools available rather than starting from scratch in Scala is important.

An example of what Mahout is really good at

If you’re starting out and looking to learn, Mahout has a few interesting “hello world”-style tutorials. Grant says that “the ‘hello world’ of Mahout is ordinary least-squares regression. It’s an algorithm, but it is still a fairly simple one. It’s easy and well documented. In maybe six to 12 lines of code you can implement ordinary least-squares regression in Mahout.”
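To make the "hello world" concrete, here is a minimal sketch of ordinary least-squares regression solved through the normal equations, (X^T X) b = X^T y. It is written in plain Python for illustration only and is not Mahout's Scala DSL or its API; the function name `ols_fit` and the sample data are assumptions made up for this example.

```python
def ols_fit(xs, ys):
    """Fit y = b0 + b1*x1 + ... + bk*xk by solving (X^T X) b = X^T y."""
    # Prepend a constant 1.0 to each row so the first coefficient is the intercept.
    rows = [[1.0] + [float(v) for v in x] for x in xs]
    p = len(rows[0])

    # Build the augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(p)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(p):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]

    # After elimination the left block is diagonal, so each coefficient
    # is the right-hand side divided by its diagonal entry.
    return [aug[i][p] / aug[i][i] for i in range(p)]


if __name__ == "__main__":
    # Hypothetical data generated from y = 1 + 2*x1 + 3*x2 with no noise.
    features = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
    targets = [1, 3, 4, 6, 8]
    print(ols_fit(features, targets))
```

The sketch forms X^T X and X^T y and then solves a small linear system. A distributed engine would presumably split the work the same way, computing the two products in parallel and solving the small system locally, but that is an assumption here, not something the article spells out.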

Once you’ve gotten that far, there is more to try. “Another really good one that you’ve probably seen elsewhere is an alternating least-squares (ALS)-based recommender tutorial. The problem with ALS is that it is single-modal. It’s [based on] ratings, and you can make an adjustment on similar ratings from another person with a similar rating,” Grant says. “That’s great except in the real world you have a lot more information like user profile data, age, gender, and viewing habits. You’re throwing a lot of that out when you’re single-modal.

“ALS is just a matrix factorization, and we definitely have something to do that matrix factorization. But we also have correlated co-occurrence algorithms that are multimodal. So, for example, they all need to have the same user space but let’s say your primary action is buying a product but we also have information about page views. You viewed a bunch of products and added products to cart. There are a bunch of things that are product-focused, but we also have your gender, location, total lifetime buy, and favorite color or whatever. [From all that, Mahout] will generate recommendations about correlated co-occurrences, and you’re capturing all of that much richer set of information.”
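The multimodal idea Grant describes can be sketched in a few lines. The toy example below takes a primary action (purchases) and a secondary action (page views) over the same set of users, and scores each (purchased item, viewed item) pair with a log-likelihood ratio test, keeping only pairs that co-occur more than chance would suggest. This is a single-machine Python illustration, not Mahout's implementation. The function names, the `min_llr` threshold, and the sample data are all assumptions invented for the example.

```python
from math import log


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of user counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def cross_occurrence(primary, secondary, min_llr=1.0):
    """Score (primary item, secondary item) pairs over a shared user space.

    primary and secondary map each user to the set of items they acted on.
    Returns {(primary_item, secondary_item): llr_score} for pairs that
    co-occur at least once and meet the (hypothetical) threshold.
    """
    users = set(primary) | set(secondary)
    n = len(users)
    p_items = set().union(*primary.values())
    s_items = set().union(*secondary.values())

    scores = {}
    for a in p_items:
        users_a = {u for u in users if a in primary.get(u, ())}
        for b in s_items:
            users_b = {u for u in users if b in secondary.get(u, ())}
            k11 = len(users_a & users_b)   # did both actions
            k12 = len(users_a) - k11       # bought a, did not view b
            k21 = len(users_b) - k11       # viewed b, did not buy a
            k22 = n - k11 - k12 - k21      # did neither
            score = llr(k11, k12, k21, k22)
            if k11 > 0 and score >= min_llr:
                scores[(a, b)] = score
    return scores


if __name__ == "__main__":
    # Hypothetical interaction data sharing one user space.
    purchases = {"u1": {"phone"}, "u2": {"phone"}, "u3": {"laptop"}, "u4": {"laptop"}}
    views = {"u1": {"case"}, "u2": {"case"}, "u3": {"bag"}, "u4": {"bag"}}
    for pair, score in sorted(cross_occurrence(purchases, views).items()):
        print(pair, round(score, 3))
```

The same scoring could be repeated for each additional signal, such as cart additions or location, against the primary action. That is what makes the approach multimodal rather than ratings-only.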

If you might port engines one day (Spark isn’t forever) or need to write your own or tweak your algorithms and want GPU acceleration—and you want to do this in a maintainable way at scale—maybe Mahout is your easy rider. However, you’ll need to download your own copy rather than use the rusty one in your favorite Hadoop distribution. (The Mahout people would like you to forget they ever knew what MapReduce is.)

Mahout isn’t dead; it is a vibrant project that is now riding Spark instead of MapReduce. As a developer, I’ll probably wait around until someone else writes all the algorithms. But if I were more “mathy,” I’d be taking a hard look at Mahout right now.


Andrew C. Oliver is a columnist and software developer with a long history in open source, databases, and cloud computing. He founded Apache POI and served on the board of the Open Source Initiative. Oliver has helped with marketing in startups including JBoss, Lucidworks, and Couchbase. He advises startups on marketing, growth, and outreach.