Hadoop co-creator: Spark is great -- but people want more

Doug Cutting anticipates growth ahead and opportunities all around for the Hadoop ecosystem

Ten years after its creation, the Hadoop ecosystem is sprawling and ever-transforming. InfoWorld's Andy Oliver went as far as to say, "The biggest thing you need to know about Hadoop is that it isn’t Hadoop anymore" -- at least, not Hadoop as we once knew it.

Hadoop co-creator Doug Cutting, now with Cloudera, sees all this change as not only a positive development, but as vindication of Hadoop's open source origins and design.

In a phone conversation with InfoWorld, Cutting noted "having a loose confederation of a lot of open source projects permits evolution at a fundamental level." In this model, the market determines which components are used.

Over time, individual parts of Hadoop's ecosystem have grown beyond Hadoop itself. Case in point: Spark, the fast, in-memory data-processing framework, has developed an independent following.

However, Cutting believes the rest of Hadoop provides a lot that Spark can't lay claim to. "Spark is a great execution engine," he said, "and that's where we see most Spark adoption, as an execution engine on top of HDFS." (In such deployments, Spark typically replaces the older MapReduce engine, with YARN or Mesos serving as the cluster scheduler.)

But Cutting notes, "There's a lot of things Spark isn't." For instance, it isn't a full-text search engine; Solr assumes that role in the Hadoop world. One can run SQL queries against Spark, but it isn't designed to be an interactive query engine; for that, Cutting said, there's Impala.

"If all you need is streaming programming or batch programming, and you need an execution engine for that, Spark is great. But people want to do more things than that -- they want to do interactive SQL, they want to do search, they want to do various sorts of real-time processing involving systems like Kafka.... I think anyone who says 'Spark is the whole stack' is doing a necessarily limited number of things."

Another change over the years -- by necessity -- concerns security. Because of its origins as an internal tool within Yahoo, Hadoop had no real security to begin with, especially not the finer-grained, RBAC-style safeguards required of enterprise-grade products these days. "Folks building Web search engines and such tended to do security-by-firewall," said Cutting, but he noted that Hadoop's security is now fine-grained enough to include ACLs for tables or cells within tables.

Given Hadoop's evolution, what does this mean for protecting data already in the system? "What we've seen more often," said Cutting, "is that folks are required to address [data security] by their organization before they put something into production, before they store the data. That's been a limiter on what sorts of things people build." Now that Hadoop has more security features, he said, "it can be used in more places."

Cutting mentioned two other limiters for Hadoop adoption: the skill sets of the users, and the rates at which enterprises build new systems. "Not everybody is up to speed on the tools," he said of the former, and of the latter, "[Enterprises] mostly run existing systems; they don't rewrite everything every year, so those things take time as well."

Despite these obstacles, Cutting remains confident that the constant activity within the Hadoop ecosystem will keep it healthy. The Kudu storage engine, developed by Cloudera to combine features of HBase and HDFS, "shows how the ecosystem can evolve."

Though it's still technically alpha, some of Cloudera's customers are already using it in production. Cutting also noted that Kudu has been integrated with other engines in the Hadoop ecosystem, including Apache Drill (which isn't included in Cloudera's distribution).

"That other people have been voting to embrace [Kudu] is a real vote that it's something of interest," said Cutting.

Copyright © 2016 IDG Communications, Inc.