Big data brawlers: 4 challengers to Spark

Spark isn't the only option for handling big data at scale and in memory. These four contenders take on both stream processing and batch jobs.

Big (and even not so big) data hasn't been the same since Apache Spark made inroads with developers and became a staple ingredient in big data clouds.

But Spark is far from perfect. It's certainly improving, as version 2.0 shows, but if a competitor offers a better handle on what Spark does and more, developers will pay attention.

Here are four projects emerging as possible competition for Spark, with new approaches to handling the conventional in-memory batch processing Spark is famous for and the streaming Spark continues to work on.

Apache Apex

What it is: Originally created by DataTorrent, Apex has since been donated to the Apache Software Foundation. It performs both stream and batch processing on Hadoop under YARN.

How it competes: Apex's streaming is the real deal, handling each event as it arrives, while Spark's "streaming" is actually a microbatch system that collects events into small batches before processing them. Apex also has native support for fault tolerance by way of Hadoop -- though that means Apex and Hadoop are tightly coupled, whereas Spark can work with or without Hadoop. And Apex doesn't yet have Spark's machine learning features.
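To make the streaming-versus-microbatch distinction concrete, here is a minimal sketch in plain Python. It uses neither Apex's nor Spark's API; the function names, the one-second interval, and the sample events are all assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple

Event = Tuple[float, int]  # (arrival time in seconds, value)


def process_per_record(events: List[Event],
                       handle: Callable[[List[int]], int]) -> List[Tuple[float, int]]:
    """Per-record streaming: each event is handled at its own arrival time."""
    return [(t, handle([v])) for t, v in events]


def process_microbatch(events: List[Event],
                       handle: Callable[[List[int]], int],
                       interval: float) -> List[Tuple[float, int]]:
    """Microbatch: events are buffered and handled together when each interval closes."""
    results = []
    batch: List[int] = []
    window_end = None
    for t, v in events:
        if window_end is None:
            # Align the first window to a multiple of the interval.
            window_end = (t // interval + 1) * interval
        # Close every window that ended before this event arrived.
        while t >= window_end:
            if batch:
                results.append((window_end, handle(batch)))
                batch = []
            window_end += interval
        batch.append(v)
    if batch:
        results.append((window_end, handle(batch)))
    return results


# Hypothetical sample input: four events spread over three seconds.
events = [(0.1, 1), (0.4, 2), (1.2, 3), (2.7, 4)]
per_record = process_per_record(events, sum)
micro = process_microbatch(events, sum, interval=1.0)
```

In this sketch the per-record path produces one result per event, stamped with the event's arrival time, while the microbatch path produces one result per interval, stamped with the time the interval closes. That gap between arrival and output is the latency cost microbatching pays.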

Heron

What it is: Twitter's replacement for the Apache Storm stream-processing framework, Heron is now available as an open source project. Consider this a contender for Spark streaming.

How it competes: Heron runs streaming jobs via containers managed through a scheduler. As a result, it not only scales more readily than other solutions but is also easier to debug, deploy, and keep running well on clusters. It's also designed to appeal to existing Storm users, since it's compatible with the Storm API and shares many of Storm's concepts (such as "spouts," which emit data into a job, and "bolts," which process it).
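For readers new to the Storm vocabulary, the spout-and-bolt idea can be sketched with plain Python generators. This is not the Storm or Heron API; the function names and the sample sentences are hypothetical, and a real topology would run each stage in parallel across a cluster rather than in one process.

```python
from collections import Counter
from typing import Iterable, Iterator


def sentence_spout() -> Iterator[str]:
    """Spout: a source that emits data into the topology."""
    yield from ["spark and flink", "heron and storm", "storm and spark"]


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Bolt: consumes sentences and emits one word at a time."""
    for sentence in sentences:
        yield from sentence.split()


def count_bolt(words: Iterable[str]) -> Counter:
    """Bolt: consumes words and keeps a running count of each."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
    return counts


# Wire the stages together: spout -> split bolt -> count bolt.
word_counts = count_bolt(split_bolt(sentence_spout()))
```

Chaining the functions stands in for the wiring step: data flows from the spout through each bolt in turn.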

Apache Flink

What it is: Apache Flink is a stream-processing framework that competes with Apache Storm as much as with Spark.

How it competes: Like Apex, Flink puts streaming first, and it uses a true streaming model rather than Spark's streaming via microbatch. Flink also has provisions for performing iterative or repeated processing on streams, and it includes Spark-like features, such as machine learning and graph processing libraries. But Flink is still a relatively new project, having hit 1.0 earlier this year.
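One way to picture iterative processing on a stream is as a feedback loop: a record that isn't finished after one pass is sent back to the start of the loop. The sketch below is plain Python on a small finite input, not Flink's API; the `step` and `done` functions and the sample values are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple


def iterate_stream(values: Iterable[int],
                   step: Callable[[int], int],
                   done: Callable[[int], bool]) -> List[Tuple[int, int]]:
    """Apply `step` to each record, feeding unfinished records back into the loop.

    Returns (final value, number of passes) in the order records finish.
    """
    pending = deque((v, 0) for v in values)
    finished = []
    while pending:
        value, rounds = pending.popleft()
        value, rounds = step(value), rounds + 1
        if done(value):
            finished.append((value, rounds))   # record leaves the loop
        else:
            pending.append((value, rounds))    # record is fed back for another pass
    return finished


# Hypothetical job: subtract 3 from each number until it is no longer positive.
result = iterate_stream([5, 2, 7], step=lambda v: v - 3, done=lambda v: v <= 0)
```

Records that need fewer passes leave the loop first, so the output order differs from the input order.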

Onyx

What it is: Onyx is a "masterless, cloud scale, fault tolerant, high performance distributed computation system," according to its documentation, with both batch and stream processing models.

How it competes: Written in the functional language Clojure rather than Scala, Onyx puts streaming first -- batch operations are basically implemented as ministreams. Onyx also allows the developer to use language constructs in either Clojure or Java, such as Clojure's vectors and maps, to define how data is processed. (Many of Onyx's goals were laid down before the code was even created.) If Onyx catches on, it'll most likely be due to Java's existing popularity rather than Clojure's expressiveness.
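The idea of describing a job with ordinary data structures rather than framework-specific classes can be sketched as follows. This is Python rather than Clojure, and it is not Onyx's API: the key names, the task names, and the tiny interpreter are all hypothetical, with lists and dicts standing in for Clojure's vectors and maps.

```python
from typing import Callable, Dict, List

# Registry of plain functions that tasks can refer to by name (hypothetical).
FUNCTIONS: Dict[str, Callable[[int], int]] = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
}

# The job itself is only data: a list of dicts describing each task...
catalog = [
    {"name": "inc", "fn": "inc"},
    {"name": "double", "fn": "double"},
]

# ...and a list of [source, destination] pairs describing how data flows.
workflow = [["in", "inc"], ["inc", "double"], ["double", "out"]]


def run_job(workflow: List[List[str]],
            catalog: List[Dict[str, str]],
            segments: List[int]) -> List[int]:
    """Interpret a linear workflow by passing each input through the tasks in order."""
    tasks = {entry["name"]: FUNCTIONS[entry["fn"]] for entry in catalog}
    next_task = {src: dst for src, dst in workflow}
    results = []
    for segment in segments:
        node = next_task["in"]
        while node != "out":
            segment = tasks[node](segment)
            node = next_task[node]
        results.append(segment)
    return results


output = run_job(workflow, catalog, [1, 2, 3])
```

Because the job description is plain data, it can be built, inspected, or modified by ordinary code before it runs. This sketch handles only a straight-line workflow with no branching.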

Serdar Yegulalp is a senior writer at InfoWorld, focused on machine learning, containerization, devops, the Python ecosystem, and periodic reviews.