Apache Arrow aims to speed access to big data

Apache's new project leverages columnar storage to speed data access not only for Hadoop but potentially for every language and project with big data needs

A new top-level project at the Apache Software Foundation seeks to provide a fast in-memory data layer for an array of open source projects, both under Hadoop's umbrella and outside it.

The Apache Arrow project transforms data into a columnar, in-memory format -- so that it's far faster to process on modern CPUs -- and provides it to a variety of applications via a single, consistent interface.

Arrow was developed by employees from a number of companies behind various open source efforts: Cloudera, Databricks, DataStax, Salesforce, Twitter, and others. Among the companies involved is Dremio, a startup founded by ex-employees of Hadoop company MapR who also helped create Apache Drill.

Data, get in line

Columnar storage is used in big data applications to accelerate searching and sorting large amounts of information, but it's been up to individual big data framework components to support it. (The Apache Parquet project provides support for columnar storage in Hadoop.)
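As a rough, hypothetical illustration of that pairing (the article includes no code; the pyarrow library, file name, and column names below are assumptions), data stored in Parquet's on-disk columnar format can be read directly into Arrow's in-memory columnar format:

```python
# A minimal sketch assuming the pyarrow library; the file name and column
# names are invented for illustration.
import pyarrow.parquet as pq

# Read a Parquet file (columnar on disk) into an Arrow Table (columnar in memory).
table = pq.read_table("events.parquet", columns=["user_id", "latency_ms"])

# Each column is held as a contiguous, typed Arrow array.
print(table.schema)
print(table.column("latency_ms"))
```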

With Arrow, applications can access a columnar version of a dataset simply by asking Arrow for it. Data transformed by Arrow can theoretically be processed much faster, since Arrow exploits the SIMD (Single Instruction Multiple Data) instruction sets of modern CPUs to speed up processing the data. Sets of data too big to fit in memory all at once are broken into batches, with the batches sized to fit the CPU's cache layers.
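To make that layout concrete, here is a minimal sketch (assuming Arrow's Python bindings via pyarrow; the column data is invented) of building a columnar record batch, the contiguous, typed arrays being what vectorized CPU instructions scan:

```python
# A minimal sketch assuming pyarrow; the data here is invented for illustration.
import pyarrow as pa

# Each column is a contiguous, typed array, the layout that lets SIMD
# kernels process many values per CPU instruction.
ids = pa.array([1, 2, 3, 4], type=pa.int64())
prices = pa.array([9.99, 4.25, 7.10, 3.50], type=pa.float64())

# A RecordBatch groups equal-length columns; large datasets are handled as a
# sequence of such batches, each small enough to stay resident in CPU cache.
batch = pa.RecordBatch.from_arrays([ids, prices], names=["id", "price"])
print(batch.num_rows, batch.schema)
```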

According to its creators, the big boon of Arrow is not only that it speeds up any one big data project, but that multiple Arrow-compatible projects can use it as a common data interchange mechanism. Instead of serializing, moving, and deserializing any given dataset between projects -- with all of the overhead and slowness implied -- applications that use Arrow can trade data directly in Arrow's format.

If two applications are on the same physical node, they can access Arrow data by way of shared memory. This speeds up data access, since the applications are no longer making redundant copies of the data.
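As a rough sketch of that interchange (again assuming the pyarrow bindings; the file path and data are hypothetical), one process can write a record batch in Arrow's IPC format and another can memory-map it and read it back without copying the underlying buffers:

```python
# A minimal sketch assuming pyarrow; the file path and data are hypothetical.
import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "label"],
)

# Producer: write the batch to a file in Arrow's IPC format.
with pa.OSFile("shared.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, batch.schema) as writer:
        writer.write_batch(batch)

# Consumer: memory-map the same file and read the batch without copying its buffers.
with pa.memory_map("shared.arrow", "rb") as source:
    reader = pa.ipc.open_file(source)
    print(reader.get_batch(0).to_pydict())
```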

A little something for everyone

According to Julien Le Dem -- PMC member of Arrow, architect at Dremio, and VP of Apache Parquet -- columnar optimization has long been used in commercial products like Oracle's databases and SAP HANA. But the open source big data space has, Le Dem claimed, done very little with this type of technology to date.

Many projects within the Hadoop ecosystem are already preparing Arrow support. But Arrow aims to serve more than the Hadoop ecosystem, connecting entire software ecosystems -- the Python language, for instance.

Arrow's creators claim this is already happening. Le Dem says the plan includes specific projects (Spark, HBase, Cassandra, Pandas, Parquet, Drill, Impala, and Kudu, for openers), as well as bindings for entire languages. C, C++, Python, and Java bindings are available now, with R, Julia, and JavaScript to follow. "This is not just about Hadoop," Le Dem said. "[The participants] are engaged across a wide spectrum of different projects, so this is relevant to an extraordinarily large number of [them]."

In that light, Arrow is an example of open source big data technologies growing beyond Hadoop. They start with Hadoop, perhaps, but aren't necessarily confined to it.

That said, Arrow and Hadoop unquestionably have a future together; Le Dem noted that three of the leading commercial Hadoop distributions -- Cloudera, Hortonworks, and MapR -- have all committed to working on Arrow.

[This article was updated to amend information about the creators of Apache Arrow, as well as Julien Le Dem's titles and positions.]
