Why the New Apache Arrow Project Matters

The core reason for implementing in-memory technology is to improve performance. To help accelerate adoption of in-memory technologies and provide a universal standard for columnar in-memory processing and interchange, the lead developers of 13 major open source big data projects have joined forces to create Apache Arrow (http://arrow.apache.org), a new top level project within the Apache Software Foundation (ASF).

"No matter what system you are looking at, when you are talking about analytical experience, everybody is asking for faster performance. It doesn’t matter how fast the system goes, they are always asking for something faster,” said Jacques Nadeau, vice president of Apache Arrow, as well as CTO of stealth data analytics company Dremio. “We recognized this need in multiple projects that are operating in big data and open source.” 

A high performance cross-system data layer for columnar in-memory analytics, Apache Arrow is said to accelerate the performance of analytical workloads by more than 100x in some cases, as well as support multisystem workloads by eliminating cross-system communication overhead.

The Arrow project grew from three interconnected analytical trends and requirements. First, columnar data storage is gaining ground for analytical workloads, led in part by the creation of Apache Parquet and other technologies. Second, in-memory systems such as SAP HANA and Spark accelerate analytical workloads by holding data in memory, part of a trend for people to be increasingly unwilling to wait for data. And third, real-world objects are easier to represent as hierarchical and nested data structures, leading to an increase of formats such as JSON and document databases.

At the end of the day, the goal is for users’ workloads to go faster when they use Arrow and for them to be able to use Arrow naturally as it gets adopted into the big data systems that they are already comfortable using, Nadeau explained. In addition to workloads going faster using Arrow, a second benefit, he said, is that as systems adopt Arrow independently, it will also allow those systems to work better in collaboration. “It is really about, one, making each system faster on its own; and two, this ability to move data between systems."

Key components of Arrow include a set of defined data types, including both SQL and JSON types; canonical columnar in-memory representations; common Arrow-aware companion data structures; IPC (inter-process communication) through shared memory, TCP/IP and RDMA (remote direct memory access); libraries for reading and writing columnar data in multiple languages; pipelined and SIMD (single instruction, multiple data)algorithms for various operations; columnar in-memory compression techniques; and tools for short-term persistence to non-volatile memory, SSD or HDD.

Initially seeded by code from the Apache Drill project, Apache Arrow was built on top of a number of open source collaborations, and aims to become the de facto standard for columnar in-memory processing and interchange.  Beyond the initial support and input from the lead developers of 13 major open source big data projects, it is expected that additional companies and projects will adopt and leverage the technology in the coming months, and contribute back to the project, Nadeau said.

Code committers to Apache Arrow include developers from Apache Big Data projects Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu (incubating), Parquet, Phoenix, Spark, and Storm as well as established and emerging open source projects such as Pandas and Ibis. By attracting broad support and contributions from the big data and open source community, it is hoped the technology will become “first class and world-defining," said Nadeau.

According to Nadeau, the roadmap for the project is starting with initial support for languages such as Java, C++, C, and Python, and will include others such as R, Julia, and JavaScript.  In addition, it is expected that other technologies will become “Arrow-enabled,” including Drill, Impala, Kudu, Spark, and Ibis.  “We expect substantial additional engagement from open source as well as commercial technologies,” said Nadeau. In order for Arrow to become “foundational component” as a standardized in-memory representation, there needs to be a consensus built around it, he said.


Related Articles

The Apache Arrow project is a standard for representing data for in-memory processing.Hardware evolves rapidly. Because Apache Arrow was designed to benefit many different types of software in a wide range of hardware environments, the project team focused on making the work "future-proof," which meant anticipating changes to hardware over the next decade.

Posted June 14, 2017

The Apache Arrow project is a standard for representing columnar data for in-memory processing, which has a different set of trade-offs compared to on-disk storage. In-memory, access is much faster and processes optimize for CPU throughput by paying attention to cache locality, pipelining, and SIMD instructions.

Posted September 12, 2017


Newsletters

Subscribe to Big Data Quarterly E-Edition