Apache Arrow Design and the Future of Hardware

By Jacques Nadeau

Jun 14, 2017

The Apache Arrow project is a standard for representing data for in-memory processing. The project is designed to be used internally by other software projects for data analytics. It is not uncommon for users to see 10x-100x improvements in performance across a range of workloads when using this standard.

Hardware evolves rapidly. Because Apache Arrow was designed to benefit many different types of software in a wide range of hardware environments, the project team focused on making the work “future-proof,” which meant anticipating changes to hardware over the next decade.

What is Apache Arrow?

Columnar data structures provide a number of performance advantages over traditional row-oriented data structures for analytics. These include benefits for data on disk - fewer disk seeks, more effective compression, faster scan rates - as well as more efficient use of CPU for data in memory. Apache Arrow standardizes an efficient in-memory columnar representation that is the same as the wire representation. It includes first class bindings in many projects, including Spark, Hadoop, R, and Python/Pandas.

The trade-offs for processing columnar data in memory are different than on disk. For data on disk, usually IO dominates latency, which can be addressed with more compact encodings and compression, at the cost of CPU. In-memory access is much faster and CPU becomes the scarce resource. To maximize in-memory processing efficiency Apache Arrow focuses on cache locality, pipelining, and SIMD instructions.

How is Apache Arrow Used Today

The Apache Arrow project team includes members from many different software communities, all working together to improve the efficiency of data analytics. When different programs need to interact with data - within and across languages - there are inefficiencies in the handoffs that can dominate the overall cost. The team viewed these handoffs as a significant bottleneck for in-memory processing, and set out to develop a common set of interfaces that remove unnecessary serialization and deserialization when marshalling data.

Apache Arrow standardizes an efficient in-memory columnar representation that is the same as the wire representation. Today, it includes first-class bindings in many projects, including Spark, Hadoop, R, and Python/Pandas. When using Apache Arrow, any of these programs can efficiently access, process, and share data structures without the overhead of serializing and deserializing into different in-memory structures.

Vectorized processing

Since it was popularized by MonetDB, vectorized processing is the state of the art to speed up data processing on modern hardware. This model requires columnar representation of the data to take advantage of the optimizations embedded in processors.

How does it work? First, the CPU doesn’t execute one instruction at a time anymore. Instead, CPUs now process multiple instructions in parallel. To take advantage of instruction-level parallelism, each instruction is broken up into a series of steps in the processor pipeline. Their execution is staggered, meaning the next instruction is started before the previous one is finished.

This works as long as there are no dependency between the instructions. If there are dependencies, the CPU waits for the entire pipeline to be finished before we starting the next one, which can waste many processor cycles. To work around this, the processor has a branch prediction unit that optimistically guesses which branch to follow. Vectorized processing on columnar data allows branch prediction algorithms to make all branching decisions outside of the inner processing loop, which consumes a negligible number of cycles while achieving maximum throughput.

The second way vectorized processing works is by speeding up how the CPU processes data from CPU cache. here are several layers of cache embedded in the CPU. As long as a process is working on data in this cache, then it is very fast. But when the process has to fetch data from main memory, there is a pause as processing slows down significantly (yet orders of magnitude less than fetching data from persistent storage).

With columnar data, there tends to be much less wasted cache. For example, consider a row of data with 10 columns and a process that analyzes a single column. In a row-oriented data model, nine columns would occupy cache unnecessarily, limiting the number of values that can fit into cache. With columnar data only the column being analyzed would be read into cache, allowing for many more values to be processed together. Ultimately, columnar data makes processing more efficient by reducing the amount of time the CPU spends waiting for something to do.

SIMD

Modern processors have extended instruction sets that extend the concept of vectorized execution to multiple values in parallel. SIMD (Single Instruction Multiple Data) instructions became mainstream in desktop computing in the 1990s to improve the performance of multimedia applications, such as gaming.

As Wikipedia explains, SIMD is especially beneficial for applying the same change to many values, such as adjusting the brightness of an image. Each pixel’s brightness is defined by the values for red (R), green (G) and blue (B). To change the brightness, the R, G and B values are read from memory, the values are adjusted, and the results are written back out to memory. Without SIMD, pixel RGB values are read into memory individually. With SIMD, blocks of pixel RGB values can be processed together in a single instruction. This is vastly more efficient.

Theseconcepts are very applicable to processing data of many types in the world of analytics. SIMD exploits data level parallelism independent of concurrency (see the section below on multi-threaded programming). SIMD instructions are a way to execute the same instruction on multiple values at a time in the same clock cycle, literally multiplying by 4 or more the throughput of execution. Ideally, a process applies the same function to many values with organized processing in tight loops that do the same thing over and over. A columnar layout organizes similar values adjacent to one another, so it is relatively easy to take advantage of SIMD instructions. Arrow will also make sure values are properly aligned in memory to take maximum advantage of SIMD instructions.

Distributed computing, the cloud, and serverless

Cloud platform vendors like AWS have steadily built out their service offerings, incorporating more layers of the technology stack and simplifying what it takes to build an application. One of the key advantages of these offerings is their inherent elasticity - you can easily grow your use of resources and only pay for what you use.

These cloud services are actually an organized and orchestrated set of discrete hardware and software resources. For example, AWS S3 relies on many low-level hardware and software components that work together to reliably and cost-effectively store objects. While Amazon does not disclose details on these components, they consist of many computers and many programs in different languages, working together. And AWS margins are determined in part by how much utilization they can exploit from these resources.

Most programming languages assume a single server for their execution environment. Today and going forward, computing will be defined by the orchestration of many hardware and software resources working together. Apache Arrow was designed to provide efficient processing of data across many resources in parallel (see the section on multi-threaded programming), to simplify access to shared data structures, and to take advantage of new trends like RDMA (see the section on RDMA below) that are especially useful in distributed computing environments.

CPU cores and multi-threaded programming

For most of the past decade CPU speeds have not increased. Instead, CPUs have moved to increased cores. To take advantage of this change in architecture, software needs to maximize concurrency. One approach is to divide work into many discrete, parallel operations, and to coordinate these operations effectively as multi-threaded, parallel operations. Java, for example, provides extensive support for multithreading.

Another approach to increase concurrency is to build programs that operate asynchronously, which frees the CPU to perform work while it is waiting on other resources. JavaScript, for example, is single-threaded, but Node.js is asynchronous and capable of taking advantage of many cores efficiently. In both cases, there tends to be a greater number of threads, and therefore a greater number of transfers of objects between threads, or shared access to the same memory structures.

Apache Arrow was designed with these newer programming models in mind. When two processes interact with each other, they need to find a common language. This often ends up being the lowest common denominator and an inefficient format. Often memory representations are based on pointers using absolute addresses that can not be relocated without modifying the data. A lot of overhead is incurred by transforming an in-memory representation to a serialized representation back to a different in-memory representation.

Arrow’s representation only uses relative offsets and can be relocated without modification. Since the Arrow representation is the same in memory and on the wire, there is no overhead to transform it. The data can be directly sent on the wire using scatter-gather IO and retrieved in the same way. Once one process has finished writing to an Arrow Record Batch it can seal to make it immutable from that point on and share it in read-only form with other processes. This model make concurrent access and coordination dead simple since there is always either a single writer or multiple readers of immutable data.

If multiple systems are on the same machine, read-only shared memory can be used to share Arrow data without any copy involved, effectively removing all overhead from inter-process communication.

RDMA (remote direct memory access)

Remote Direct Memory Access is a mechanism that enables a process to access another computer’s memory without going through it’s CPU. This is beneficial for two reasons. First, it extends the read-only shared memory approach to the entire network (see the section above on multi-threaded programming) and freeing more CPU time when no coordination is required.

RDMA used to require special networking infrastructure. Today RDMA is becoming easier to implement on regular ethernet networks and commodity hardware clusters.

Since Arrow Record Batches are immutable after they’ve been sealed, they can be shared across the network without any more coordination required from their producer. This frees resources from the producer that can focus on doing more work while downstream operators consume the result of its previous work.

As Hardware Evolves

These are several of the key areas of hardware the Apache Arrow team considered when developing the project. Apache Arrow was designed to improve the efficiency of many data processing systems, and over a dozen projects already provide bindings for this columnar in-memory representation, including Hadoop, Spark, Python/Pandas, and R. Hardware will continue to evolve rapidly, and software will evolve to take advantage of these innovations.