Columnar Data, Analytics, and the Evolving Role of Hardware


Analytical workloads differ from transactional workloads in that most queries read a small subset of columns across very large numbers of rows. To optimize for these demands, most analytical databases implement a columnar storage model.

Columnar storage organizes the values for a given column contiguously on disk. This significantly reduces the number of seeks for multi-row reads. Furthermore, compression algorithms tend to be much more effective on a single data type than on the mix of types present in a typical row. The tradeoff is that writes are slower, but this is an acceptable cost for analytics, where reads typically far outnumber writes.
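
To make the difference concrete, here is a minimal Python sketch with made-up example data; the toy run-length encoder illustrates why a column holding a single data type compresses so well:

    # Row-oriented: each record's fields are stored together.
    rows = [
        ("2016-04-01", "US", 4),
        ("2016-04-01", "US", 7),
        ("2016-04-02", "DE", 1),
    ]

    # Column-oriented: all values of one column are stored together.
    columns = {
        "date":    ["2016-04-01", "2016-04-01", "2016-04-02"],
        "country": ["US", "US", "DE"],
        "clicks":  [4, 7, 1],
    }

    # A query such as SUM(clicks) touches only one contiguous column...
    total = sum(columns["clicks"])

    # ...and a single-typed column with repeated values compresses well,
    # for example with a simple run-length encoding.
    def run_length_encode(values):
        runs, prev, count = [], None, 0
        for value in values:
            if value == prev:
                count += 1
            else:
                if prev is not None:
                    runs.append((prev, count))
                prev, count = value, 1
        if prev is not None:
            runs.append((prev, count))
        return runs

    print(run_length_encode(columns["country"]))  # [('US', 2), ('DE', 1)]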

The Apache Arrow project is a standard for representing columnar data for in-memory processing, which has a different set of trade-offs compared to on-disk storage. In memory, access is much faster, and processes optimize for CPU throughput by paying attention to cache locality, pipelining, and SIMD (single instruction, multiple data) instructions. It is not uncommon for users to see 10x-100x performance improvements across a range of workloads when using this standard.
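
As a rough illustration, the following sketch uses the pyarrow Python binding with invented column names and data; the NumPy view stands in for the kind of cache-friendly, vectorized processing described above:

    import numpy as np
    import pyarrow as pa

    # Build an in-memory, columnar record batch of illustrative data.
    prices = pa.array([10.5, 11.0, 9.75, 12.25], type=pa.float64())
    quantities = pa.array([3, 1, 4, 2], type=pa.int64())
    batch = pa.RecordBatch.from_arrays([prices, quantities],
                                       names=["price", "quantity"])

    # Each column is a contiguous buffer, so it can be viewed as a NumPy
    # array without copying and processed with vectorized operations.
    price_vec = batch.column(0).to_numpy()
    qty_vec = batch.column(1).to_numpy()
    revenue = price_vec * qty_vec  # one vectorized multiply
    print(revenue.sum())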

Here is a description of three key hardware innovations that were considered in designing the project: GPUs, FPGAs, and 3D XPoint.

GPUs

GPUs have become mainstream in a variety of computing applications such as machine learning and big data analytics due to their efficiency at highly parallel processing. The GPU is a specialized processor optimized for parallel workloads, with up to 200x as many cores as a CPU. GPUs are designed to apply SIMD instructions to vectors of values, such as adjusting the brightness of every pixel in an image. Arrow's vector representation was designed to map directly to the vectors that GPUs can process, and its columnar layout organizes similar values adjacent to one another, so it is relatively easy to take advantage of SIMD instructions.
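
The brightness example can be sketched as follows, with NumPy standing in on the CPU for the SIMD-style processing described; on a GPU the same element-wise kernel would be spread across thousands of threads (for instance via CUDA or a library such as CuPy):

    import numpy as np
    import pyarrow as pa

    # A column of 8-bit pixel brightness values in Arrow's columnar layout.
    pixels = pa.array(np.random.randint(0, 256, size=1_000_000, dtype=np.uint8))

    # Viewed as one contiguous vector, the whole column can be adjusted with
    # a single vectorized operation, the same element-wise work that SIMD
    # units or GPU cores perform in parallel.
    vec = pixels.to_numpy()
    brighter = np.clip(vec.astype(np.int16) + 30, 0, 255).astype(np.uint8)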

GPUs have their own memory (VRAM), whose bandwidth is 10x that of most CPUs. However, VRAM is typically limited to just a few GB. For datasets that do not fit in VRAM, data must be marshalled between system memory and VRAM, where the GPU performs the computations. Because VRAM is a scarce resource, it is very useful for multiple workloads to operate on a shared VRAM representation. We designed Arrow to work across languages and processes, enabling shared access to the same buffers, which significantly improves efficiency.
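
A rough sketch of the marshalling step described above, assuming a CUDA-capable GPU and the CuPy library (both are assumptions made here purely for illustration and are not part of Arrow itself):

    import numpy as np
    import pyarrow as pa
    import cupy as cp  # assumes a CUDA-capable GPU is available

    # Host-side Arrow column in system memory (illustrative data).
    values = pa.array(np.arange(1_000_000, dtype=np.float32))

    # Marshal the column's contiguous buffer into VRAM...
    device_vec = cp.asarray(values.to_numpy())

    # ...where the GPU performs the computation, then copy the result back.
    result = cp.sqrt(device_vec) * 2.0
    host_result = cp.asnumpy(result)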

GPUs' combination of thousands of cores and high-bandwidth memory is well aligned with the needs of many analytical workloads. We designed Arrow with these advantages in mind: data is read from disk or off the network into Arrow buffers as vectorized, compressed, columnar structures. Each Arrow buffer represents a group of data that can be processed independently. This design allows GPUs to distribute large workloads across many cores, taking advantage of high-speed parallelization and pipelining.
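
For example, a table can be split into record batches and each batch handed to a separate worker. The sketch below uses a CPU thread pool purely for illustration, but the same per-batch decomposition is what lets a GPU spread the work across its cores:

    import numpy as np
    import pyarrow as pa
    from concurrent.futures import ThreadPoolExecutor

    # An illustrative table, split into independent record batches.
    table = pa.Table.from_arrays(
        [pa.array(np.random.rand(1_000_000))], names=["x"])
    batches = table.to_batches(max_chunksize=100_000)

    # Each batch is a self-contained group of columnar buffers, so batches
    # can be processed independently by separate workers.
    def partial_sum(batch):
        return batch.column(0).to_numpy().sum()

    with ThreadPoolExecutor() as pool:
        total = sum(pool.map(partial_sum, batches))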

FPGAs

The intensive demands of some applications, such as deep neural networks (DNNs), genomic data analysis, and recommendation systems, are forcing architects to consider specialized hardware alternatives. Field programmable gate arrays (FPGAs) are another type of specialized chip that excels at specific workloads. Instructions are programmed into an FPGA so that the chip becomes, in effect, a hardware implementation of an algorithm, which executes much more quickly than it would in software. FPGA latencies are typically an order of magnitude lower than those of GPUs: hundreds of nanoseconds versus single-digit microseconds. FPGAs also consume much less power than CPUs or GPUs.

Historically, FPGAs were harder to adopt due to their specific hardware needs and programming requirements. Today most major cloud platform vendors provide FPGA-equipped instances, and new software development tools are making it easier to integrate the benefits of FPGAs into new applications. (Google took a slightly different approach with its Tensor Processing Unit, an ASIC chip it designed for DNN workloads in TensorFlow.)

Like GPUs, FPGAs are massively parallel and benefit from the size and structure of Arrow's vectorized, compressed, columnar structures. Also like GPUs, FPGAs have limited on-chip memory and instead rely on multiple PCIe connections to system memory. Because Arrow is designed for multiple processes, different workloads can operate on the same Arrow buffers, saving scarce RAM resources.
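
One way to picture that sharing is sketched below using Arrow's IPC format over a memory-mapped file (the file name is invented for the example): a producer writes the buffers once, and any number of consumers can map the same bytes without making their own copies.

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays(
        [pa.array([1.0, 2.0, 3.0]), pa.array([10, 20, 30])],
        names=["weight", "count"])

    # Producer: write the batch once in Arrow's IPC file format.
    with pa.OSFile("shared.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, batch.schema) as writer:
            writer.write_batch(batch)

    # Consumers (possibly other processes): memory-map the same file and
    # read the buffers without copying them into each consumer's own heap.
    with pa.memory_map("shared.arrow", "r") as source:
        reader = pa.ipc.open_file(source)
        shared = reader.get_batch(0)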

Arrow has additional benefits for DNN workloads, which perform repeated multiply/accumulate steps on vectorized data structures, typically expressed as matrix multiplication. Arrow was designed with machine learning, including DNNs, in mind, and its layout is efficient for matrix multiplication. In addition, Arrow can serve as both the input and the intermediate data structure for these processes, accelerating subsequent steps in the overall pipeline.
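
A small sketch of that pattern, with invented feature columns and weights, using NumPy for the matrix multiplication:

    import numpy as np
    import pyarrow as pa

    # Three feature columns in Arrow's columnar layout (illustrative data).
    f1 = pa.array([0.2, 0.5, 0.1], type=pa.float32())
    f2 = pa.array([1.0, 0.0, 1.0], type=pa.float32())
    f3 = pa.array([0.7, 0.3, 0.9], type=pa.float32())

    # Stack the columns into a dense matrix and apply one layer's weights,
    # the multiply/accumulate pattern at the heart of DNN workloads.
    X = np.column_stack([c.to_numpy() for c in (f1, f2, f3)])  # shape (3, 3)
    W = np.random.rand(3, 4).astype(np.float32)                # layer weights
    activations = X @ W  # matrix multiplication; could feed the next layer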

3D XPoint SSDs

Earlier this year, Intel finally announced availability of its new 3D XPoint product called Optane. This is a new class of storage with performance and cost somewhere between that of NAND flash SSDs and DRAM. Intel claims that Optane is built for applications with high read/write loads, and requirements for low latency. Per Intel’s original claims, Optane will provide about 10x the density of DRAM and about 1000x better latency than SSD. Two other important differences compared to NAND flash are that Optane has much higher endurance over time, and Optane storage is byte-addressable. In contrast, NAND flash provides page and block-level access to data, which adds significant overhead for write-intensive applications.

Intel also released Memory Drive Technology, which makes Optane available as RAM for Xeon processor chips. Memory Drive Technology combines regular DRAM with Optane SSDs to provide a single pool of volatile memory. While this will be slightly slower than DRAM overall, the combination of RAM and Optane can significantly increase the amount of physical memory in a server: 2-socket Xeon systems can hold up to 24TB of Optane (versus 3TB of RAM), and 4-socket systems support up to 48TB of Optane (versus 12TB of RAM).

Optane and similar products will increase the amount of data that can be stored in memory, cost effectively and with low power demands. Because Arrow is designed for in-memory processing, much more of an organization's data can be analyzed in memory at any given time, reducing latency and improving efficiency for a wide range of analytical applications.

Advancing Performance with Columnar In-Memory

GPUs, FPGAs, and 3D XPoint were the three key hardware innovations considered in designing the project. Because Apache Arrow was designed to benefit many different types of software in a wide range of hardware environments, the project team focused on making the work "future proof," which meant anticipating changes to hardware over the next decade.


The core reason for implementing in-memory technology is to improve performance. To help accelerate adoption of in-memory technologies and provide a universal standard for columnar in-memory processing and interchange, the lead developers of 13 major open source big data projects have joined forces to create Apache Arrow, a new top level project within the Apache Software Foundation (ASF).
