According to the “three V’s” definition of big data (Volume, Velocity, Variety), big data is as much about the increasing rate of data acquisition as it is about increasing volumes. Big data and NoSQL databases emerged as much in response to rising transaction rates as to growing storage volumes.
At the same time that NoSQL gathered steam, another transformative technology was taking hold in the database world. Since the earliest days of commercial computing, magnetic disk devices have represented the slowest component in the application stack. And although Moore’s Law—which observes that electronic components exhibit exponential increases in capability—has resulted in massive improvements in CPU power, memory capacity, and storage density, it has had little effect on the speed of magnetic disks, which are subject to mechanical rather than electronic constraints.
The Promise of Flash SSD in Improving Database Performance
Solid State Disk (SSD)—particularly flash SSD—promised to revolutionize database performance by providing a storage medium that was orders of magnitude faster than magnetic disk, offering the first significant improvement in disk I/O latency in decades. However, flash SSD has to overcome at least two obstacles before it can become a mainstay of database architectures, especially in the context of big data.
Firstly, while flash SSD provides higher I/O rates—effectively a lower cost per I/O—flash disks have significantly less raw storage capacity and a much higher cost per TB of storage. The cost of flash storage is falling, but not much faster than the cost of magnetic disk storage. Meanwhile, organizations are being challenged to store ever greater quantities of raw data, which usually rules out a complete conversion to flash storage.
Secondly, while flash storage always provides excellent read I/O latencies, write operations can be more problematic. Writing to an empty page of flash storage is slower than reading that page but still pretty fast. However, over-writing an existing page involves an expensive block erase operation which is typically not much faster than a write I/O for a magnetic disk. Databases that perform large amounts of update operations can therefore experience disappointing performance from flash SSD.
Aerospike is a NoSQL database whose architecture is designed to fully exploit the I/O characteristics of flash SSD. Aerospike implements a log-structured file system in which updates are physically implemented by appending the new value to the file and marking the original data as invalid. The storage for the older values is recovered by a background process at a later time.
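The log-structured approach described above can be illustrated with a minimal sketch (a hypothetical in-memory model, not Aerospike’s actual implementation): updates append a new record rather than overwriting in place, the index points at the latest copy, and a later compaction pass reclaims the space held by superseded versions.

```python
# Hypothetical sketch of a log-structured store: appends instead of
# in-place updates, with deferred space reclamation (compaction).

class LogStructuredStore:
    def __init__(self):
        self.log = []       # append-only list of (key, value) records
        self.index = {}     # key -> offset of the live record in the log

    def put(self, key, value):
        # Never overwrite in place: append the new value and repoint
        # the index, leaving the old record behind as garbage.
        self.log.append((key, value))
        self.index[key] = len(self.log) - 1

    def get(self, key):
        offset = self.index.get(key)
        return self.log[offset][1] if offset is not None else None

    def compact(self):
        # Background-style pass: keep only records the index still
        # points at, reclaiming space held by superseded versions.
        live = [(k, v) for off, (k, v) in enumerate(self.log)
                if self.index.get(k) == off]
        self.log = live
        self.index = {k: off for off, (k, _) in enumerate(live)}


store = LogStructuredStore()
store.put("user:1", "alice")
store.put("user:1", "alicia")   # the update appends; the old value remains
assert len(store.log) == 2      # two records: one live, one garbage
store.compact()
assert len(store.log) == 1 and store.get("user:1") == "alicia"
```

The payoff on flash is that the expensive block-erase path is avoided on every update; erases are amortized into the background compaction step.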
The Influence of Flash SSD in NoSQL and Big Data Architectures
Aerospike also implements an unusual approach in its use of main memory. Rather than using main memory as a cache to avoid physical disk I/O, Aerospike uses main memory to store indexes to the data, while keeping the data always on flash. This approach represents a recognition that, in a flash-based system, the “avoid I/O at all costs” approach of traditional databases may be unnecessary.
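This division of labor—index in main memory, data always on the storage device—can be sketched as follows. The layout here is an assumption for illustration, not Aerospike’s on-disk format: the in-memory index holds only (offset, length) metadata, so each lookup costs exactly one device read rather than relying on a cache to avoid I/O.

```python
import os
import tempfile

# Illustrative sketch (assumed layout): a RAM-resident index mapping keys
# to locations in a data file, with values read from the device on demand.

class RamIndexFlashData:
    def __init__(self, path):
        self.index = {}                 # in RAM: key -> (offset, length)
        self.f = open(path, "w+b")      # stands in for the flash device

    def put(self, key, value: bytes):
        self.f.seek(0, os.SEEK_END)     # append the value to the device
        offset = self.f.tell()
        self.f.write(value)
        self.f.flush()
        self.index[key] = (offset, len(value))  # only metadata in RAM

    def get(self, key):
        loc = self.index.get(key)
        if loc is None:
            return None
        offset, length = loc
        self.f.seek(offset)             # a single device read per lookup
        return self.f.read(length)


path = os.path.join(tempfile.mkdtemp(), "data.bin")
db = RamIndexFlashData(path)
db.put("user:1", b"alice")
db.put("user:2", b"bob")
assert db.get("user:1") == b"alice"
```

Because flash read latency is low and predictable, paying one read per lookup is acceptable in a way it never was with magnetic disk, while memory is spent only on the compact index.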
Flash remains a far more expensive medium for storage of bulk data, but for systems with high raw I/O throughput requirements, the extra cost of flash may be less excessive in practice. Many high-end database implementations sacrifice storage density on magnetic disk in order to achieve higher I/O throughput, either by “short stroking” (getting better performance from individual disks by reducing the amount of data stored on each one) or by implementing redundant database servers to add to the collective I/O capacity.
Databases such as Aerospike may currently be most attractive for a relatively small number of applications with very high transaction rates. However, there is little doubt that the influence of flash SSD on NoSQL and big data architectures will increase over time.