Columnar database technology burst on the data warehouse scene just a couple years ago with promises of faster query speeds on vast amounts of data. They delivered on that promise, but at a cost that is no longer worth paying. Here's why.
Data - big or small - is naturally row-based. It is how database systems organize information as it's collected. Rather than being a new and better "size" alternative in a "one-size-fits-all" world, columnar databases perform unnatural acts of indexing on row-based data stores for one primary reason - to reduce the number of query-slowing input/output (I/O) calls against spinning hard drives; which is very much a 2005-type problem, when you stop to think about it.
At one time, hard drive spindle speeds were a serious data analysis bottleneck. While processors gained speed and cores by leaps and bounds, hard drives lagged in their ability to quickly find and read data. Higher capacity drives only exacerbated the speed gap between disks and processor chips. Much faster solid state memory in the form of on-board flash RAM offered little relief as it was low in capacity and high in price.
Because this problem persisted for years, the cumbersome data manipulations and associated IT complications of columns seemed worthwhile. But technology, like time, marches on and the calculus has changed - significantly.
Today, even low-cost commodity server platforms have access to terabytes of lightning fast RAM, making analytic reads from disks strictly optional. Executing through hyper-threaded, multicore processors, a new generation of in-memory, massively parallel processing databases - architected from the ground up to take advantage of abundant RAM - deliver the query performance of column-based databases with none of the cost and complications brought on by the aforementioned unnatural acts of data mutilation. But that's not the only advantage.
Columnar databases impose a subtle, but very real, constraint on query parameters. To achieve its goal of limiting disk I/O, columnar databases query against what are essentially indexed summaries of the original row-based data store. Each row of data must be first split into its component column values and each of those values must then be written to a different place within the database. These indexing schemas must be correct for them to work properly - which can require multiple iterations to determine formats - and even if they work properly a close examination of the process reveals that user queries are being limited by the index scheme itself. Without access to the original data in its original form, true ad hoc querying of the data is not possible. Rather, users are restricted to queries that conform to categories of comparison, called projections, anticipated by the original indexing. People think of the oddest questions and odds are column indexing can't cover all scenarios.
This columnar requirement to alter the original data structure introduces other practical issues, not least of which is operational latency when attempting to conduct more complex queries to perform more sophisticated analysis. For example, imagine a retailer using a columnar data warehousing system wants to run complex basket analysis type queries on a large data set, say something over 5TB. Due to multiple fact tables and complex joins, it will take days to get the columnar database set up as the schema has to be constructed multiple times to get it working right before data can even be loaded and analyzed. Further, updating the data warehouse with fresh data is not a straight-forward process, causing columnar database vendors to employ complicated tricks to do updates in a reasonable timeframe. Overall, from an IT management perspective, columnar data makes life complicated. Complexity is costly. For companies seeking the agility and advantages of near-real time analysis, this type of latency between data collection and data anlaysis is a real problem only exacerbated by the information firehose effect that is big data.
In addition, this need to index and project can significantly diminish another one of the hyped benefits of columnar databases - compression. Because they write indexes of the actual data, columnar databases are touted as providing exponential levels of data compression; a very attractive proposition for companies dealing with the aforementioned big data fire hose of information. What's less publicized, however, is the effect that creating multiple indexed projections has on this benefit. As data sets get larger and more complex, the need to do more complex queries typically increases. This, in turn, multiplies the number of projections that are created. Fairly quickly, this can significantly reduce the intial compression benefit of indexing. In fact, one well-known columnar database purveyor recommends having as much disk as the uncompressed data.
Initially, a leading gambling operator that wanted to up its player analytics game and was initially interested in using a columnar database technology because of the high compression rates that the vendor of the solution said made it unique. However, because data had to be duplicated over and over in order to build so-called projections in the database, the benefit of having a high level of compression was lost as the gambling operator still had to have as much disk available as the total amount of uncompressed data.
By comparison, in-memory analytic databases can maintain the original row-based data structure while letting multiple users run queries at train-of-thought speed. Unbounded by the constraints of columnar indexing, users are free to explore any and all possible relations present within the data. Further, because the data structure is preserved and information passes quickly and easily from collection point to data warehouse, users are assured they're working against near real-time data that accurately depicts the current lay of the land.
And the future is always arriving. Even now, industry leader Intel is reportedly working on new CPU technology that would enable 46-bit address spaces, overcoming a longstanding limit and thus allowing up to 64 terabytes of addressable RAM on a single server.
Memory technology itself is being pushed toward exciting new advances with direct application in and benefit for data warehousing and analytics. Dynamic RAM today is fast and getting faster on a regular basis, but it's a volatile medium. If you lose your power, you lose your in-flight data - making disks a necessary safe harbor for persistence. But that present-day reality appears poised for a slide into the rearview mirror as years of research into different forms of non-volatile RAM (NVRAM) appears poised to alter the commercial landscape for fast, persistent, enterprise-class memory. The NVRAM just ahead in the commercialization pipeline will be a significant leap beyond the flash memory of today, which, though offering faster performance than spinning disks, is not up to DRAM speeds and suffers from data reliability issues under constant, heavy use. The next generation technologies for NVRAM, such as phase-change memory (PRAM), are promising to deliver something very close to universal memory; offering performance that eclipsing both RAM in speed and spinning disks in data durability.
Make no mistake; these technologies will not arrive via special delivery next week - or even next year - shrink-wrapped and ready for deployment at scale at fire sale prices. The trend, however, is clear and inexorable toward more and persistent memory, more efficient use of increasingly capable multi-core CPUs and increased bandwidth tying these platforms together.
So, as spinning disks and other limitations they grew up with fade in the face of continued technological progress, will columnar data analysis disappear? In a word, no - it's more likely columnar will become a feature or capability within of a larger, more capable solution
For instance, in applications where careful effort has been made to tune the data for query performance and there is a need to run the same set of real-time queries over and over again, columnar indexing might make sense. But again, it's a capability that can be offered as a feature set in an in-memory offering. More typical in businesses grappling with getting the most out of their big data investments is a scenario where a broad range of user types are seeking to ask streams of ad hoc, often fairly complex questions, against at least near-real-time information. For the reasons explained above, adapting a columnar architecture as the data warehouse engine in such a scenario will pose significant costs in terms of operational cost and complexity.
So why go to all the trouble of putting data into columns if you don't have to? Columnar databases were invented to solve yesterday's problems. It's time to look forward.