The Next Information Processing Super Engine

Apr 11, 2012

By Joachim Wester

The database industry is changing. Some internet applications such as search engines and social networks have rolled out their own scale-out technologies, such as NoSQL and MapReduce, effectively ignoring the traditional database, and essentially accusing it of being too underpowered. The database titans in turn remind us why consistency, schemas and query languages matter. Both camps point out considerable weaknesses in each other's approaches. The question is, what will power the future information systems?

The Current State of the Industry

The database industry has been a master of keeping the status quo. Even though business software customers have long suffered from poor performance and high license costs, the industry has been able to resist change. But a perfect storm is forming. The Internet is rapidly giving birth to interactive services, such as Twitter, Facebook and new social music services, that can no longer be supported by the old engines. Younger developers have shown that they are open to new technologies, and that database vendors and their products are no longer holy institutions that cannot be replaced.

That being said, the database options for these young developers are limited. Today's proven databases come with performance issues and are cumbersome to use compared to some new alternatives. At the same time, the scaled-out solutions, also known as big data solutions, require developers to sacrifice data consistency and call for a large amount of hardware to achieve decent performance.

While rhetoric and trends fuel the heated discussions on what to do with the database, the underlying problem is scientific and can be boiled down to two scientific principles: Eric Brewer's CAP theorem, also known as Brewer's theorem, and Albert Einstein's famous theory of relativity. The CAP theorem states that a distributed computer system cannot simultaneously guarantee consistency, availability and partition tolerance. It can satisfy any two of these guarantees at the same time, but not all three. One consequence is that real-time signals must be sent between servers in a data center or in the cloud to keep data consistent. Consistent data is nice to have for all applications, but it is absolutely necessary for business transactions that deal with money and quantities, or with any other data that is updated incrementally. You simply cannot add more machines to your data center, keep data available and keep the information shared and uncorrupted all at the same time without sending these real-time signals between the servers.

Einstein's theory of relativity has put a speed limit on these signals, and this is now limiting our information systems. It takes a PC a few hundred picoseconds to perform an instruction, but it takes over three thousand picoseconds for light to travel a single meter. As CPUs become faster, the gap between the speed of communication between computer nodes and the processing power inside the CPU continues to grow.
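The arithmetic behind this comparison can be sketched in a few lines of Python. The 3 GHz clock rate and the 100 meter distance between servers are illustrative assumptions, not figures from this article, and the calculation uses the speed of light in a vacuum, so real signals in copper or fiber would be slower still.

```python
# Back-of-the-envelope comparison of signal travel time and CPU cycle time.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # speed of light in a vacuum
PS_PER_SECOND = 1e12

def light_travel_ps(distance_m: float) -> float:
    """Picoseconds light needs to cover distance_m in a vacuum."""
    return distance_m / SPEED_OF_LIGHT_M_PER_S * PS_PER_SECOND

def cycle_time_ps(clock_hz: float) -> float:
    """Duration of one clock cycle in picoseconds."""
    return PS_PER_SECOND / clock_hz

ASSUMED_CLOCK_HZ = 3e9      # assumption: a 3 GHz core
ASSUMED_DISTANCE_M = 100.0  # assumption: two servers 100 m apart

one_meter_ps = light_travel_ps(1.0)
cycle_ps = cycle_time_ps(ASSUMED_CLOCK_HZ)
round_trip_ps = 2 * light_travel_ps(ASSUMED_DISTANCE_M)
cycles_waiting = round_trip_ps / cycle_ps

print(f"light, 1 m:           {one_meter_ps:.0f} ps")
print(f"one clock cycle:      {cycle_ps:.0f} ps")
print(f"round trip, 100 m:    {round_trip_ps:.0f} ps")
print(f"clock cycles elapsed: {cycles_waiting:.0f}")
```

Under these assumptions, a single round trip between two servers costs on the order of two thousand clock cycles before any switching or software overhead is counted.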

If you combine the findings of Einstein and Brewer, it becomes clear that you are only allowed two of the following three capabilities at the same time:

1. Scale-out Performance Gains (add more servers)

2. Consistent Data (ability for many users to read and write without corrupt data)

3. Shared Everything Data

This is why, after 40 years, there is no simple solution to the database problem. The TPC-C benchmark used by some vendors is a good illustration of this. When the big vendors want to show great performance using the Transaction Processing Performance Council's (TPC) benchmark known as TPC-C, they create a lot of almost isolated warehouses to store data. Every customer belongs to only one of the many thousands of warehouses. As both orders and stock are associated with only one warehouse, each warehouse can act as an isolated database. The benchmark is therefore a theoretical example that does not carry over to real applications, where data is shared.

The NoSQL vendors also like to show good numbers in terms of performance. Instead of giving up option 3 (shared data), they typically give up consistency. This is why, for example, a comment might show up on Facebook, then disappear, only to show up again later. Data that you simply add to the database will eventually appear, as in the Facebook example, but data that is updated on the basis of other data, such as financial transactions, may end up wrong and corrupt.
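The difference between added data and incrementally updated data can be illustrated with a small simulation. This is a minimal sketch, not the behavior of any particular product: it assumes two replicas that accept writes without coordinating and reconcile later with a "last writer wins" rule, which is one common policy among several. All names and amounts are made up for the example.

```python
# Two replicas accept writes independently and reconcile afterwards.

def merge_last_writer_wins(a, b):
    """Reconcile two (timestamp, value) versions by keeping the newer one."""
    return a if a[0] >= b[0] else b

# Case 1: data that is only added (comments). Merging is a set union,
# so every comment eventually appears on both replicas.
replica_a_comments = {"nice photo"}
replica_b_comments = {"congrats!"}
merged_comments = replica_a_comments | replica_b_comments

# Case 2: data updated on the basis of other data (an account balance).
# Both replicas read the same starting balance and apply a deposit.
start = (0, 100)                        # (logical timestamp, balance)
replica_a_balance = (1, start[1] + 50)  # deposit of 50 seen by replica A
replica_b_balance = (2, start[1] + 30)  # deposit of 30 seen by replica B
merged_balance = merge_last_writer_wins(replica_a_balance, replica_b_balance)

expected_balance = 100 + 50 + 30

print(sorted(merged_comments))
print("merged:", merged_balance[1], "expected:", expected_balance)
```

The comments converge, but the merged balance keeps only one of the two deposits: the other update is silently lost, which is exactly the kind of corruption a financial system cannot accept.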

The New Information Super Engine

While sacrifice is a reality in many situations, is there a way that database technology today can fulfill all three requirements: strong performance at scale, consistency and shared data?

The concept of 'scaling in' rather than 'scaling out' might just be the ticket. By shortening the distance that information needs to travel we can cope with both the CAP theorem and Einstein's theory of relativity. This simply means that the distance between code and data is lessened by orders of magnitude. The disk storage is moved into RAM and the interaction of information and application code is travelling within the CPU instead of over network cards and computer nodes in large data centers. But if the solution is so obvious, why hasn't it been done earlier? The simple answer is this. Until a few years ago, this solution was not feasible. Let's take a look.

Before the current generation of computers, PCs were 32-bit and could only address a few gigabytes of RAM. But we've come a long way in a short amount of time. PCs are now 64-bit and can address many terabytes of RAM. In addition, RAM prices have fallen 99.998% since the birth of the first SQL databases in the 1970s and more than 90% since the turn of the millennium. Because the concept of scaling in means shrinking the architectural scale at which the database operates, it also means replacing primary memory with the CPU cache and replacing the disk with RAM. When this is done, the disk serves as a mere backup medium that hosts the change log and database image checkpoints. Because the disks spin very fast and the disk arm hardly has to move when changes are written one after another, terabytes of changes can easily be secured in a short period of time. By scaling in rather than scaling out, the problem of inconsistent data is eliminated. Instead, this method provides consistent data, cheap hardware (fewer servers) and simplified development.
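The division of labor described here, with RAM holding the data and disk holding only a change log and checkpoint images, can be sketched as a toy key-value store. This is an illustration under simplifying assumptions, not a description of any real product: the class name, file names and JSON format are hypothetical, and the sketch ignores concurrency and torn log writes.

```python
import json
import os
import tempfile

class InMemoryStore:
    """Toy store: data lives in a dict in RAM; disk holds only a
    checkpoint image and an append-only change log."""

    def __init__(self, directory):
        self.checkpoint_path = os.path.join(directory, "checkpoint.json")
        self.log_path = os.path.join(directory, "changes.log")
        self.data = {}
        self._recover()
        self.log = open(self.log_path, "a", encoding="utf-8")

    def _recover(self):
        # Load the latest image, then replay the changes logged after it.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path, encoding="utf-8") as f:
                self.data = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value

    def put(self, key, value):
        # Sequential append: secure the change on disk, then apply it in RAM.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.data[key] = value

    def checkpoint(self):
        # Write a full image, swap it in atomically, then start a fresh log.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.checkpoint_path)
        self.log.close()
        self.log = open(self.log_path, "w", encoding="utf-8")

    def close(self):
        self.log.close()


def demo():
    with tempfile.TemporaryDirectory() as directory:
        store = InMemoryStore(directory)
        store.put("stock:widget", 10)
        store.checkpoint()
        store.put("stock:widget", 7)  # recorded in the log only
        store.close()                 # simulate a restart
        recovered = InMemoryStore(directory)
        result = dict(recovered.data)
        recovered.close()
        return result

print(demo())
```

Reads never touch the disk, and every write costs one sequential append. After a restart, the store rebuilds its state from the last image plus the log, so the value written after the checkpoint survives.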

An even more extreme example of scaling in is to merge the database management system with the programming language environment. The transient memory of the programming language, such as Java or C++, and the RAM used by the database can be merged into a single entity. This is very different from merely moving the database to RAM, because in that case information is still moved to and from the database whenever your code makes use of it. Merging the memory of the language's virtual machine (VM) with that of the database management system (DBMS) creates a combined virtual machine database management system, or VMDBMS, which allows the data to stay put. This provides for more transactions per second on a single PC than could be performed over a very large cluster in a data center. By employing the VMDBMS, developers have simple, consistent data that performs better than a very large data center. And as user-written code can operate on the data in RAM with thousands of times less overhead, methods such as MapReduce and other NoSQL concepts can be employed much more efficiently.
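The distinction between a database that merely lives in RAM and one that shares memory with the application can be sketched as follows. This is a conceptual illustration only and does not show how any VMDBMS is actually built: JSON text stands in for whatever format a separate database uses, a plain dictionary stands in for the shared memory, and all names are hypothetical.

```python
import json

# (a) Separate in-memory database: records are kept in the database's own
#     format, so the application works on a copy and must write it back.
database = {"account:1": json.dumps({"balance": 100})}

def deposit_via_copy(key, amount):
    record = json.loads(database[key])  # data moves out of the database
    record["balance"] += amount
    database[key] = json.dumps(record)  # and back in again

# (b) Merged memory: the object the application holds is the stored record.
shared_heap = {"account:1": {"balance": 100}}

def deposit_in_place(key, amount):
    shared_heap[key]["balance"] += amount  # no copy, no conversion

deposit_via_copy("account:1", 25)
deposit_in_place("account:1", 25)

print(json.loads(database["account:1"])["balance"])
print(shared_heap["account:1"]["balance"])
```

Both versions end with the same balance, but the first converts and copies the record twice per update even though everything is already in RAM, while the second touches the data where it lives.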

A glimpse into the future

It is clear that as the technology landscape changes significantly, and more innovative real-time applications come into the picture, the database itself will need to evolve accordingly to support consistency, higher performance and shared data.

The database of the future should combine the two extremes of scaling in and scaling out, and distinguish between the different types of data that you have. When consistency is key, as it is in financial transactions, databases will need to scale in. But when working with social networks like Facebook, where it may not be critical that each post is consistently updated, then databases will need to scale out.

The next information super engines must allow us to operate in a real-time world, while ensuring the three key components of performance, consistency and shared data. The next-generation large enterprise ERP system will be written using less complex code, will employ more than one server only for redundancy and will allow more fine-grained and potent data models. We will expect everything to happen without delay, and we will expect data to be correct and up to date. And on a larger scale, this foundation will give the entrepreneurs of tomorrow a platform on which to develop new applications that are as unimaginable to us as Facebook, Google and World of Warcraft were to the generation before us.

About the author:

Joachim Wester founded Starcounter in 2006.