The market is saturated with thousands of purpose-built database systems. This creates too much noise for analysts and scientists dealing with vital data problems across numerous important application domains: genomics, geospatial, and more.
To solve this problem, Stavros Papadopoulos, founder and CEO at TileDB, Inc., is introducing the concept of a universal database, which had heretofore been a missed opportunity in the database community. TileDB, Inc. envisions building the next evolution of the “database,” consolidating multiple data modalities, reproducible runnable code, and versatile compute into one product.
Prior to founding TileDB, Inc. in February 2017, Papadopoulos was a senior research scientist at the Intel Parallel Computing Lab and a member of the Intel Science and Technology Center for Big Data at MIT CSAIL for 3 years. He also spent about 2 years as a visiting assistant professor at the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology (HKUST).
What are the limitations of purpose-built data systems?
There is a paper by Michael Stonebraker that emphasizes “one size doesn’t fit all” for databases. There are certain applications that popular databases cannot handle efficiently, so the response is to build another database.
The benefit of building a system from scratch for a specific use case is that, if your organization only does one thing, you buy a specialized database and you’re golden. The limitation comes when organizations deal with more than one data type. In that case, you end up buying 10 purpose-built databases, and license costs skyrocket.
Data fabric and data mesh are consequences of specialty databases. Organizations have too much software to manage and no holistic governance over it.
Can you explain the concept of a universal database?
Purpose-built data systems are sometimes discussed as the root of [all] evil, but there’s no one to blame because in the ’70s, when relational databases were developed, we only had tables. People weren’t thinking about other types of data. A universal database will handle all workloads in one system.
The first requirement is a common core: a single piece of software that serves different types of workloads. Building [them] in-house and connecting them together isn’t universal; it doesn’t scale. You need certain components in the code that are common for all use cases. Otherwise, it is not a universal database, it is a universal aggregator.
Performance is the second required component. You cannot claim it does everything if it does everything in a subpar way. If it’s going to be called a universal database, the core components need to perform extremely well. If it sounds too good to be true, a key piece is missing.
How can TileDB help with this?
I spent my research years at MIT on one hypothesis: Can this even exist? Without performance, you can’t build this. What we found is that there is a data structure that makes it possible. If implemented properly, it supports tensor/multidimensional operations and corresponds to objects that people already know from Python.
This data structure needs to be flexible; it needs to handle both dense and sparse workloads. Sparse arrays require special treatment. That’s why existing array systems aren’t universal. This is the key: A universal database needs to be able to handle sparse use cases.
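To make the dense/sparse distinction concrete, here is a minimal Python sketch. It is not TileDB's implementation or API; the class names `DenseArray` and `SparseArray` and their methods are hypothetical, chosen only for illustration. It assumes a dense array stores a value for every cell, while a sparse array stores only the non-empty cells together with their coordinates, and it shows both answering the same kind of range query.

```python
from itertools import product


class DenseArray:
    """Hypothetical dense array: one stored value per cell, row-major order."""

    def __init__(self, shape, fill=0):
        self.shape = tuple(shape)
        size = 1
        for extent in self.shape:
            size *= extent
        self.cells = [fill] * size  # every cell occupies storage

    def _offset(self, coords):
        # Convert N-dimensional coordinates to a position in the flat list.
        offset = 0
        for c, extent in zip(coords, self.shape):
            offset = offset * extent + c
        return offset

    def write(self, coords, value):
        self.cells[self._offset(coords)] = value

    def read(self, ranges):
        """Return {coords: value} for all cells in the inclusive ranges."""
        axes = [range(lo, hi + 1) for lo, hi in ranges]
        return {c: self.cells[self._offset(c)] for c in product(*axes)}


class SparseArray:
    """Hypothetical sparse array: only non-empty cells are stored."""

    def __init__(self):
        self.cells = {}  # coordinates -> value

    def write(self, coords, value):
        self.cells[tuple(coords)] = value

    def read(self, ranges):
        """Return {coords: value} for stored cells in the inclusive ranges."""
        return {
            c: v
            for c, v in self.cells.items()
            if all(lo <= x <= hi for x, (lo, hi) in zip(c, ranges))
        }


# A small dense grid: all 16 cells exist, even though only one was written.
dense = DenseArray((4, 4))
dense.write((1, 2), 7)
dense_hit = dense.read([(1, 1), (2, 2)])

# A huge, mostly empty domain: only the single written cell is stored.
sparse = SparseArray()
sparse.write((1_000_000, 2_000_000), 7)
sparse_hit = sparse.read([(0, 2_000_000), (0, 3_000_000)])
```

The dense version would need storage for every cell of a domain that large, which is one way to see why sparse data needs its own treatment rather than being forced into a dense layout.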
If built on top of a multidimensional array, this data structure does something interesting. You can build this database with 90% common code, and then the type of workload doesn’t matter. You can build the data structure in the same way.
Depending on how many dimensions you define, this database shapeshifts. It can become a one-dimensional array for files, or an array of any dimensionality for tables.
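As a rough illustration of this "shapeshifting," the Python sketch below models a file and a table with the same array class. Everything here is an assumption made for the example: `NDArray`, `file_as_array`, and `table_as_array` are hypothetical names, not TileDB functions, and the ticker data is made up. The file becomes a one-dimensional array indexed by byte offset; the table becomes a two-dimensional array whose chosen columns act as dimensions, with the remaining columns stored as cell values.

```python
class NDArray:
    """Hypothetical minimal array: named dimensions, cells keyed by coordinates."""

    def __init__(self, dims):
        self.dims = tuple(dims)
        self.cells = {}

    def write(self, coords, value):
        if len(coords) != len(self.dims):
            raise ValueError("expected one coordinate per dimension")
        self.cells[tuple(coords)] = value

    def slice(self, **ranges):
        """Return cells inside the inclusive (lo, hi) range given per dimension.

        Dimensions that are not mentioned are left unconstrained.
        """
        def inside(coords):
            for name, x in zip(self.dims, coords):
                if name in ranges:
                    lo, hi = ranges[name]
                    if not lo <= x <= hi:
                        return False
            return True

        return {c: self.cells[c] for c in sorted(self.cells) if inside(c)}


def file_as_array(data):
    """Model a file as a 1D array: one byte per cell, indexed by offset."""
    arr = NDArray(["offset"])
    for i, byte in enumerate(data):
        arr.write((i,), byte)
    return arr


def table_as_array(rows, dims):
    """Model a table as an N-D array: `dims` columns become coordinates."""
    arr = NDArray(dims)
    for row in rows:
        coords = tuple(row[d] for d in dims)
        attrs = {k: v for k, v in row.items() if k not in dims}
        arr.write(coords, attrs)
    return arr


# The same class serves both shapes.
blob = file_as_array(b"hello")
middle = bytes(blob.slice(offset=(1, 3)).values())

trades = table_as_array(
    [
        {"day": 1, "ticker": "AAA", "price": 10.0},
        {"day": 1, "ticker": "BBB", "price": 20.0},
        {"day": 2, "ticker": "AAA", "price": 11.0},
    ],
    dims=["day", "ticker"],
)
day_one = trades.slice(day=(1, 1))
```

The storage and slicing code is shared; only the choice of dimensions differs between the two uses, which is the sense in which a common core can serve different workloads.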
I can prove it with financial time series, machine learning, and more. I have little doubt this can exist and can work very well.
How will universal databases factor into the future of data management?
In an ideal world, the community, users, customers, practitioners, and data engineers will be able to use the same database once they understand how this works.
TileDB has been in development for a very long time; that is changing, and we are moving into the future. There will be a learning curve with arrays, but once people understand the value and play with it, the rest of the market will follow those customers. Once this is understood, in an ideal world, there would be no more specialized databases.
In an ideal world, specialized databases won’t have to be built, and those already built will have to be converted to a universal database. Companies will be forced to start adding this data. A universal database will be a simpler and less expensive way to solve problems.