On Dec. 11, 1998, NASA launched the Mars Climate Orbiter to gather data on the Martian climate. On Sept. 23, 1999, the orbiter was lost due to a data error. One piece of software was reporting impulse in pound-force seconds, while another was expecting SI units (newton-seconds). This kind of data miscommunication happens all the time between databases. Let’s look at what you need to build machine learning models at scale, what the current options are, and why “feature stores” are becoming a new trend in this space.
Machine Learning at Scale
Machine learning models rely on rich, robust data to drive their insights. Most gains in model performance come from rich feature engineering and data manipulation work, rather than from cutting-edge models and expensive hyperparameter optimization. To support these features, you need a system that can provide sophisticated aggregations, robust and accurate point-in-time historic information, and the ability to combine numerous data sources. To do machine learning at scale and in production, you need all of these things, in addition to stability, reliability, and lightning-fast response times. So, how can the current OLAP and OLTP systems handle these needs?
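To make “point-in-time historic information” concrete, here is a minimal sketch in Python. The feature history, the integer timestamps, and the helper `value_as_of` are all hypothetical and exist only for illustration. The idea is that each training example should see a feature value as it stood when the label was observed, never a later one.

```python
from bisect import bisect_right

# Hypothetical history of one feature for one customer: (timestamp, value)
# pairs sorted by timestamp. Timestamps are plain integers for simplicity.
order_count_history = [(1, 0), (5, 1), (9, 2), (14, 3)]


def value_as_of(history, ts):
    """Return the latest value recorded at or before ts, or None if there is none."""
    timestamps = [t for t, _ in history]
    idx = bisect_right(timestamps, ts)  # number of entries with timestamp <= ts
    if idx == 0:
        return None
    return history[idx - 1][1]


# Hypothetical label events: (time the label was observed, label).
label_events = [(4, "no_churn"), (10, "churn")]

# Each training row pairs the label with the feature value known at that time.
training_rows = [
    (value_as_of(order_count_history, ts), label) for ts, label in label_events
]
print(training_rows)
```

Looking up the most recent value instead would leak information from the future into training, which is one reason a plain "current state" table is not enough for this kind of workload.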
OLTP systems are primarily used to run a business’s real-time systems. They have been optimized for lightning-fast information retrieval in order to power websites, track transactions, or handle other mission-critical business needs, and they do that job well.
OLAP systems are designed to provide deeper intelligence and information on the historic state of the business. These systems can be used to understand business metrics, power forecasting systems, and support complex queries that aggregate and calculate information across multiple transactions and systems. OLTP systems typically process orders of magnitude more queries than OLAP systems and are expected to run in milliseconds. Conversely, OLAP queries are generally measured in seconds or minutes (or hours ... to the frustration of your organization’s data engineers).
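The difference between the two workloads shows up in the shape of the queries. The sketch below uses Python's built-in `sqlite3` module with a made-up `orders` table; the table, its rows, and the helper names are assumptions for illustration. In practice the two query styles usually run on different engines, so this only illustrates the query patterns, not the performance gap.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
    [(1, "ana", 20.0), (2, "ben", 35.0), (3, "ana", 15.0), (4, "cy", 50.0)],
)


def get_order(order_id):
    """OLTP-style query: fetch a single row by its primary key."""
    return conn.execute(
        "SELECT id, customer, amount FROM orders WHERE id = ?", (order_id,)
    ).fetchone()


def spend_per_customer():
    """OLAP-style query: scan the table and aggregate across many rows."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    ).fetchall()
    return dict(rows)


print(get_order(3))
print(spend_per_customer())
```

The first query touches one row and should stay fast as the table grows; the second touches every row, so its cost grows with the data.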
OLAP and OLTP Challenges
For many model applications, an OLAP system is perfectly sufficient. Models that run infrequently and rely solely on the OLAP database for information can be served from it. Unfortunately, this arrangement breaks down once models must return results quickly and reliably in a real-time production environment. To serve such models, your OLAP system is asked to behave as an OLTP system: a highly available, rapid-response system serving orders of magnitude more queries. The common solution is to use the OLAP system to train the models and the OLTP system to serve them. This has three possible outcomes:
- OLTP performance is degraded by giving it OLAP responsibilities.
- Models suffer because they are limited to the data and capabilities that exist in both systems.
- You accrue data errors between the two systems, which can result in catastrophe (like the Mars Climate Orbiter).
Instead of getting the best of both worlds, you get the limitations of both.
There’s an old proverb, often attributed to Confucius: “The hunter who chases two rabbits catches neither.” Instead of trying to repurpose your analytics or production databases to also serve your machine learning models, consider building a database specifically for them. Just as you wouldn’t want to run analytics on your production database, or serve your website from your analytics database, you shouldn’t use either of them for your machine learning systems.
What’s Ahead
So what does a system designed specifically for machine learning look like? On its surface, it looks similar to an OLAP system: Data scientists and machine learning engineers can use it for querying, historic lookups, and new feature generation. This gives it the flexibility and robustness needed for the exploration, analysis, and training that machine learning models require. It also needs to keep the current values of all the features in use available for fast querying and lookups. Finally, it needs to be able to take real-time data updates and use them to update all of the computed features in a fast, reliable manner.
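The three requirements above can be sketched as a toy, in-memory Python class. Everything here (`TinyFeatureStore`, `compute_features`, the feature names) is hypothetical and is not the design of any real feature store. It only shows the shape of the idea: one feature definition, a historical view for training, and a current view for serving.

```python
from collections import defaultdict


def compute_features(amounts):
    """Single feature definition shared by the historical and online paths."""
    return {"order_count": len(amounts), "total_spend": sum(amounts, 0.0)}


class TinyFeatureStore:
    """Toy in-memory sketch; not a production design."""

    def __init__(self):
        self._log = defaultdict(list)  # entity_id -> [(timestamp, amount), ...]
        self._online = {}              # entity_id -> latest feature values

    def ingest(self, entity_id, ts, amount):
        """Record a new event and refresh the entity's current features."""
        self._log[entity_id].append((ts, amount))
        # Recomputing from the full log is wasteful but keeps the toy simple.
        amounts = [a for _, a in self._log[entity_id]]
        self._online[entity_id] = compute_features(amounts)

    def get_online(self, entity_id):
        """Serving path: return precomputed current values (None if unknown)."""
        return self._online.get(entity_id)

    def get_historical(self, entity_id, as_of):
        """Training path: features as they stood at time as_of."""
        events = self._log.get(entity_id, [])
        return compute_features([a for t, a in events if t <= as_of])


store = TinyFeatureStore()
store.ingest("ana", 1, 20.0)
store.ingest("ana", 5, 15.0)
store.ingest("ben", 3, 35.0)
print(store.get_online("ana"))
print(store.get_historical("ana", as_of=2))
```

Because both paths call the same `compute_features` function, training and serving cannot drift apart in how a feature is defined, which is the kind of cross-system mismatch the two-database approach invites.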
Large companies are building out their own versions of these systems, which are often called “feature stores.” Airbnb is building Zipline, Uber is building Michelangelo, and Gojek, working with Google Cloud, is building Feast. Most of the big companies doing machine learning at scale are developing in this direction. The exact architecture of these systems varies from company to company, but all handle both the OLAP and OLTP needs of machine learning models and systems.
OLAP and OLTP systems are reliable, sturdy systems designed for their particular use cases. However, when building machine learning systems at scale, you need a system that has the power of both. You need a feature store.