How AI Relies on a Modernized Enterprise Data Architecture
Enterprise data architecture and data warehouses are being modernized with cloud architectures and database technologies that can deliver the data capabilities required to develop and run analytic models at scale. Training analytic models requires high-quality data, which extends the reach of data management from data warehousing and self-service data analytics to AI. Now more than ever, “garbage in” means “garbage out,” this time in the form of teaching a machine to make bad decisions.
Data management is not a glamorous part of data science projects, but it is critical to every data science and AI project. When applied rigorously throughout a modern data platform, data management covers all enterprise datasets: raw data in a data lake, datasets curated by self-service business analysts, and certified data in warehouses. This makes all enterprise data available to AI projects for building training datasets or for analysis. Proper data governance ensures that the context of the data has been recognized and captured so that its suitability for a given use can be evaluated. Data quality assessments for accuracy, completeness, and correctness provide the information needed to determine whether the data meets the thresholds required for inclusion in the AI project. Data lineage communicates the story of how the data became available and whether its origin or transformations could affect the AI project.
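As a rough illustration of a threshold-based quality assessment, the following Python sketch scores one column for completeness and validity and then applies a pass/fail gate. The threshold values, the function names, and the idea of a per-column "gate" are assumptions made for this example, not part of any particular data quality product.

```python
from typing import Any, Callable

# Hypothetical thresholds; real values would come from governance policy.
THRESHOLDS = {"completeness": 0.95, "validity": 0.90}

Row = dict[str, Any]


def completeness(rows: list[Row], column: str) -> float:
    """Share of rows where the column is present and not None or empty."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def validity(rows: list[Row], column: str, is_valid: Callable[[Any], bool]) -> float:
    """Share of non-missing values that pass a caller-supplied rule."""
    values = [row[column] for row in rows if row.get(column) not in (None, "")]
    if not values:
        return 0.0
    return sum(1 for value in values if is_valid(value)) / len(values)


def passes_quality_gate(
    rows: list[Row],
    column: str,
    is_valid: Callable[[Any], bool],
    thresholds: dict[str, float] = THRESHOLDS,
) -> bool:
    """True only if both scores meet their thresholds."""
    return (
        completeness(rows, column) >= thresholds["completeness"]
        and validity(rows, column, is_valid) >= thresholds["validity"]
    )
```

A dataset that fails the gate would be held back from the training set or flagged for review; in practice the scores themselves would also be recorded as metadata alongside the dataset.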
For decades we have been taught to provide access to raw, unaltered data so that statistical routines could find hidden patterns and signals in vast amounts of data that are imperceptible to analysts. Now, as machine learning and deep learning tackle probabilistic recommendations, they require data that is raw but carefully groomed, unbiased, and representative of the cases to be learned. Building training datasets for supervised learning in machine learning and deep learning projects benefits from data management best practices developed in and supported by enterprise data architectures. From the moment data is created and captured, it goes through a lifecycle of transformation into meaning and information that AI projects both leverage and contribute to.
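One narrow, easily automated check on whether a supervised training set represents its cases is to look at the label distribution. The sketch below is only that: class balance is one proxy among many for bias, and the 10% minimum share and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Hashable, Iterable


def label_shares(labels: Iterable[Hashable]) -> dict[Hashable, float]:
    """Fraction of the training examples carried by each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels: Iterable[Hashable], min_share: float = 0.10) -> list:
    """Labels whose share of the dataset falls below min_share (hypothetical cutoff)."""
    shares = label_shares(labels)
    return sorted(label for label, share in shares.items() if share < min_share)
```

For example, a fraud dataset with 95 "ok" rows and 5 "fraud" rows would report "fraud" as underrepresented, which is a prompt to resample, reweight, or collect more cases before training.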
A modern data architecture is composed of foundational components that allow business analysts and data scientists to iteratively craft datasets in analytics sandboxes or data lab environments and prepare data for analytic models. Working within the overarching enterprise data platform adds the value of enterprise-wide data governance and metadata.
Iteratively building data preparation workflows typically follows one of three approaches, each leveraging different components of the modern data architecture:
- Many data scientists prefer to develop their Python, Scala, Java, or R code in data science notebooks such as Jupyter or Zeppelin. These notebooks are executed on an Apache Spark server or cluster, using Spark SQL or Spark Streaming for data ingestion and then transforming the data in memory for the analytics model to consume. The foundational components in use are Spark as the execution engine, fetching data from the data lake, an analytics sandbox, or enterprise data hubs. Because modern data architectures often continuously ingest and process data via a streaming data hub such as Kafka or Flink, the data science notebook can subscribe to that data as well.
- When business analysts are supporting data science projects, perhaps as citizen data scientists, they typically use a data wrangling or data prep tool rather than a data science notebook. Here, the business analysts iteratively explore datasets and build data workflows that produce the training datasets for a data science notebook to feed into its analytics model. In this case, the notebook does not need to prep the data; it only reads the data into memory and executes the analytics model. These data prep workflows are well-suited to batch-oriented and scheduled datasets used for initial training or reinforcement training. For continuous-training data that incorporates a feedback loop, streaming data to the data science notebook is the better fit.
- Business analysts may also choose to build data preparation in a data virtualization environment as a series of database views that integrate and transform data from the underlying databases. If joining keys are available across the different databases, very wide denormalized datasets can provide hundreds of features for the machine learning model to use. Once the business analyst has connected and defined these views (without having to acquire and integrate data or produce an output dataset somewhere), the data science notebook uses Spark SQL to connect to the virtual database and query the records desired for training the machine learning model. This can be especially useful for reinforcement training on the most recent data available.
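The view-based approach in the last item can be sketched in miniature. The example below uses Python's built-in `sqlite3` module with a single in-memory database as a stand-in for a data virtualization layer; in the setup described above, the views would span separate databases and the notebook would query them through Spark SQL. The table names, columns, and sample rows are all hypothetical.

```python
import sqlite3


def build_training_view(conn: sqlite3.Connection) -> None:
    """Create two source tables and a view that joins them into features."""
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                             customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'retail'), (2, 'enterprise');
        INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 75.0), (12, 2, 400.0);

        -- The view stores no data; it is evaluated when queried, so a
        -- reader always sees the current contents of the source tables.
        CREATE VIEW training_features AS
        SELECT c.customer_id,
               c.segment,
               COUNT(o.order_id) AS order_count,
               SUM(o.amount)     AS total_amount
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.segment;
        """
    )


def fetch_training_rows(conn: sqlite3.Connection) -> list[tuple]:
    """Query the view the way a notebook would query the virtual database."""
    cursor = conn.execute(
        "SELECT customer_id, segment, order_count, total_amount "
        "FROM training_features ORDER BY customer_id"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_training_view(connection)
    for row in fetch_training_rows(connection):
        print(row)
```

Because the view is evaluated at query time, rows added to the source tables appear in the next training query without any rebuild step, which is the property that makes this approach attractive for retraining on recent data.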
In each of these scenarios, the data scientists, business analysts, or data engineers are more efficient when they build on the established modern data architecture and its governed enterprise data assets. The data lake can provide the raw transactional or event data (such as mobile, web, or IoT). The enterprise data hub provides valuable reference data that is highly curated, governed, and full of beneficial attributes for analytics and model features. The analytics sandbox accommodates the self-service datasets that business analysts built from data workflows and data wrangling. And the data unification component of the architecture, whether implemented as database virtualization with cross-database integration views or as REST APIs, makes datasets easily accessible and integrated from anywhere in the data architecture.
Cloud computing is one of the most significant catalysts for the rise of AI development. (The other has been the breakthroughs in machine learning and deep learning themselves.) Cloud platforms offer a near-limitless supply of compute and storage resources for power-hungry AI routines to consume. Thanks to the big data era, vast amounts of data can be stored and processed in the cloud with affordable scalability. “Modernizing” enterprise data architecture takes advantage of the separation of storage and computing resources inherent in cloud-native architectures. As a result, AI development comes naturally in the cloud, with Amazon Web Services, Microsoft Azure, and Google Cloud offering managed services for building training data, machine learning, deep learning, and natural language processing. However, existing data warehouses, analytics databases, and their data integration systems will need to be migrated to the cloud platform to fully modernize the architecture.