
Laying the Perfect Data Foundation for AI


SCALING CHALLENGES

AI requires moving from static data to continuous streams of data, which calls for “very different computing infrastructures than other solutions,” said Van Hentenryck. “Most enterprises are not yet technologically ready for this, although many options exist at this point.” In addition, enterprises need skilled talent to build and maintain these infrastructures. “Scaling AI doesn’t mean you have to fix everything up front, but identifying which databases are in good shape, fixing those that aren’t, and then maintaining them consistently going forward is critical. Without that foundation, expanding AI initiatives introduces risk rather than value,” McMillan said.

Even with AI helping to write queries or a user interface for selecting options, “a human user isn’t going to make requests or leverage your infrastructure more than a few times a minute,” said Bowden. “An AI agent can run 10 queries in a second, hundreds per minute, and more agents can be working in parallel than humans. Infrastructure that worked fine for a few humans working at human pace is going to be woefully underequipped to handle dozens of agents retrieving data out of the database every second of every day.”
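To make that mismatch concrete, here is a minimal Python sketch; the pool size, agent count, and query latency are made-up numbers, and the semaphore merely stands in for a real database connection pool.

```python
import asyncio
import time

# Hypothetical sizing: a pool provisioned for human-paced traffic,
# hit by a fleet of AI agents issuing queries in parallel.
POOL_SIZE = 10          # connections sized for a few human users (assumption)
AGENTS = 50             # concurrent agents (assumption)
QUERIES_PER_AGENT = 10  # queries each agent fires at once (assumption)

pool = asyncio.Semaphore(POOL_SIZE)  # stand-in for a DB connection pool

async def run_query(agent_id: int, query_id: int) -> None:
    async with pool:               # wait for a free connection
        await asyncio.sleep(0.05)  # pretend the query takes 50 ms

async def agent(agent_id: int) -> None:
    # Unlike a human user, each agent fans out its queries concurrently.
    await asyncio.gather(*(run_query(agent_id, q)
                           for q in range(QUERIES_PER_AGENT)))

async def main() -> None:
    start = time.perf_counter()
    await asyncio.gather(*(agent(a) for a in range(AGENTS)))
    elapsed = time.perf_counter() - start
    print(f"{AGENTS * QUERIES_PER_AGENT} queries through {POOL_SIZE} "
          f"connections took {elapsed:.1f}s")  # roughly 2.5s, mostly queueing

asyncio.run(main())
```

Five hundred 50-ms queries serialize behind the 10-connection pool and take roughly 2.5 seconds, and real agents do not stop after one batch.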

Data governance also needs the full attention of data leaders and their business counterparts. “As data volumes grow and more systems get connected, the question of who owns what data, what’s allowed to be used for what purpose, and how you audit that becomes genuinely hard,” said Kovi. “Latency is another issue. Real-time AI applications need data that is current, not hours old. Most enterprise pipelines weren’t built for that speed.”

What is needed is “a ground-up redesigned data layer that brings together data across the inference layer, memory layer, and database layer,” said Ranganathan. Applications need to be able to “scale and survive any downtimes and failures.” Importantly, all of this must scale, Ranganathan emphasized. “If every new feature requires you to spin up a specialized database, development slows down, operations become complicated, maintenance costs increase, and the data layer becomes a silent productivity killer.”

To address these problems, data managers “need to radically simplify [the] stack by converging data capabilities into a single, scalable system,” said Ranganathan. “Instead of managing traditional databases, vector stores, graph systems, scalable key-value stores, and search engines all separately, you need to unify them within a single familiar, PostgreSQL-compatible infrastructure.” The bottom line is that “data needs to be made available to AI and human users,” Thoumas said. “Companies need to move beyond simply cataloging their data to making it accessible and consumable through self-service data product marketplaces.”
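As a sketch of what that convergence can look like in practice, the following Python snippet assumes a PostgreSQL server with the open source pgvector extension installed and the psycopg2 driver; the connection string, table, and data are hypothetical. One SQL statement combines a relational filter with a vector similarity search, with no separate vector store to deploy or monitor.

```python
import psycopg2  # assumes PostgreSQL with the pgvector extension available

# Hypothetical connection string; substitute real credentials.
conn = psycopg2.connect("dbname=appdb user=app")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id        serial PRIMARY KEY,
        body      text,          -- ordinary relational column
        embedding vector(3)      -- pgvector column (toy dimensionality)
    );
""")
cur.execute(
    "INSERT INTO documents (body, embedding) VALUES (%s, %s::vector);",
    ("quarterly revenue report", "[0.1, 0.9, 0.2]"),
)

# A single query mixes a relational predicate with nearest-neighbor search.
cur.execute(
    """
    SELECT body
    FROM documents
    WHERE body ILIKE %s
    ORDER BY embedding <-> %s::vector   -- pgvector L2 distance operator
    LIMIT 5;
    """,
    ("%report%", "[0.1, 0.8, 0.3]"),
)
print(cur.fetchall())
conn.commit()
```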

ESSENTIAL INGREDIENTS

Successful AI applications sit on top of “a complicated stack of databases, including relational databases, vector stores, graph databases, RAG pipelines, and indexing systems,” said Ranganathan. “Each component often runs as a separate system and must be deployed, scaled, secured, and monitored, adding additional layers of operational complexity.”

To address this complexity, consider the following measures:

  • Work toward greater cohesion across all data in the enterprise—“Structured and unstructured multimodal data live in separate systems, and metadata is incomplete or stale,” said Joshi. “A strong semantic layer—capturing business definitions, lineage, and relationships—becomes critical. And it can’t be static. It needs to evolve through feedback loops, human input, and model interactions so that the system’s understanding of the business improves over time. For LLM-powered systems, metadata and context are part of the reasoning process itself.”

AI-ready database infrastructure “should be adept at handling both structured and unstructured data,” according to Venkat. “It should include polyglot persistence by default, enabling efficient normalization across batch and streaming data from relational, graph, vector, and other databases.”

  • Seek enhanced support for AI workloads—AI workloads present new challenges and requirements for data sites. This starts with “coordinating heterogeneous, accelerator-driven workloads,” said Jaikumar Ganesh, head of engineering at Anyscale. “Modern AI pipelines combine CPU-bound preprocessing, GPU-bound training and inference, and, increasingly, reinforcement learning or evaluation loops. These stages have different scaling characteristics and resource requirements. Pipelines often fan out into thousands of fine-grained tasks connected by data and control dependencies.” (A minimal sketch of this fan-out pattern appears after this list.)
  • Implement monitoring and observability for greater transparency—The most critical technologies to AI “are those that give you visibility and control over your data estate as it evolves,” McMillan said. “You can’t govern what you can’t see, so monitoring and observability tools are foundational. They allow teams to understand how their data environments are actually behaving, spot anomalies, and identify issues before they affect downstream systems or, worse, before AI starts producing outputs based on data that’s quietly degraded. As environments grow larger and more complex, that early warning capability becomes increasingly important.”

Data catalogs and observability tools are also key ingredients of an AI-ready data foundation. “Vector databases are becoming essential for anything involving unstructured data or language models,” said Kovi. “And honestly, solid data pipelines with real monitoring matter more than whatever the newest platform is. The boring infrastructure stuff is what actually determines whether AI projects work or fail.”

  • Build a data layer—The long-term success of AI, Ranganathan said, “depends on consolidating and strengthening the data layer first, starting with a unified data foundation for AI that supports a distributed architecture for resilience, high availability, and seamless scaling.”
  • Look to AI to better manage AI—While AI taxes data infrastructures, it also provides a solution. Today’s data infrastructures “are under constant pressure. They’re being continuously updated with new operational data flowing in from the transactional platforms (SQL Server, Oracle, MySQL, PostgreSQL) that keep the business running day to day,” McMillan said. “Keeping that data reliable, well-governed, and accessible only to those with the right to see it is not a small task, but it’s one that AI is increasingly being applied to help manage.”

AI usage in database management “rose from 15% to 44% in a single year within a survey conducted by Redgate,” McMillan relayed. “This steep change … reflects just how quickly organizations are recognizing that managing the data for AI is as important as deploying AI on the data.”
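Circling back to Ganesh’s point about pipelines that fan out into thousands of fine-grained tasks with mixed CPU and GPU requirements, here is a minimal sketch using the open source Ray framework, which Anyscale maintains; the task bodies are placeholders and the fan-out width is illustrative.

```python
import ray  # open source distributed-computing framework maintained by Anyscale

ray.init()

# CPU-bound preprocessing: many fine-grained tasks fan out in parallel.
@ray.remote
def preprocess(record: int) -> int:
    return record * 2  # placeholder for real feature extraction

# GPU-bound inference stage; the decorator declares its resource needs.
# (On a machine without GPUs, drop num_gpus=1 or the task will wait forever.)
@ray.remote(num_gpus=1)
def infer(batch: list) -> float:
    return sum(batch) / len(batch)  # placeholder for a model forward pass

# Fan out a thousand fine-grained CPU tasks, gather their results, and
# feed them to the GPU stage, expressing the data dependency between
# two stages with very different resource profiles.
features = ray.get([preprocess.remote(i) for i in range(1000)])
print(ray.get(infer.remote(features)))
```

The arithmetic is beside the point; what the sketch shows is one scheduler placing CPU-heavy and GPU-heavy stages on appropriate resources, rather than a separate system for each stage.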
