
Laying the Perfect Data Foundation for AI


AI agents, applications, and bots are popping up across every enterprise landscape, and data is rolling through at a fast and furious pace.

While there has been plenty of hype, excitement, and fear about AI’s possibilities, scant attention has been paid to the data infrastructure needed to make it all work—especially at the enterprise level. AI requires a strong and well-considered data foundation, whether it’s extended out of legacy infrastructures or part of the next generation of data technology.

“This trend is going to continue in almost every industry, and as AI moves from experimentation to core business operations, the pressure on the data layer also intensifies,” said Karthik Ranganathan, co-CEO and co-founder of Yugabyte. To explore the shape of the data foundation in the AI age, we canvassed experts and leaders across the industry, who weighed in on the thinking and work required to build a well-functioning data environment responsive to today’s and tomorrow’s enterprise AI initiatives. Remember, it’s all about the data.

ARE WE READY?

There is general agreement among data leaders and experts speaking to BDQ that today’s data infrastructures are not ready for the demands of AI. “Enterprises can power impressive AI demos, but underneath there is still disordered data, identity sprawl, and fragmented platforms,” said Mark Gowdy, chief partner technologist at Quest Software.

Most enterprise data infrastructures “have been built and deployed during the analytics era,” according to Vikram Venkat, principal at Cota Capital. “Most data was structured, row-column-optimized tables were the predominant format, and batch processing was sufficient for most large-scale requirements.”

What’s missing for most enterprises is “a clear, consolidated, well-defined data model,” said Cole Bowden, developer advocate at InfluxData. “When you’re trying to use AI to leverage your data, it will explore your tables, your schemas, your column names, and it has to make inferences about how all of the data is related and how plain-text concepts relate to all of the data you have.”

Along these lines, “any column named ‘col1’ is functionally invisible and useless to any AI agent,” Bowden illustrated. “If you have multiple similarly named columns that contain different data or different concepts, AI will struggle heavily. Your data model is your documentation for AI, and most data models are not clean enough to power AI fully.”
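Bowden’s point can be sketched in a few lines: a thin, human-maintained semantic layer maps opaque column names to descriptive names and plain-text definitions that an AI agent can reason over. All column names and descriptions below are hypothetical, purely for illustration.

```python
# A minimal sketch of "your data model is your documentation for AI":
# opaque column names are paired with descriptive names and plain-text
# definitions an agent can match concepts against. Names are invented.

RAW_COLUMNS = ["col1", "col2", "col3"]

# Human-maintained semantic layer: raw name -> (descriptive name, definition).
SEMANTIC_MODEL = {
    "col1": ("customer_id", "unique identifier for the customer"),
    "col2": ("order_total_usd", "order value in US dollars"),
    "col3": ("order_ts", "timestamp when the order was placed"),
}

def describe_schema(raw_columns, semantic_model):
    """Return (name, definition) pairs; flag anything undocumented."""
    described = []
    for col in raw_columns:
        name, definition = semantic_model.get(col, (col, "UNDOCUMENTED"))
        described.append((name, definition))
    return described

for name, definition in describe_schema(RAW_COLUMNS, SEMANTIC_MODEL):
    print(f"{name}: {definition}")
```

A column missing from the semantic layer surfaces as `UNDOCUMENTED`, which is exactly the “functionally invisible” case Bowden describes.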

The result is an “AI that can’t be fully trusted, traced, or secured in production,” Gowdy pointed out. “A small minority of enterprises are closing that gap, but only where they’ve deliberately invested in finding, understanding, and governing their data while consolidating onto modern platforms with strong identity at the center. Until companies fix that housekeeping, it’s akin to dumping a junk drawer of random parts and information into a scalable environment and calling it AI-ready.”

While creating AI-powered environments gets easier every year, the underlying data architecture can make or break an AI application. This is where things get complicated. “Building on top of existing traditional databases becomes painfully complex,” said Ranganathan. “It requires multiple databases, each supporting a different data model—typically SQL, vector, graph, search, and time-series. Hacking all these databases together creates a huge problem of data silos, which drastically slows down development, degrades performance, and leaves no observability into the system.”

Chetas Joshi, a software engineer at Robinhood, said he has seen firsthand “that most data infrastructure is unprepared to support AI and machine-learning use cases.” AI systems—especially large language model (LLM)-powered applications, real-time decisioning systems, and copilots—“bring very different requirements,” he explained. “Use cases like fraud detection, personalization, recommendations, and RAG [retrieval-augmented generation] rely on continuous streams of high-volume, often unstructured data. They need fresh signals—high-throughput, low-latency data-processing requirements—and the right context.”
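As a rough illustration of the retrieval step in a RAG pipeline, the sketch below ranks documents by vector similarity while filtering out ones that are not fresh enough—the two requirements Joshi names. The embeddings are tiny hand-made vectors; a real system would use a learned embedding model and a vector index.

```python
# Toy RAG retrieval: rank fresh-enough documents by cosine similarity
# to a query vector. Document IDs, vectors, and timestamps are invented.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# (doc_id, embedding, freshness_ts) -- freshness matters for real-time use.
DOCS = [
    ("refund_policy", [0.9, 0.1, 0.0], 100),
    ("fraud_alert",   [0.1, 0.9, 0.2], 250),
    ("old_fraud_doc", [0.1, 0.8, 0.2], 10),
]

def retrieve(query_vec, docs, k=2, min_ts=0):
    """Return the top-k doc IDs among documents newer than min_ts."""
    fresh = [(doc_id, cosine(query_vec, vec))
             for doc_id, vec, ts in docs if ts >= min_ts]
    fresh.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in fresh[:k]]

# Stale documents are excluded before ranking ever happens.
print(retrieve([0.0, 1.0, 0.1], DOCS, k=1, min_ts=50))
```

The freshness filter runs before similarity ranking, so a stale document never reaches the model as context no matter how well it matches the query.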

The gap between AI capability and enterprise readiness “is less about hardware and more about hygiene,” advised Anusha Kovi, business intelligence engineer with Amazon. “Most enterprise data environments were built to store and report, not to feed models. That means the data is siloed, inconsistently labeled, poorly documented, and not structured in a way that makes it usable for AI without significant cleanup work first. The infrastructure exists, but the foundation underneath it was never designed with this use case in mind.”

Practical challenges also need to be considered—“managing costs as data volumes grow, keeping data fresh enough for real-time use cases, handling model drift, and maintaining governance across distributed systems,” Joshi added. “Supporting AI at scale requires high-throughput streaming ingestion, online serving layers backed by an intelligent caching layer, vector indexing, and reliable object storage all working together.”
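The caching idea behind such a serving layer can be sketched as a simple time-to-live cache in front of a backing store: lookups stay fast, and entries are reloaded once they are no longer “fresh enough.” The TTL value and loader are placeholders, not any particular product’s API.

```python
# A minimal TTL cache sketch for an online serving layer: cached values
# expire after ttl_seconds, forcing a reload from the backing store so
# served data stays fresh enough for real-time use cases.
import time

class TTLCache:
    def __init__(self, ttl_seconds, loader):
        self.ttl = ttl_seconds
        self.loader = loader        # fetches a value from the backing store
        self._entries = {}          # key -> (value, expiry_time)

    def get(self, key):
        entry = self._entries.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return entry[0]         # cache hit: value is still fresh
        value = self.loader(key)    # miss or stale: reload and re-stamp
        self._entries[key] = (value, now + self.ttl)
        return value
```

Tuning the TTL is the cost-versus-freshness trade-off Joshi describes: a longer TTL cuts load on the backing store, a shorter one narrows the window in which stale data can be served.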

“Enterprise data is distributed across many systems without a uniform means of access,” said Pascal Van Hentenryck, A. Russell Chandler III Chair and professor at Georgia Tech and recently appointed head of the Gurobi AI Innovation Lab. “Security, privacy, and the need to map roles and users for access control purposes collectively create another layer of complexity. So, the challenge is to move from data to workflows—moving from data infrastructures to workflow infrastructures. It creates a paradigm shift.”

APPLICATIONS FOR AI

One thing is certain: AI-based applications are hungry for data—and lots of it. “From forecasting to optimization to generative and agentic solutions, AI is data-driven and thus data-hungry by definition,” said Van Hentenryck. At this point, AI is showing up in “real-time decisioning, fraud detection, personalization, and operational analytics,” said Gowdy. “You also see LLM-based copilots and assistants sitting on top of warehouses and lakehouses, answering questions over documents, logs, and reports, plus heavy regulatory and risk analytics that blend structured and unstructured data.”

LLMs are being used to “power better decision-making through more informed analytics, while agentic AI enables the automation of end-to-end processes to increase speed and efficiency,” said David Thoumas, co-founder and CTO of Huwise (previously called Opendatasoft). “Both of these rely on access to trustworthy data that is reliable, understandable, and provides context and consistency.”

AI is being drawn toward the richest, most established datasets. These consist of “data warehouses and back-office systems like CRMs that have grown up alongside the organization itself,” according to Graham McMillan, CTO of Redgate. “These contain the messiest but most valuable data an enterprise holds, and that’s precisely why people want to deploy AI against them.”

AI is being employed against these datasets to “spot trends in customer behavior, identify churn before it happens, and find cross-sell opportunities hiding in plain sight,” McMillan said. Additional use cases include “real-time fraud prevention, personalized digital banking, telecom network optimization, logistics coordination, ecommerce recommendation engines, and intelligent retail pricing,” said Ranganathan. One example is “an AI-driven shopping concierge on an ecommerce site that may depend on past transactional data, website behavior, locale information, and current conversation context to make accurate decisions instantly, often across globally distributed environments.”
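A toy sketch of how such a concierge might weigh the signal types Ranganathan lists—past transactions, site behavior, locale, and conversation context—into a single product ranking. The fields and weights below are invented for illustration, not a real scoring model.

```python
# Combine several shopper signals into one score per product, then
# recommend the highest-scoring product. All fields/weights are invented.
def score_product(product, shopper):
    score = 0.0
    if product["category"] in shopper["past_purchase_categories"]:
        score += 2.0                      # past transactional data
    if product["id"] in shopper["recently_viewed"]:
        score += 1.5                      # website behavior
    if shopper["locale"] in product["ships_to"]:
        score += 1.0                      # locale information
    if any(word in product["name"].lower()
           for word in shopper["conversation_keywords"]):
        score += 2.5                      # current conversation context
    return score

def recommend(products, shopper):
    """Return the single best-scoring product for this shopper."""
    return max(products, key=lambda p: score_product(p, shopper))
```

A production concierge would learn these weights and fetch each signal from a different system, which is precisely why the globally distributed, low-latency data layer described above matters.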
