Conquering the complexities of unstructured data to fuel generative AI (GenAI) apps remains a crucial challenge for many organizations. Delivering that unstructured data as clean, canonical JSON can define your data layer’s success; done poorly, the data layer can be the downfall of your application.
From Surviving to Thriving With GenAI in Production: Lessons To Successfully Scale Your Data Layer, DBTA’s latest webinar, featured the expertise of Amy Ghate, solutions architect, Elastic, and Virginia Stehle, solutions architect, Unstructured, as they explored how to build robust, clean GenAI data systems, from proper ETL pipeline creation to connecting data to downstream retrieval-augmented generation (RAG) apps.
As the title of this webinar suggests, we’re in a time of survival: amid the GenAI wave, proofs of concept struggle against data bottlenecks when they’re brought into production. From data pipelines to embedding models and vector databases, one thing is clear: GenAI needs data and business context.
However, 80% of data is trapped in unstructured file types, leaving a wealth of data unutilized by the systems that thrive on it. This is further compounded by the fact that data pipelines often require huge, dedicated teams just for data transformation and pipeline maintenance, making the challenge of utilizing unstructured data even more complex.
To overcome these obstacles, “What a lot of companies end up creating… [is a] DIY data layer…[which] we call a rat’s nest,” said Stehle. “[This rat’s nest] is a reality of…multiple components, custom code, a third-party library…that all need to be integrated for every single step.” Aside from its exorbitant cost and maintenance burden, a patchwork data layer also becomes obsolete when new paradigms emerge or components evolve.
This is the essence of GenAI survival mode: juggling bespoke data connectors; complex, low-quality partitioning, chunking, and embedding; tool proliferation; poor search relevancy; and misallocated engineering talent. On the other hand, a thriving GenAI estate is defined by:
- Ready-to-use connectors for GenAI with rich metadata
- High-quality data in consistent canonical JSON
- Centralized and efficient toolset
- Highly relevant, high-trust retrieval
- Engineers focused on end-user features
Unstructured streamlines the enterprise tech stack by ingesting data through more than 40 source connectors and supporting more than 65 file types. With three transformation strategies for partitioning and integrations with third-party large language models (LLMs), Unstructured transforms the data into canonical JSON with more than 30 metadata fields. From there, Unstructured can apply a variety of enrichments, including chunking, embeddings, and custom integrations. This simple, stable, scalable JSON is then directed to the destination or vector database of the customer’s choice, including Elastic.
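To make the partition, chunk, and serialize flow described above more concrete, here is a minimal sketch in plain Python. It does not use Unstructured’s actual API or schema: the `Element` class, the `partition` and `chunk_by_title` functions, the element type labels, and the metadata field names are all hypothetical stand-ins chosen for illustration, and the partitioning rule (blank-line splitting with a crude title heuristic) is a deliberate simplification.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Element:
    """One partitioned piece of a document (hypothetical schema)."""
    type: str
    text: str
    metadata: dict = field(default_factory=dict)


def partition(raw_text: str, filename: str) -> list[Element]:
    """Split raw text on blank lines; short single-line blocks that do
    not end with a period are treated as titles (a crude heuristic)."""
    blocks = [b.strip() for b in raw_text.split("\n\n") if b.strip()]
    elements = []
    for index, block in enumerate(blocks):
        is_title = "\n" not in block and len(block) < 60 and not block.endswith(".")
        elements.append(Element(
            type="Title" if is_title else "NarrativeText",
            text=block,
            metadata={"filename": filename, "element_index": index},
        ))
    return elements


def chunk_by_title(elements: list[Element], max_chars: int = 200) -> list[dict]:
    """Group elements into chunks, starting a new chunk at each title
    or when adding an element would exceed max_chars."""
    groups, current, size = [], [], 0
    for el in elements:
        starts_section = el.type == "Title"
        too_big = size + len(el.text) > max_chars
        if current and (starts_section or too_big):
            groups.append(current)
            current, size = [], 0
        current.append(el)
        size += len(el.text)
    if current:
        groups.append(current)
    return [
        {
            "type": "CompositeElement",
            "text": "\n".join(e.text for e in group),
            "metadata": {
                "filename": group[0].metadata["filename"],
                "element_indexes": [e.metadata["element_index"] for e in group],
            },
        }
        for group in groups
    ]


SAMPLE = """Quarterly Report

Revenue grew in the third quarter.

Risks

Supply costs remain volatile."""

elements = partition(SAMPLE, "report.txt")
chunks = chunk_by_title(elements)
canonical_json = json.dumps(chunks, indent=2)  # same shape for every source file
print(canonical_json)
```

The point of the sketch is the shape of the output rather than the heuristics: every source file ends up as the same list of records with `type`, `text`, and `metadata` keys, so downstream embedding and indexing code only has to handle one format.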
Elastic’s vector database, Elasticsearch, provides the full scope of necessary capabilities for RAG applications, beyond those provided by point-solution vector databases. These include automated chunking, role-based access control (RBAC), document-level security, search analytics, the choice and flexibility of embedding models, and more.
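As a rough illustration of why document-level security matters for RAG retrieval, the sketch below filters documents by the caller’s roles before ranking them by vector similarity. It is an in-memory toy, not Elasticsearch’s API: the `DOCS` records, the `allowed_roles` field, the three-dimensional vectors, and the `search` function are all hypothetical and exist only to show the idea.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical indexed chunks: each carries an embedding and an access list.
DOCS = [
    {"id": "hr-1", "allowed_roles": {"hr"}, "vector": [0.9, 0.1, 0.0]},
    {"id": "eng-1", "allowed_roles": {"engineering", "hr"}, "vector": [0.8, 0.2, 0.1]},
    {"id": "pub-1", "allowed_roles": {"engineering", "hr", "sales"}, "vector": [0.1, 0.9, 0.2]},
]


def search(query_vector: list[float], user_roles: set[str], k: int = 2) -> list[str]:
    """Return the ids of the top-k most similar documents the user may see.

    The role filter runs before ranking, so restricted documents can never
    reach the prompt of a downstream LLM, however relevant they are.
    """
    visible = [d for d in DOCS if d["allowed_roles"] & user_roles]
    ranked = sorted(visible, key=lambda d: cosine(query_vector, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:k]]


query = [1.0, 0.0, 0.0]
print(search(query, {"hr"}))
print(search(query, {"sales"}))
```

With the same query, a user holding the `hr` role retrieves the two closest documents, while a `sales` user only ever sees the one document shared with that role.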
This is only a snippet of the full From Surviving to Thriving With GenAI in Production: Lessons To Successfully Scale Your Data Layer webinar. For more detailed explanations, demos, a Q&A, and more, you can view an archived version of the webinar here.