Playing with the Potential of Vector Databases at Data Summit 2024

Machine learning is primarily concerned with accuracy and pattern recognition, while natural language processing (NLP) is concerned with computer-human language interactions, specifically how to program computers to process and analyze large amounts of natural language data.

At Data Summit 2024, Christy Bergman, developer advocate, Zilliz, presented how to use an open source vector database to power GenAI chatbots, ecommerce recommenders, and similarity search-based apps during her workshop, “Taking Advantage of Machine Learning & Natural Language Processing.”

The annual Data Summit conference returned to Boston, May 8-9, 2024, with pre-conference workshops on May 7.

“Right now, there’s a lot of different vector databases out there,” Bergman said.

One of the problems with GenAI is the “hallucination” issue, she explained. These foundational models are trained on sequences of tokens. Parts of each sequence are masked, and the model predicts what lies behind the mask. As a result, a model can produce inaccurate information based on what it was trained on.

Vectors come from large language models (LLMs) and, in previous iterations, from deep learning models that have learned mappings of inputs to outputs. Those outputs are vectors. If that model is open source, those vectors can be used, Bergman said.

Retrieval Augmented Generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources. This can help reduce the chance of “hallucinations.”
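The retrieve-then-generate idea can be illustrated with a minimal Python sketch. It is not code from the workshop: the passages, the word-overlap retriever, and the `retrieve` and `build_prompt` helpers are all hypothetical, and the final call to a generative model is left out. A real system would typically retrieve with vector similarity rather than word overlap.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, passages, k=1):
    """Toy retriever (assumption): rank passages by shared words with the question."""
    q_terms = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_terms & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_passages):
    """Place retrieved facts ahead of the question so the model can ground its answer."""
    context = "\n".join(f"- {p}" for p in context_passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical external knowledge source.
passages = [
    "Milvus is an open-source vector database.",
    "Data Summit 2024 took place in Boston.",
]

question = "What is Milvus?"
prompt = build_prompt(question, retrieve(question, passages, k=1))
print(prompt)
```

The prompt that reaches the model carries the retrieved passage, which is what gives the model a factual source to draw from instead of relying only on its training data.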

By combining sparse vector models with dense vector models, you can create a hybrid search that pairs keyword search with semantic search, according to Bergman.
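As a rough illustration of how the two signals might be blended, the Python sketch below scores each document with a keyword-overlap measure (standing in for a sparse model) and a cosine similarity over made-up two-dimensional embeddings (standing in for a dense model), then takes a weighted sum. The documents, vectors, `alpha` weight, and helper names are assumptions for illustration only, not the method shown in the workshop.

```python
import math


def keyword_score(query, text):
    """Sparse-style signal: fraction of query terms that appear in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def cosine(a, b):
    """Dense-style signal: cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_text, query_vec, docs, alpha=0.5):
    """Blend both signals; alpha weights the dense score, (1 - alpha) the sparse one."""
    scored = []
    for doc_id, (text, vec) in docs.items():
        score = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query_text, text)
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents with made-up 2-dimensional embeddings.
docs = {
    "a": ("create a collection in milvus", [0.9, 0.1]),
    "b": ("delete a collection", [0.2, 0.8]),
    "c": ("build an index for fast search", [0.6, 0.4]),
}

ranked = hybrid_search("create collection", [1.0, 0.0], docs)
order = [doc_id for doc_id, _ in ranked]
print(order)
```

Document "c" shares no keywords with the query but still outranks "b" because its embedding is closer to the query vector, which is the kind of result a keyword-only search would miss.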

Milvus, maintained by Zilliz developers, is an open-source vector database designed to handle high-dimensional vectors efficiently.

In the context of NLP and document retrieval, vectors are numerical representations of text or other data points in a high-dimensional space. These vectors capture the semantic meaning and relationships between data points, allowing for similarity-based searches and comparisons.
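A similarity-based search of this kind can be sketched in a few lines of Python. The three-dimensional vectors below are invented for illustration (real embeddings usually have hundreds or thousands of dimensions), and cosine similarity is only one of several metrics a vector database may support.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings for three documentation pages.
doc_vectors = {
    "install guide": [0.9, 0.1, 0.0],
    "api reference": [0.1, 0.9, 0.1],
    "release notes": [0.0, 0.2, 0.9],
}

query_vector = [0.8, 0.2, 0.0]  # made-up embedding of a question about installation
results = top_k(query_vector, doc_vectors, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

This brute-force loop compares the query against every stored vector. Vector databases exist largely to avoid that cost at scale by using indexes that approximate the same ranking.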

For the rest of the workshop, Bergman walked attendees through a practical project, embedding GitHub project documentation, storing vectors in Milvus, and conducting structured queries to answer questions based on the documentation.

Many Data Summit 2024 presentations are available for review at