RAG in 2025: State of the Art and the Road Forward — Tengyu Ma, MongoDB (acq. Voyage AI)

Introduction to RAG and Alternatives 00:28

  • RAG (Retrieval-Augmented Generation) is crucial for applying Large Language Models (LLMs) and agents to enterprise data, since out-of-the-box models cannot access proprietary information without risking data leakage.
  • Methods to ingest proprietary data include fine-tuning, long context, and RAG.
  • Long context involves dumping all documents (potentially billions of tokens) into the LLM's context window for every query.
  • Fine-tuning involves updating an LLM's parameters by training it on proprietary data, then using these updated parameters to generate responses.
  • RAG dynamically retrieves a small subset of relevant documents for each query and feeds them to the LLM to generate a response (see the sketch below).
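
To make the contrast with long context concrete, here is a minimal sketch of the retrieve-then-generate loop, assuming the documents have already been embedded into `doc_vectors`; `embed` and `generate` are placeholders for whatever embedding model and LLM are actually used, not a specific vendor API.

```python
import numpy as np

def embed(texts):
    """Placeholder for an embedding model call; returns one vector per text."""
    raise NotImplementedError

def generate(prompt):
    """Placeholder for an LLM call; returns the generated answer."""
    raise NotImplementedError

def rag_answer(query, doc_texts, doc_vectors, k=5):
    # Embed the query into the same vector space as the documents.
    q = embed([query])[0]
    # Cosine similarity between the query and every stored document vector.
    sims = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q) + 1e-9
    )
    # Keep only the top-k most relevant chunks rather than the whole corpus.
    top = np.argsort(-sims)[:k]
    context = "\n\n".join(doc_texts[i] for i in top)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```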

Why RAG is Preferred 02:27

  • The speaker believes fine-tuning is difficult, usually unnecessary, and problematic: it can cause the model to forget knowledge, and it complicates data governance.
  • Long context is considered inefficient and not cost-effective in the long run, akin to scanning an entire library for every question.
  • RAG is preferred for its simplicity, modularity, reliability, speed, and cost-effectiveness, mirroring how humans retrieve relevant information from a library.

RAG Implementation and Recent Improvements 04:38

  • RAG implementation involves embedding models (vectorizing documents and queries), a vector database (storing and searching the vectors), and an LLM (generating answers).
  • Significant improvements in retrieval accuracy have been observed in the last two years, with new models offering better accuracy at lower costs and improved scaling.
  • Current average retrieval accuracy is about 80% across 100 datasets, with some tasks achieving 90-95% accuracy.
  • Matryoshka representation learning and quantization reduce vector storage costs by using lower-dimensional or lower-precision vectors, enabling 10x to 100x savings with minimal performance loss (see the sketch after this list).
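
As a rough illustration of where those savings come from, the sketch below truncates Matryoshka-style embeddings to their leading dimensions and quantizes them to int8; the specific numbers (1024 -> 256 dimensions, float32 -> int8, ~16x smaller) are an assumed example configuration, not figures from the talk.

```python
import numpy as np

def compress(vectors, keep_dims=256):
    """Shrink Matryoshka-style embeddings via truncation plus int8 quantization.

    Matryoshka-trained models concentrate the most useful information in the
    leading dimensions, so truncation costs little accuracy.
    """
    # 1024 -> 256 dimensions: 4x fewer values to store.
    truncated = vectors[:, :keep_dims].astype(np.float32)
    # Re-normalize so cosine similarity still behaves as expected.
    truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)
    # float32 -> int8: another 4x, for ~16x total in this configuration.
    scale = np.abs(truncated).max() / 127.0
    quantized = np.round(truncated / scale).astype(np.int8)
    return quantized, scale

# Example: 1M vectors at 1024 float32 dims take ~4 GB; 256 int8 dims take ~256 MB.
vectors = np.random.randn(1000, 1024).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
quantized, scale = compress(vectors)
```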

Techniques for Better RAG 08:13

  • Using superior embedding models is the simplest way to enhance RAG performance.
  • Hybrid search (combining lexical and vector search), followed by a re-ranker, can improve results (see the sketch after this list).
  • Query and document enhancement techniques include query decomposition (breaking queries into sub-queries) and document enrichment (adding metadata or LLM-generated context to document chunks).
  • Domain-specific embeddings, customized for particular fields like code, offer improved performance and better storage cost-accuracy trade-offs.
  • Fine-tuning embedding models with custom data and using additional retrieval layers (e.g., graph, iterative retrieval) can also enhance RAG.
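
As an example of combining lexical and vector signals with a re-ranking step, the sketch below merges the two result lists with reciprocal rank fusion and then rescores the shortlist; `lexical_search`, `vector_search`, and `rerank` are placeholders for whatever search engine, vector index, and re-ranker model are actually used.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists, rewarding documents that rank highly in any
    list (standard RRF: score = sum over lists of 1 / (k + rank))."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query, lexical_search, vector_search, rerank, top_k=10):
    # Lexical matching (exact terms, IDs, rare words) and embedding similarity
    # surface different documents, so retrieve candidates from both.
    lexical_ids = lexical_search(query, limit=50)   # e.g., BM25 over an inverted index
    vector_ids = vector_search(query, limit=50)     # e.g., ANN search over embeddings
    fused = reciprocal_rank_fusion([lexical_ids, vector_ids])[:50]
    # A cross-encoder style re-ranker scores the small fused shortlist precisely.
    return rerank(query, fused)[:top_k]
```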

Future Vision for RAG 10:43

  • RAG is expected to remain a fundamental component of AI systems due to its efficiency in handling large datasets by selectively retrieving relevant information, similar to human cognition.
  • The future of RAG will see a shift towards more capable "model layers" (embedding models, re-rankers, LLMs), reducing the need for complex "tricks" like intricate parsing and chunking strategies.
  • Multimodal embeddings will dramatically simplify workflows by turning diverse data types (screenshots of documents, tables, video frames) directly into vectors, eliminating separate extraction and embedding steps (see the sketch after this list).
  • Context-aware auto-chunking embeddings will automatically segment long documents, producing vectors that capture both the specific chunk and global information from the surrounding chunks, balancing storage cost against retrieval focus.
  • Future plans include offering fine-tuning APIs so users can fine-tune embedding models on their own data.
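
To illustrate the workflow simplification described above, the sketch below contrasts today's parse-then-embed pipeline with directly embedding page screenshots via a multimodal model; all of the helper functions (`parse_pdf`, `chunk`, `embed_text`, `render_pages`, `embed_image`) are hypothetical placeholders, not an existing product API.

```python
# Today: parse the document, extract text (lossy for tables and figures),
# apply a hand-tuned chunking strategy, then embed each chunk.
def index_pdf_today(pdf_path, parse_pdf, chunk, embed_text):
    text = parse_pdf(pdf_path)
    return [embed_text(c) for c in chunk(text)]

# Envisioned: embed page screenshots directly with a multimodal model,
# skipping parsing, extraction, and chunking heuristics entirely.
def index_pdf_multimodal(pdf_path, render_pages, embed_image):
    return [embed_image(page) for page in render_pages(pdf_path)]
```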