RAG in 2025: State of the Art and the Road Forward — Tengyu Ma, MongoDB (acq. Voyage AI)

Introduction to RAG and Alternatives 00:28

  • RAG (Retrieval-Augmented Generation) is crucial for applying Large Language Models (LLMs) and agents to enterprise data, since out-of-the-box models cannot access proprietary information without risking data leakage.
  • Methods to ingest proprietary data include fine-tuning, long context, and RAG.
  • Long context involves dumping all documents (potentially billions of tokens) into the LLM's context window for every query.
  • Fine-tuning involves updating an LLM's parameters by training it on proprietary data, then using these updated parameters to generate responses.
  • RAG dynamically retrieves a small subset of relevant documents for each query and feeds them to the LLM to generate a response (see the sketch below).
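
To make the contrast with long context concrete, here is a minimal sketch of the retrieve-then-generate loop, assuming the documents have already been embedded into `doc_vectors`; `embed` and `generate` are placeholders for whatever embedding model and LLM are actually used, not a specific vendor API.

```python
import numpy as np

def embed(texts):
    """Placeholder for an embedding model call; returns one vector per text."""
    raise NotImplementedError

def generate(prompt):
    """Placeholder for an LLM call; returns the generated answer."""
    raise NotImplementedError

def rag_answer(query, doc_texts, doc_vectors, k=5):
    # Embed the query into the same vector space as the documents.
    q = embed([query])[0]
    # Cosine similarity between the query and every stored document vector.
    sims = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q) + 1e-9
    )
    # Keep only the top-k most relevant chunks rather than the whole corpus.
    top = np.argsort(-sims)[:k]
    context = "\n\n".join(doc_texts[i] for i in top)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```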

Why RAG is Preferred 02:27

  • The speaker believes fine-tuning is difficult, usually unnecessary, and problematic: it can cause the model to forget knowledge, and it complicates data governance.
  • Long context is considered inefficient and not cost-effective in the long run, akin to scanning an entire library for every question.
  • RAG is preferred for its simplicity, modularity, reliability, speed, and cost-effectiveness, mirroring how humans retrieve relevant information from a library.

RAG Implementation and Recent Improvements 04:38

  • RAG implementation involves embedding models (vectorizing documents and queries), a vector database (storing and searching the vectors), and an LLM (generating answers).
  • Significant improvements in retrieval accuracy have been observed in the last two years, with new models offering better accuracy at lower costs and improved scaling.
  • Current average retrieval accuracy is about 80% across 100 datasets, with some tasks achieving 90-95% accuracy.
  • Matryoshka representation learning and quantization reduce vector storage costs by using lower-dimensional or lower-precision vectors, enabling 10x to 100x savings with minimal performance loss (see the sketch after this list).
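
As a rough illustration of where those savings come from, the sketch below truncates Matryoshka-style embeddings to their leading dimensions and quantizes them to int8; the specific numbers (1024 -> 256 dimensions, float32 -> int8, ~16x smaller) are an assumed example configuration, not figures from the talk.

```python
import numpy as np

def compress(vectors, keep_dims=256):
    """Shrink Matryoshka-style embeddings via truncation plus int8 quantization.

    Matryoshka-trained models concentrate the most useful information in the
    leading dimensions, so truncation costs little accuracy.
    """
    # 1024 -> 256 dimensions: 4x fewer values to store.
    truncated = vectors[:, :keep_dims].astype(np.float32)
    # Re-normalize so cosine similarity still behaves as expected.
    truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)
    # float32 -> int8: another 4x, for ~16x total in this configuration.
    scale = np.abs(truncated).max() / 127.0
    quantized = np.round(truncated / scale).astype(np.int8)
    return quantized, scale

# Example: 1M vectors at 1024 float32 dims take ~4 GB; 256 int8 dims take ~256 MB.
vectors = np.random.randn(1000, 1024).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
quantized, scale = compress(vectors)
```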

Techniques for Better RAG 08:13

  • Using superior embedding models is the simplest way to enhance RAG performance.
  • Hybrid search (combining lexical and vector search), followed by a re-ranker, can improve results (see the sketch after this list).
  • Query and document enhancement techniques include query decomposition (breaking queries into sub-queries) and document enrichment (adding metadata or LLM-generated context to document chunks).
  • Domain-specific embeddings, customized for particular fields like code, offer improved performance and better storage cost-accuracy trade-offs.
  • Fine-tuning embedding models with custom data and using additional retrieval layers (e.g., graph, iterative retrieval) can also enhance RAG.
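
As an example of combining lexical and vector signals with a re-ranking step, the sketch below merges the two result lists with reciprocal rank fusion and then rescores the shortlist; `lexical_search`, `vector_search`, and `rerank` are placeholders for whatever search engine, vector index, and re-ranker model are actually used.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists, rewarding documents that rank highly in any
    list (standard RRF: score = sum over lists of 1 / (k + rank))."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query, lexical_search, vector_search, rerank, top_k=10):
    # Lexical matching (exact terms, IDs, rare words) and embedding similarity
    # surface different documents, so retrieve candidates from both.
    lexical_ids = lexical_search(query, limit=50)   # e.g., BM25 over an inverted index
    vector_ids = vector_search(query, limit=50)     # e.g., ANN search over embeddings
    fused = reciprocal_rank_fusion([lexical_ids, vector_ids])[:50]
    # A cross-encoder style re-ranker scores the small fused shortlist precisely.
    return rerank(query, fused)[:top_k]
```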

Future Vision for RAG 10:43

  • RAG is expected to remain a fundamental component of AI systems due to its efficiency in handling large datasets by selectively retrieving relevant information, similar to human cognition.
  • The future of RAG will see a shift towards more capable "model layers" (embedding models, re-rankers, LLMs), reducing the need for complex "tricks" like intricate parsing and chunking strategies.
  • Multimodal embeddings will dramatically simplify workflows by turning diverse data types (screenshots of documents, tables, video frames) directly into vectors, eliminating separate extraction and embedding steps (see the sketch after this list).
  • Context-aware auto-chunking embeddings will automatically segment long documents, producing vectors that capture both the specific chunk and global information from the surrounding chunks, balancing storage cost against retrieval focus.
  • Future plans include offering fine-tuning APIs so users can fine-tune embedding models on their own data.
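
To illustrate the workflow simplification described above, the sketch below contrasts today's parse-then-embed pipeline with directly embedding page screenshots via a multimodal model; all of the helper functions (`parse_pdf`, `chunk`, `embed_text`, `render_pages`, `embed_image`) are hypothetical placeholders, not an existing product API.

```python
# Today: parse the document, extract text (lossy for tables and figures),
# apply a hand-tuned chunking strategy, then embed each chunk.
def index_pdf_today(pdf_path, parse_pdf, chunk, embed_text):
    text = parse_pdf(pdf_path)
    return [embed_text(c) for c in chunk(text)]

# Envisioned: embed page screenshots directly with a multimodal model,
# skipping parsing, extraction, and chunking heuristics entirely.
def index_pdf_multimodal(pdf_path, render_pages, embed_image):
    return [embed_image(page) for page in render_pages(pdf_path)]
```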