GraphRAG methods to create optimized LLM context windows for Retrieval — Jonathan Larson, Microsoft

Introduction & Core Principles 00:18

  • Jonathan Larson from Microsoft Research's graph team introduces the GraphRAG paper and its impact.
  • Key enablers for effective AI applications include giving LLM memory structure, and pairing agents with those structures.
  • The presentation covers GraphRAG applied to coding, the new Benchmark QED release, and new results on Lazy GraphRAG.

GraphRAG for Code: Demonstrations & Capabilities 01:55

  • A 200-line, 7-file Python game was used to test code understanding; regular RAG provided a "useless" description.
  • GraphRAG for code provided a much better, semantically rich description, demonstrating its ability to perform "global queries" and understand the entire repository.
  • Direct LLM translation of the Python game to Rust failed to compile, but GraphRAG successfully translated the full game, which then worked natively.
  • GraphRAG was applied to the 100,000-line, 231-file Doom codebase; standard LLMs failed to modify it meaningfully without GraphRAG.
  • It successfully generated high-level, repository-level documentation for Doom, understanding modules across 20-30 files.
  • A GitHub Copilot coding agent wired to GraphRAG successfully added a jump capability to the Doom game. This change requires coordinated multi-file modifications, and other AI agents typically fail at it because they lack a holistic understanding of the codebase.
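
The repository-level understanding described above can be pictured as queries over a dependency graph rather than retrieval of isolated text chunks. Below is a minimal, illustrative sketch (not the actual GraphRAG-for-code implementation): the toy file names and the plain-dict graph are invented, and a real index would also attach entity descriptions and summaries to nodes.

```python
from collections import deque

# Toy "repository graph": nodes are files, edges are import/call
# dependencies. (Hypothetical data for illustration only.)
repo_graph = {
    "game.py":    ["engine.py", "player.py"],
    "engine.py":  ["physics.py", "render.py"],
    "player.py":  ["physics.py"],
    "physics.py": [],
    "render.py":  [],
}

def global_query(graph, start):
    """Breadth-first transitive closure of dependencies: the
    repo-wide context a 'global query' needs, which a plain
    chunk retriever has no way to assemble."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)

print(global_query(repo_graph, "game.py"))
# → ['engine.py', 'physics.py', 'player.py', 'render.py']
```

A multi-file change like adding a jump mechanic would touch several nodes in this closure at once, which is why agents working from isolated chunks tend to fail.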

Benchmark QED: A New Evaluation Framework 09:39

  • Benchmark QED, an open-source tool, is being released to measure and evaluate RAG systems for local and global quality metrics.
  • It has three components: Auto Q (query generation), Auto E (evaluation using LLM as a judge), and Auto D (dataset summarization and sampling).
  • Auto Q generates queries that span local to global understanding, and are either data-driven or persona/activity-driven.
  • An example of a data local question is: "Why are junior doctors in South Korea striking in February 2024?"
  • An example of an activity global question is: "What are the main public health initiatives mentioned that target underserved communities?"
  • Auto E provides a composite score based on comprehensiveness, diversity, empowerment, and relevance.
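
A composite score like Auto E's can be sketched as pairwise LLM-as-judge comparisons aggregated per criterion. The scheme below is an assumption for illustration; the mock judgments are invented, and the released tool's exact aggregation may differ.

```python
# Four quality criteria named in the talk.
CRITERIA = ["comprehensiveness", "diversity", "empowerment", "relevance"]

# Mock pairwise judgments: for each question and criterion, 1 if
# system A's answer beat system B's, else 0. In practice an LLM
# judge produces these.
judgments = [
    {"comprehensiveness": 1, "diversity": 1, "empowerment": 0, "relevance": 1},
    {"comprehensiveness": 1, "diversity": 0, "empowerment": 1, "relevance": 1},
]

def win_rates(judgments):
    """Per-criterion win rate for system A across all questions."""
    n = len(judgments)
    return {c: sum(j[c] for j in judgments) / n for c in CRITERIA}

def composite(judgments):
    """Unweighted mean of the four win rates (assumed aggregation)."""
    rates = win_rates(judgments)
    return sum(rates.values()) / len(rates)

print(win_rates(judgments))
print(composite(judgments))  # → 0.75
```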

Lazy GraphRAG: Performance & Future Applications 12:17

  • Lazy GraphRAG was compared to vector RAG across 8K, 120K, and 1 million token context windows.
  • Lazy GraphRAG showed dominant performance, winning 92%, 90%, and 91% of the time against vector RAG on data local questions, and performed well across all question types.
  • Long context windows did not significantly improve vector RAG's performance against Lazy GraphRAG.
  • Lazy GraphRAG cost roughly one-tenth as much as running vector RAG with a 1-million-token context window.
  • Lazy GraphRAG will be incorporated into Azure Local and the Microsoft Discovery Platform for graph-based scientific co-reasoning.
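
The cost advantage comes from deferring work to query time: classic GraphRAG summarizes every community up front, while the lazy variant only summarizes what a query touches. A toy sketch of that difference, counting stand-in "LLM calls" (the community names and matching logic are invented; the real pipeline uses relevance tests over a community hierarchy):

```python
# Hypothetical corpus partitioned into topic communities.
communities = {
    "rendering": ["doc1", "doc2"],
    "physics":   ["doc3"],
    "input":     ["doc4", "doc5"],
    "audio":     ["doc6"],
}

calls = {"n": 0}

def summarize(docs):
    """Stand-in for one LLM summarization call."""
    calls["n"] += 1
    return " + ".join(docs)

def eager_index():
    """Classic GraphRAG: pre-summarize every community at index time."""
    return {c: summarize(d) for c, d in communities.items()}

def lazy_answer(query_topics):
    """Lazy variant: summarize only the communities a query touches."""
    return {c: summarize(d) for c, d in communities.items()
            if c in query_topics}

calls["n"] = 0
eager_index()
eager_cost = calls["n"]   # 4 calls: every community summarized

calls["n"] = 0
lazy_answer({"physics"})
lazy_cost = calls["n"]    # 1 call: only the relevant community
print(eager_cost, lazy_cost)
```

Indexing cost scales with corpus size in the eager case but with query relevance in the lazy case, which is the mechanism behind the order-of-magnitude savings reported above.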