GraphRAG methods to create optimized LLM context windows for Retrieval — Jonathan Larson, Microsoft

Introduction & Core Principles 00:18

  • Jonathan Larson from Microsoft Research's graph team introduces the GraphRAG paper and its impact.
  • Key enablers for effective AI applications include giving LLM memory structure, and pairing agents with those structures.
  • The presentation covers GraphRAG applied to coding, the new Benchmark QED release, and new results on Lazy GraphRAG.

GraphRAG for Code: Demonstrations & Capabilities 01:55

  • A 200-line, 7-file Python game was used to test code understanding; regular RAG provided a "useless" description.
  • GraphRAG for code provided a much better, semantically rich description, demonstrating its ability to perform "global queries" and understand the entire repository.
  • Direct LLM translation of the Python game to Rust failed to compile, but GraphRAG successfully translated the full game, which then worked natively.
  • GraphRAG was applied to the 100,000-line, 231-file Doom codebase; standard LLMs failed to modify it meaningfully without GraphRAG.
  • It successfully generated high-level, repository-level documentation for Doom, understanding modules across 20-30 files.
  • A GitHub Copilot coding agent wired to GraphRAG successfully added a jump capability to the Doom game. This change requires coordinated multi-file modifications, and other AI agents typically fail at it because they lack a holistic understanding of the codebase.
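
The repository-level understanding described above can be pictured as queries over a dependency graph rather than retrieval of isolated text chunks. Below is a minimal, illustrative sketch (not the actual GraphRAG-for-code implementation): the toy file names and the plain-dict graph are invented, and a real index would also attach entity descriptions and summaries to nodes.

```python
from collections import deque

# Toy "repository graph": nodes are files, edges are import/call
# dependencies. (Hypothetical data for illustration only.)
repo_graph = {
    "game.py":    ["engine.py", "player.py"],
    "engine.py":  ["physics.py", "render.py"],
    "player.py":  ["physics.py"],
    "physics.py": [],
    "render.py":  [],
}

def global_query(graph, start):
    """Breadth-first transitive closure of dependencies: the
    repo-wide context a 'global query' needs, which a plain
    chunk retriever has no way to assemble."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)

print(global_query(repo_graph, "game.py"))
# → ['engine.py', 'physics.py', 'player.py', 'render.py']
```

A multi-file change like adding a jump mechanic would touch several nodes in this closure at once, which is why agents working from isolated chunks tend to fail.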

Benchmark QED: A New Evaluation Framework 09:39

  • Benchmark QED, an open-source tool, is being released to measure and evaluate RAG systems for local and global quality metrics.
  • It has three components: Auto Q (query generation), Auto E (evaluation using LLM as a judge), and Auto D (dataset summarization and sampling).
  • Auto Q generates queries that span local to global understanding, and are either data-driven or persona/activity-driven.
  • An example of a data local question is: "Why are junior doctors in South Korea striking in February 2024?"
  • An example of an activity global question is: "What are the main public health initiatives mentioned that target underserved communities?"
  • Auto E provides a composite score based on comprehensiveness, diversity, empowerment, and relevance.
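
A composite score like Auto E's can be sketched as pairwise LLM-as-judge comparisons aggregated per criterion. The scheme below is an assumption for illustration; the mock judgments are invented, and the released tool's exact aggregation may differ.

```python
# Four quality criteria named in the talk.
CRITERIA = ["comprehensiveness", "diversity", "empowerment", "relevance"]

# Mock pairwise judgments: for each question and criterion, 1 if
# system A's answer beat system B's, else 0. In practice an LLM
# judge produces these.
judgments = [
    {"comprehensiveness": 1, "diversity": 1, "empowerment": 0, "relevance": 1},
    {"comprehensiveness": 1, "diversity": 0, "empowerment": 1, "relevance": 1},
]

def win_rates(judgments):
    """Per-criterion win rate for system A across all questions."""
    n = len(judgments)
    return {c: sum(j[c] for j in judgments) / n for c in CRITERIA}

def composite(judgments):
    """Unweighted mean of the four win rates (assumed aggregation)."""
    rates = win_rates(judgments)
    return sum(rates.values()) / len(rates)

print(win_rates(judgments))
print(composite(judgments))  # → 0.75
```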

Lazy GraphRAG: Performance & Future Applications 12:17

  • Lazy GraphRAG was compared to vector RAG across 8K, 120K, and 1 million token context windows.
  • Lazy GraphRAG showed dominant performance, winning 92%, 90%, and 91% of the time against vector RAG on data local questions, and performed well across all question types.
  • Long context windows did not significantly improve vector RAG's performance against Lazy GraphRAG.
  • Lazy GraphRAG cost roughly one-tenth as much as running vector RAG with a 1-million-token context window.
  • Lazy GraphRAG will be incorporated into Azure Local and the Microsoft Discovery Platform for graph-based scientific co-reasoning.
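
The cost advantage comes from deferring work to query time: classic GraphRAG summarizes every community up front, while the lazy variant only summarizes what a query touches. A toy sketch of that difference, counting stand-in "LLM calls" (the community names and matching logic are invented; the real pipeline uses relevance tests over a community hierarchy):

```python
# Hypothetical corpus partitioned into topic communities.
communities = {
    "rendering": ["doc1", "doc2"],
    "physics":   ["doc3"],
    "input":     ["doc4", "doc5"],
    "audio":     ["doc6"],
}

calls = {"n": 0}

def summarize(docs):
    """Stand-in for one LLM summarization call."""
    calls["n"] += 1
    return " + ".join(docs)

def eager_index():
    """Classic GraphRAG: pre-summarize every community at index time."""
    return {c: summarize(d) for c, d in communities.items()}

def lazy_answer(query_topics):
    """Lazy variant: summarize only the communities a query touches."""
    return {c: summarize(d) for c, d in communities.items()
            if c in query_topics}

calls["n"] = 0
eager_index()
eager_cost = calls["n"]   # 4 calls: every community summarized

calls["n"] = 0
lazy_answer({"physics"})
lazy_cost = calls["n"]    # 1 call: only the relevant community
print(eager_cost, lazy_cost)
```

Indexing cost scales with corpus size in the eager case but with query relevance in the lazy case, which is the mechanism behind the order-of-magnitude savings reported above.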