Agentic GraphRAG: Simplifying Retrieval Across Structured & Unstructured Data — Zach Blumenfeld
Introduction to GraphRAG and Multi-Source Data 00:00
The presentation focuses on Agentic GraphRAG, specifically addressing the integration of multiple data sources, including both unstructured and structured data.
The general architecture places agents and tools around a central knowledge graph, which is populated by extracting data from unstructured documents and by loading structured data through standard ETL pipelines.
The primary value of this knowledge graph architecture lies in enhancing agentic workflows: it supports reasoning and the decomposition of complex questions, going beyond simple vector search.
The graph exposes a simple data model to the agent, which helps it retrieve information accurately and break questions down, and the graph can be expanded as more data is added over time.
The practical example demonstrates an "employee graph" designed as a knowledge assistant for skills analysis, similarity identification, and skill gap detection, initially using PDF resumes as unstructured data.
Resumes are first loaded into a Neo4j graph database as basic document nodes, each with metadata, the resume text, and an embedding.
An agent is constructed within Google's ADK framework, equipped with a single tool for searching documents.
When asked "how many Python developers do I have?", the agent gives an inaccurate answer (e.g., "5") because its only tool is a semantic similarity search that returns the top k documents (k=5), so the count is capped by k rather than reflecting the data.
Similarly, queries such as "who is most similar to a particular person?" yield limited results, because they rely solely on plain semantic similarity with no precise logical control.
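The top-k limitation can be reproduced with a minimal sketch in plain Python. The toy 2-d embeddings, the document ids, and the `top_k` helper are all hypothetical; the talk's demo uses Neo4j vector search, which is not shown here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, docs, k=5):
    """Return the k documents whose embeddings are closest to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["embedding"]), reverse=True)
    return ranked[:k]

# Toy corpus (assumption): 8 resumes close to "Python", 4 that are not.
docs = [{"id": f"py-{i}", "embedding": [1.0, 0.1 * i]} for i in range(8)]
docs += [{"id": f"other-{i}", "embedding": [0.1, 1.0 + i]} for i in range(4)]

hits = top_k([1.0, 0.0], docs, k=5)
naive_count = len(hits)  # what an agent that counts search hits would report
true_count = sum(d["id"].startswith("py-") for d in docs)
print(naive_count, true_count)
```

However many matching resumes exist, counting search hits can never return more than k, which is why the answer lands on the value of k.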
The agent is unable to answer aggregation questions like "summarize my technical talent and skills distribution" because it can only perform searches and lacks the capability for data aggregation or relationship analysis.
Complex relationship queries, such as "who's collaborating on lots of projects," also fail because collaboration details are embedded within the text and not explicitly modeled.
Improving Retrieval with Entity Extraction and Expressive Data Model 05:41
To enhance accuracy, the data model needs to be explicitly defined and explained to the agent, starting with basic concepts like "person," "skills," and "things people do."
Entity extraction workflows, utilizing Pydantic classes for enumerations of accomplishments, domains, and work types, are employed to build a more detailed graph from the documents.
This process decomposes documents into JSON, systematically extracting skills and accomplishments.
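A rough sketch of what such an extraction schema might look like follows. To keep it dependency-free it swaps in standard-library `Enum` and `dataclass` for the Pydantic classes mentioned in the talk; the enum values, field names, and the `parse_extraction` helper are assumptions, not the presenter's actual schema.

```python
import json
from dataclasses import dataclass, field
from enum import Enum

class WorkType(str, Enum):  # hypothetical values
    BUILT = "BUILT"
    SHIPPED = "SHIPPED"
    LED = "LED"

class Domain(str, Enum):  # hypothetical values
    AI = "AI"
    WEB = "WEB"
    DATA_ENGINEERING = "DATA_ENGINEERING"

@dataclass
class Accomplishment:
    work_type: WorkType
    domain: Domain
    description: str

@dataclass
class PersonExtract:
    name: str
    skills: list = field(default_factory=list)
    accomplishments: list = field(default_factory=list)

def parse_extraction(raw_json):
    """Turn extracted JSON into typed objects; unknown enum values raise ValueError."""
    data = json.loads(raw_json)
    accomplishments = [
        Accomplishment(WorkType(a["work_type"]), Domain(a["domain"]), a["description"])
        for a in data.get("accomplishments", [])
    ]
    return PersonExtract(data["name"], data.get("skills", []), accomplishments)

# Example JSON, as an extraction step might emit it (invented data).
raw = json.dumps({
    "name": "Person A",
    "skills": ["Python", "SQL"],
    "accomplishments": [
        {"work_type": "SHIPPED", "domain": "AI", "description": "Shipped a ranking model"}
    ],
})
person = parse_extraction(raw)
print(person.name, person.skills, person.accomplishments[0].work_type.value)
```

Constraining work types and domains to enumerations keeps the extracted values to a fixed vocabulary, so they can become consistent nodes in the graph instead of free text.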
The refined graph model connects individuals to their skills and links their accomplishments to higher-level concepts such as building systems or shipping code.
With this expressive data model, an agent, leveraging an MCP server to interpret the schema and generate Cypher statements, can provide much more precise answers.
For the query "how many Python developers?", the agent now executes a Cypher query matching persons with Python skills, returning a more accurate count (e.g., "28 developers").
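The kind of Cypher the agent might generate for this question can be sketched as below, alongside an in-memory equivalent so the logic can be run without a database. The `Person` and `Skill` labels, the `KNOWS` relationship type, and the sample edges are assumptions; the actual schema from the talk is not given in these notes.

```python
# Hypothetical schema: (:Person)-[:KNOWS]->(:Skill {name}). Labels are assumptions.
COUNT_BY_SKILL = """
MATCH (p:Person)-[:KNOWS]->(s:Skill)
WHERE toLower(s.name) = toLower($skill)
RETURN count(DISTINCT p) AS developers
"""

def count_people_with_skill(knows_edges, skill):
    """In-memory equivalent of the Cypher above, for illustration only."""
    wanted = skill.lower()
    # A set of names mirrors count(DISTINCT p): each person is counted once.
    return len({person for person, s in knows_edges if s.lower() == wanted})

# Invented (person, skill) edges.
knows_edges = [
    ("Person A", "Python"), ("Person A", "SQL"),
    ("Person B", "python"),
    ("Person C", "Java"),
    ("Person D", "Python"), ("Person D", "Python"),  # duplicate edge, counted once
]
developer_count = count_people_with_skill(knows_edges, "Python")
print(developer_count)
```

Unlike the top-k search, the count here is computed over every matching person, so it is not bounded by a retrieval limit.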
In similarity queries (e.g., "who is most similar to Lucas Martinez?"), the agent performs graph-based overlap calculations on skill sets and accomplishments, explaining its reasoning and allowing for auditable control over the results.
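One way to make the overlap calculation concrete is Jaccard similarity over skill sets, sketched below. Jaccard is a stand-in chosen for illustration; the notes do not say which overlap measure the agent uses, and the skill data and the other people's names are invented.

```python
def rank_similar(target, skills_by_person):
    """Rank other people by Jaccard overlap with the target's skills.

    Returns (person, score, shared_skills) tuples so that every score
    can be audited against the skills that produced it.
    """
    target_skills = skills_by_person[target]
    results = []
    for person, skills in skills_by_person.items():
        if person == target:
            continue
        shared = target_skills & skills
        union = target_skills | skills
        score = len(shared) / len(union) if union else 0.0
        results.append((person, score, sorted(shared)))
    # Highest score first; ties broken alphabetically by name.
    return sorted(results, key=lambda r: (-r[1], r[0]))

# Invented skill sets.
skills_by_person = {
    "Lucas Martinez": {"Python", "SQL", "Docker"},
    "Person A": {"Python", "SQL", "Java"},
    "Person B": {"Python"},
    "Person C": {"Go"},
}
ranking = rank_similar("Lucas Martinez", skills_by_person)
for person, score, shared in ranking:
    print(person, round(score, 2), shared)
```

Returning the shared skills alongside the score is what makes the result explainable: the agent can cite exactly which skills drove each match.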
Aggregation questions like "summarize technical talent distribution" can now be answered, providing numerical breakdowns of skills and accomplishments across the workforce.
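An aggregation of this kind amounts to a group-and-count over person-to-skill relationships. The sketch below shows the idea in plain Python with invented edges; in the demo the agent would express it as a Cypher aggregation instead.

```python
from collections import Counter

def skill_distribution(knows_edges):
    """Count how many distinct people hold each skill."""
    # Deduplicate (person, skill) pairs so repeated edges do not inflate counts.
    return Counter(skill for _, skill in set(knows_edges))

# Invented (person, skill) edges.
edges = [
    ("Person A", "Python"), ("Person A", "SQL"),
    ("Person B", "Python"), ("Person B", "SQL"),
    ("Person C", "Python"), ("Person C", "Go"),
    ("Person C", "Go"),  # duplicate edge
]
distribution = skill_distribution(edges)
print(distribution.most_common())
```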
Additional tools can be integrated, enabling the agent to generate flexible Cypher queries for complex traversals, such as finding similar individuals by traversing multiple hops across skills, common systems, domains, and accomplishments.
This results in more flexible, higher-performing queries and more explainable answers that include specific numbers related to skills and domains.
Integrating Structured Data and Expanding the Graph 12:21
The presentation demonstrates adding structured data, specifically internal HR intelligence system data about project collaborations.
A significant advantage of graph databases is that new relationships and node types can be added without extensive data model refactoring, whereas a relational database typically needs schema changes, such as a new join table, to introduce a many-to-many relationship.
For instance, a one-to-many relationship (one person to many accomplishments, as on a resume) can become a many-to-many relationship (multiple people collaborating on the same project) simply by creating new relationships.
New tools can be developed to leverage this expanded graph, such as a tool to identify collaborators working on the same projects within specific domains.
This capability allows the agent to answer complex collaboration questions, like "which individuals have collaborated with each other to deliver the most AI things?", by utilizing the new tool and providing exact, data-driven responses.
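The logic behind such a collaborator tool can be sketched as a pair count over a many-to-many person-to-project relationship. The function name, the edge and domain data, and the idea of tagging each project with a single domain are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def collaboration_counts(worked_on, project_domain, domain):
    """Count, per pair of people, how many projects in `domain` they share."""
    members = defaultdict(set)
    for person, project in worked_on:
        if project_domain.get(project) == domain:
            members[project].add(person)
    pair_counts = Counter()
    for people in members.values():
        # Sorting makes (A, B) and (B, A) map to the same key.
        for pair in combinations(sorted(people), 2):
            pair_counts[pair] += 1
    return pair_counts

# Invented many-to-many edges: several people can work on one project.
worked_on = [
    ("Person A", "p1"), ("Person B", "p1"),
    ("Person A", "p2"), ("Person B", "p2"), ("Person C", "p2"),
    ("Person B", "p3"), ("Person C", "p3"),
]
project_domain = {"p1": "AI", "p2": "AI", "p3": "WEB"}

ai_pairs = collaboration_counts(worked_on, project_domain, "AI")
print(ai_pairs.most_common(1))
```

Because collaboration is modeled as explicit edges instead of being buried in resume text, the answer is an exact count per pair, not a guess from similarity search.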