Intro to GraphRAG — Zach Blumenfeld

Workshop Setup and Introduction 00:03

  • Server environments are provided for participants, with access instructions based on attendee numbers
  • Attendees use "attendee" plus their number as both username and password to access Jupyter notebooks and Neo4j browser
  • Workshop materials, links, and slide deck are available via a designated Slack channel
  • The goal of the session is an introductory exploration of GraphRAG (Retrieval-Augmented Generation with a Knowledge Graph) using Neo4j, focusing on graph basics, unstructured data, and building a simple retrieval agent

GraphRAG Overview and Knowledge Graph Motivation 09:09

  • Typical GraphRAG architecture places a knowledge graph between UIs/agents/AI models and data sources
  • Knowledge graphs can ingest both structured (e.g., tables) and unstructured (e.g., documents) data
  • The knowledge graph layer provides improved retrieval logic, transparency, and control, which is especially valuable for agentic workflows that decompose queries
  • Example use case: skills and employee graph for talent search and team formation

Knowledge Graphs, Cypher, and Neo4j Fundamentals 12:06

  • Neo4j models data as a property graph with nodes (nouns), relationships (verbs), and properties (attributes)
  • The Cypher query language expresses traversals and relationships (e.g., MATCH patterns), similar in concept to SQL but optimized for graphs
  • Neo4j supports storage and indexing of embeddings (vectors) for semantic search
  • Analytics features include algorithms for centrality, community detection (e.g., Leiden/Louvain), and pathfinding
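
As a rough illustration of the property graph model described above, the sketch below holds a few nodes and relationships in plain Python and mimics a one-hop MATCH pattern. The Person and Skill labels, the KNOWS relationship type, and the sample values are assumptions for illustration only; in Neo4j the lookup would be a Cypher pattern like the one quoted in the docstring.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                   # the "noun", e.g. Person or Skill
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    rel_type: str                                # the "verb", e.g. KNOWS
    end: Node
    properties: dict = field(default_factory=dict)


# Hypothetical sample data, not taken from the workshop dataset.
alice = Node("Person", {"name": "Alice", "email": "alice@example.com"})
python = Node("Skill", {"name": "Python"})
cypher = Node("Skill", {"name": "Cypher"})
relationships = [
    Relationship(alice, "KNOWS", python),
    Relationship(alice, "KNOWS", cypher),
]


def skills_of(person_name: str) -> list[str]:
    """Rough in-memory equivalent of the Cypher pattern:
    MATCH (p:Person {name: $name})-[:KNOWS]->(s:Skill) RETURN s.name
    """
    return [
        r.end.properties["name"]
        for r in relationships
        if r.rel_type == "KNOWS"
        and r.start.label == "Person"
        and r.start.properties.get("name") == person_name
        and r.end.label == "Skill"
    ]
```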

Loading Structured Data and Graph Construction 20:00

  • Workshop uses a simple dataset: email, name, and list of skills per person
  • Steps include chunking data, setting uniqueness constraints (e.g., email as node key), and loading data as person–knows–skill relationships
  • Visualization tools allow inspection of nodes, relationships, and traversals in the graph
  • Constraints are emphasized as important both for query performance and for accurate merging (so repeated loads do not create duplicate nodes)
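
The loading steps above could be sketched roughly as follows. The Cypher statements are held as strings and are not executed here; the constraint name, labels, property names, and sample rows are assumptions rather than the workshop's actual code.

```python
from typing import Iterator

# Cypher held as strings; running it requires a Neo4j session.
# The names person_email, Person, Skill and KNOWS are assumed for illustration.
CONSTRAINT_QUERY = (
    "CREATE CONSTRAINT person_email IF NOT EXISTS "
    "FOR (p:Person) REQUIRE p.email IS UNIQUE"
)

MERGE_QUERY = """
UNWIND $rows AS row
MERGE (p:Person {email: row.email})
SET p.name = row.name
WITH p, row
UNWIND row.skills AS skill_name
MERGE (s:Skill {name: skill_name})
MERGE (p)-[:KNOWS]->(s)
"""


def chunks(rows: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield successive batches of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Hypothetical rows in the email / name / skills shape described above.
rows = [
    {"email": "a@example.com", "name": "A", "skills": ["Python", "SQL"]},
    {"email": "b@example.com", "name": "B", "skills": ["Java"]},
    {"email": "c@example.com", "name": "C", "skills": ["Python"]},
]
batches = list(chunks(rows, 2))
```

With the Neo4j Python driver, each batch would typically be passed as the rows parameter of MERGE_QUERY after the constraint has been created once.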

Querying the Graph and Analytical Patterns 26:47

  • Various Cypher queries are demonstrated:
    • Counting people per skill
    • Finding people similar by shared skills
    • Multi-hop traversals for deeper relationship discovery
  • DISTINCT is sometimes used to deduplicate query results
  • Measuring similarity can be done via shared skills or by building and storing a relationship (e.g., similar skill set) to avoid repeated computation
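
To make the first two query patterns above concrete, here is a small in-memory sketch with the approximate Cypher shown in comments. The people, skills, and property names are hypothetical stand-ins for the workshop graph.

```python
from collections import Counter
from itertools import combinations

# Hypothetical person -> skills data standing in for (Person)-[:KNOWS]->(Skill).
knows = {
    "ana": {"Python", "SQL", "Cypher"},
    "ben": {"Python", "Java"},
    "cy": {"SQL", "Cypher"},
}

# Roughly:
#   MATCH (p:Person)-[:KNOWS]->(s:Skill)
#   RETURN s.name, count(DISTINCT p) AS people ORDER BY people DESC
people_per_skill = Counter(skill for skills in knows.values() for skill in skills)

# Roughly:
#   MATCH (a:Person)-[:KNOWS]->(s:Skill)<-[:KNOWS]-(b:Person)
#   WHERE a.name < b.name
#   RETURN a.name, b.name, count(DISTINCT s) AS shared
shared_skills = {
    (a, b): len(knows[a] & knows[b])
    for a, b in combinations(sorted(knows), 2)
    if knows[a] & knows[b]
}
```

The shared_skills values are what one might persist on a stored similarity relationship, so later queries read the count instead of recomputing the overlap.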

Graph Analytics: Community Detection and Enrichment 35:54

  • Graph Data Science algorithms such as Leiden are used to identify communities of people with similar skills
  • Clustering helps segment employees (or entities) into groups for tasks like staffing or performance analysis
  • Heatmaps can visualize skill prevalence within communities
  • The value of community detection depends on use case (clustering vs. simple pairwise similarity)
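
The heatmap idea above reduces to a community-by-skill table of prevalence values. The sketch below computes that table from community assignments that are assumed to already exist (for example, as written back by a community detection run); it does not implement Leiden itself, and all names and data are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: community ids as a community detection algorithm might
# assign them, plus each person's skills.
community_of = {"ana": 0, "ben": 1, "cy": 0, "dee": 1}
knows = {
    "ana": {"Python", "SQL"},
    "ben": {"Java", "AWS"},
    "cy": {"Python"},
    "dee": {"Java"},
}


def skill_prevalence(community_of, knows):
    """Return {community: {skill: fraction of members who know the skill}}."""
    members = defaultdict(list)
    for person, community in community_of.items():
        members[community].append(person)

    prevalence = {}
    for community, people in members.items():
        counts = defaultdict(int)
        for person in people:
            for skill in knows.get(person, ()):
                counts[skill] += 1
        prevalence[community] = {s: n / len(people) for s, n in counts.items()}
    return prevalence


prevalence = skill_prevalence(community_of, knows)
```

Each inner dictionary is one heatmap row; a plotting library can render the rows once they are aligned to a common list of skills.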

Data Modeling and Schema Best Practices 41:05

  • Natural-language-like data models (e.g., person–knows–skill) are effective for dynamic agents
  • Simpler models generally yield better agent query generation
  • The number of node labels should be balanced against the use of properties, so the schema does not grow larger than an agent can reason about
  • Schema and example query patterns can be annotated and exposed to agents
  • Persisted schema representations (e.g., JSON annotated with descriptions) help language models generate appropriate queries
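
One possible shape for such an annotated schema is sketched below. The key names (nodes, relationships, example_queries) and the descriptions are an assumed layout for illustration, not a Neo4j format or the workshop's actual file.

```python
import json

# Hypothetical annotated schema for the person-knows-skill model.
schema = {
    "nodes": [
        {"label": "Person", "description": "An employee",
         "properties": {"email": "unique identifier", "name": "full name"}},
        {"label": "Skill", "description": "A skill a person can know",
         "properties": {"name": "skill name"}},
    ],
    "relationships": [
        {"pattern": "(:Person)-[:KNOWS]->(:Skill)",
         "description": "The person has this skill"},
    ],
    "example_queries": [
        {"question": "Who knows Python?",
         "cypher": "MATCH (p:Person)-[:KNOWS]->(:Skill {name: 'Python'}) RETURN p.name"},
    ],
}


def schema_prompt(schema: dict) -> str:
    """Serialize the annotated schema for inclusion in an LLM prompt."""
    return "Graph schema (JSON):\n" + json.dumps(schema, indent=2)


prompt = schema_prompt(schema)
```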

Semantic Similarity and Vector Search in Graphs 44:17

  • Skill and description embeddings are loaded and indexed in the graph
  • Vector search enables finding semantically similar skills even with different descriptions or terminology
  • Semantic relationships can be materialized as edges and visualized in clusters (e.g., Python, Java, cloud skills)
  • Custom scoring formulas can blend hard (exact) and semantic (vector) similarities in retrieval patterns
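
The sketch below illustrates semantic lookup and a blended score in plain Python. It uses brute-force cosine similarity over made-up 3-dimensional vectors as a stand-in for a vector index, and the weighted-sum formula is one assumed way to blend scores, not necessarily the formula used in the workshop.

```python
import math

# Hypothetical 3-dimensional embeddings; real ones come from an embedding model.
skill_vectors = {
    "Python": [0.9, 0.1, 0.0],
    "Java": [0.8, 0.2, 0.1],
    "AWS": [0.1, 0.9, 0.2],
    "Azure": [0.0, 0.95, 0.25],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(skill, k=1):
    """Brute-force nearest neighbours of a skill by cosine similarity."""
    query = skill_vectors[skill]
    scored = [(other, cosine(query, vec))
              for other, vec in skill_vectors.items() if other != skill]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def blended_score(exact_overlap, semantic_sim, weight=0.5):
    """Weighted mix of an exact-match score and a semantic score, both in [0, 1]."""
    return weight * exact_overlap + (1 - weight) * semantic_sim
```

Raising weight favors exact skill matches; lowering it favors skills that are merely close in embedding space.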

Combining Retrieval, Vector Search, and Traversals 50:28

  • Both symbolic (exact match) and vector-based (semantic) methods can be combined in queries
  • Examples include balancing semantic similarity with direct skill overlap when searching for similar people or skills
  • Graph databases support variable-length path queries, enabling discovery of indirect connections (via 0–2 hops, for example)
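
A bounded traversal of the kind mentioned above can be approximated with a breadth-first search that stops at a hop limit. The adjacency data is hypothetical, and the function is only a loose analogue of a Cypher variable-length pattern with a lower bound of zero.

```python
from collections import deque

# Hypothetical undirected adjacency list mixing people and skills.
graph = {
    "ana": {"Python"},
    "Python": {"ana", "ben"},
    "ben": {"Python", "Java"},
    "Java": {"ben", "dee"},
    "dee": {"Java"},
}


def within_hops(start, max_hops=2):
    """Return all nodes reachable from `start` in at most `max_hops` hops.

    Loosely mirrors a Cypher variable-length pattern such as
    MATCH (a {name: $start})-[*0..2]-(b) RETURN DISTINCT b
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)
```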

Unstructured Data: Entity Extraction and Ingestion 55:10

  • This module demonstrates extracting entities (such as people and skills) from free-text bios/resumes using prompts and Pydantic classes
  • The extracted structured data can be loaded into the graph with the same merge techniques used for tabular data
  • More advanced pipelines (document chunking, async processing) are available via dedicated Python packages
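
A dependency-free sketch of the extraction step follows. The workshop uses Pydantic classes; standard-library dataclasses are swapped in here so the example runs without extra packages. The prompt text, field names, and the hand-written reply are all assumptions, and no LLM is actually called.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str


@dataclass
class Person:
    name: str
    email: str
    skills: list[Skill] = field(default_factory=list)


# Hypothetical prompt template; the workshop's actual prompt is not shown here.
PROMPT = (
    "Extract the person and their skills from the bio below. "
    "Reply with JSON: {{\"name\": str, \"email\": str, \"skills\": [str]}}\n\n{bio}"
)


def parse_person(llm_json: str) -> Person:
    """Check the model's JSON reply for required fields and build typed objects."""
    data = json.loads(llm_json)
    for key in ("name", "email", "skills"):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    return Person(
        name=data["name"],
        email=data["email"],
        skills=[Skill(name=s) for s in data["skills"]],
    )


# A hand-written stand-in for what an LLM might return.
reply = '{"name": "Ana Diaz", "email": "ana@example.com", "skills": ["Python", "Cypher"]}'
person = parse_person(reply)
```

The resulting Person object maps directly onto the row shape used for tabular loading, so the same merge query can ingest it.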

Document-Centric Graph Modeling 61:18

  • Beyond entities, document structure (sections, subsections) can be modeled in the graph
  • This enables hierarchical retrieval and traversal across document parts (e.g., for clause extraction or legal analysis)
  • It also facilitates community summaries and search patterns that span metadata and entities within documents
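
As a minimal sketch of document-structure modeling, the code below turns a flattened outline into parent-child pairs that could become relationships in the graph, then walks upward from a clause to its enclosing sections. The outline, the HAS_SECTION name, and the assumptions that titles are unique and depths never skip a level are all illustrative.

```python
# Hypothetical flattened outline: (depth, title) pairs in reading order.
outline = [
    (1, "Master Agreement"),
    (2, "Definitions"),
    (2, "Obligations"),
    (3, "Payment Terms"),
    (3, "Termination"),
]


def section_edges(outline):
    """Return (parent, child) pairs, i.e. candidate HAS_SECTION relationships."""
    edges = []
    stack = []  # titles of the currently open ancestors, one per depth
    for depth, title in outline:
        del stack[depth - 1:]  # close sections at this depth or deeper
        if stack:
            edges.append((stack[-1], title))
        stack.append(title)
    return edges


def ancestors(title, edges):
    """Walk child -> parent links from a section up to the document root."""
    parent_of = {child: parent for parent, child in edges}
    path = []
    while title in parent_of:
        title = parent_of[title]
        path.append(title)
    return path


edges = section_edges(outline)
```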

Building a Simple Retrieval Agent (Module 3) 65:00

  • Demonstration of building a LangGraph agent with a Neo4j backend
  • Four main tools are implemented:
    • Retrieve skills of a person
    • Retrieve similar skills
    • Retrieve similar people (via vector and/or skill overlap)
    • Retrieve people by skill set
  • Tools leverage combinations of Cypher patterns, vector searches, and annotated schema for flexible retrieval
  • Agent responds appropriately to a range of questions, invoking different tools as needed based on natural language prompts
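
The tool layer described above can be pictured as plain functions plus a registry. The sketch below covers three of the four tools over hypothetical in-memory data (similar-skill retrieval is left out because it needs embeddings); the agent framework wiring and the LLM are omitted, and the call_tool helper is an assumed stand-in for how a tool call chosen by the model would be dispatched.

```python
# Hypothetical in-memory data standing in for the Neo4j backend.
KNOWS = {
    "ana": {"Python", "Cypher"},
    "ben": {"Python", "Java"},
    "cy": {"Java"},
}


def skills_of_person(name: str) -> list[str]:
    """Retrieve the skills of a person."""
    return sorted(KNOWS.get(name, set()))


def people_by_skills(skills: list[str]) -> list[str]:
    """Retrieve people who know every skill in the list."""
    wanted = set(skills)
    return sorted(p for p, known in KNOWS.items() if wanted <= known)


def similar_people(name: str) -> list[tuple[str, int]]:
    """Retrieve other people ranked by number of shared skills."""
    mine = KNOWS.get(name, set())
    scored = [(p, len(mine & known)) for p, known in KNOWS.items() if p != name]
    return sorted([s for s in scored if s[1] > 0], key=lambda s: (-s[1], s[0]))


# Tool registry: a framework would expose the names and docstrings to the LLM,
# which is assumed to reply with a tool name plus keyword arguments.
TOOLS = {f.__name__: f for f in (skills_of_person, people_by_skills, similar_people)}


def call_tool(tool_name: str, **kwargs):
    """Dispatch a tool call of the form the LLM is assumed to emit."""
    if tool_name not in TOOLS:
        raise KeyError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](**kwargs)
```

For a question such as "what does ana know?", the model would be expected to select skills_of_person with name set to ana, and the agent would turn the returned list into a natural-language answer.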

Workshop Wrap-up and Resources 76:49

  • Jupyter server access is temporary; all code and datasets are available in a linked GitHub repository
  • Slides and further resources are available via Slack
  • Additional workshops and hands-on demonstrations (including advanced community detection) are offered at the event and at the booth