SUMMARY

Server environments are provided for participants, with access instructions based on attendee numbers
Attendees use "attendee" plus their number as both username and password to access Jupyter notebooks and Neo4j browser
Workshop materials, links, and slide deck are available via a designated Slack channel
The goal of the session is an introductory exploration of GraphRAG (Retrieval-Augmented Generation with a Knowledge Graph) using Neo4j, focusing on graph basics, unstructured data, and building a simple retrieval agent

Typical GraphRAG architecture places a knowledge graph between UIs/agents/AI models and data sources
Knowledge graphs can ingest both structured (e.g., tables) and unstructured (e.g., documents) data
The knowledge graph layer provides improved retrieval logic, transparency, and control, which is especially valuable for agentic workflows that decompose queries
Example use case: skills and employee graph for talent search and team formation

Neo4j models data as a property graph with nodes (nouns), relationships (verbs), and properties (attributes)
The Cypher query language expresses traversals and relationships (e.g., MATCH patterns), similar in concept to SQL but optimized for graphs
Neo4j supports storage and indexing of embeddings (vectors) for semantic search
Analytics features include algorithms for centrality, community detection (e.g., Leiden/Louvain), and pathfinding
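
To make the property-graph vocabulary above concrete, here is a minimal in-memory sketch in Python. The `Node` and `Relationship` classes, the `Person`/`Skill` labels, the `KNOWS` type, and the sample data are illustrative assumptions rather than Neo4j's API; the Cypher string shows roughly what the equivalent MATCH pattern could look like.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                      # the "noun", e.g. Person or Skill
    properties: dict = field(default_factory=dict)  # attributes of the node


@dataclass
class Relationship:
    start: Node
    type: str                                       # the "verb", e.g. KNOWS
    end: Node
    properties: dict = field(default_factory=dict)


alice = Node("Person", {"email": "alice@example.com", "name": "Alice"})
bob = Node("Person", {"email": "bob@example.com", "name": "Bob"})
python = Node("Skill", {"name": "Python"})
cypher = Node("Skill", {"name": "Cypher"})

relationships = [
    Relationship(alice, "KNOWS", python),
    Relationship(alice, "KNOWS", cypher),
    Relationship(bob, "KNOWS", python),
]

# Roughly equivalent Cypher pattern (labels and property names are assumed).
MATCH_QUERY = """
MATCH (p:Person)-[:KNOWS]->(s:Skill {name: $skill})
RETURN p.name AS name
"""


def people_who_know(skill_name, rels):
    """Pure-Python stand-in for the MATCH pattern above."""
    return sorted(
        r.start.properties["name"]
        for r in rels
        if r.type == "KNOWS"
        and r.start.label == "Person"
        and r.end.label == "Skill"
        and r.end.properties.get("name") == skill_name
    )


print(people_who_know("Python", relationships))
```

The point of the sketch is only the shape of the model: the pattern in the query reads almost like the sentence "a person knows a skill".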

Workshop uses a simple dataset: email, name, and list of skills per person
Steps include chunking data, setting uniqueness constraints (e.g., email as node key), and loading data as person–knows–skill relationships
Visualization tools allow inspection of nodes, relationships, and traversals in the graph
Constraints are emphasized as important for both query performance and accurate merging
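
The loading steps above can be sketched as follows, assuming "chunking" here means sending rows in batches. The constraint names, the `Person`/`Skill` labels, the `KNOWS` type, and the `run_query` callable are hypothetical; `run_query` stands in for whatever executes Cypher in the notebook (for example a thin wrapper around the Neo4j Python driver), so the sketch runs without a database.

```python
from itertools import islice

# Assumed labels and property names. IS NODE KEY is an Enterprise Edition
# feature; IS UNIQUE is the usual alternative elsewhere.
CONSTRAINTS = [
    "CREATE CONSTRAINT person_email IF NOT EXISTS "
    "FOR (p:Person) REQUIRE p.email IS NODE KEY",
    "CREATE CONSTRAINT skill_name IF NOT EXISTS "
    "FOR (s:Skill) REQUIRE s.name IS NODE KEY",
]

# MERGE matches on the constrained key, so re-running the load does not
# create duplicate people, skills, or relationships.
MERGE_PEOPLE = """
UNWIND $rows AS row
MERGE (p:Person {email: row.email})
SET p.name = row.name
WITH p, row
UNWIND row.skills AS skill
MERGE (s:Skill {name: skill})
MERGE (p)-[:KNOWS]->(s)
"""


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def load_people(run_query, rows, batch_size=2):
    """Create constraints first, then merge the rows batch by batch."""
    for statement in CONSTRAINTS:
        run_query(statement, {})
    for batch in chunked(rows, batch_size):
        run_query(MERGE_PEOPLE, {"rows": batch})


rows = [
    {"email": "alice@example.com", "name": "Alice", "skills": ["Python", "Cypher"]},
    {"email": "bob@example.com", "name": "Bob", "skills": ["Python", "SQL"]},
    {"email": "cara@example.com", "name": "Cara", "skills": ["Java"]},
]

# Record the calls instead of talking to a database.
calls = []
load_people(lambda query, params: calls.append((query, params)), rows)
print(len(calls), "statements issued")
```

Creating the constraints before the load matters because each constraint is backed by an index, which is what keeps the MERGE lookups fast.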

Various Cypher queries are demonstrated:
- Counting people per skill
- Finding people similar by shared skills
- Multi-hop traversals for deeper relationship discovery
DISTINCT is sometimes used to deduplicate query results
Measuring similarity can be done via shared skills or by building and storing a relationship (e.g., similar skill set) to avoid repeated computation
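
As a rough illustration of the first two query types, the sketch below pairs assumed Cypher strings with pure-Python equivalents over a tiny in-memory dataset. The labels, the `SIMILAR_SKILLSET` relationship name, and the sample people are hypothetical, not taken from the workshop notebook.

```python
from collections import Counter
from itertools import combinations

skills_by_person = {
    "Alice": {"Python", "Cypher", "SQL"},
    "Bob": {"Python", "SQL"},
    "Cara": {"Java", "SQL"},
}

# Assumed Cypher for counting people per skill.
COUNT_PER_SKILL = """
MATCH (p:Person)-[:KNOWS]->(s:Skill)
RETURN s.name AS skill, count(DISTINCT p) AS people
ORDER BY people DESC
"""

# Assumed Cypher for shared-skill similarity, persisted as a relationship
# so the overlap does not have to be recomputed on every query.
STORE_SHARED_SKILLS = """
MATCH (a:Person)-[:KNOWS]->(s:Skill)<-[:KNOWS]-(b:Person)
WHERE a.email < b.email
WITH a, b, count(DISTINCT s) AS shared
MERGE (a)-[r:SIMILAR_SKILLSET]->(b)
SET r.overlap = shared
"""


def people_per_skill(data):
    """Count how many people list each skill."""
    return Counter(skill for skills in data.values() for skill in skills)


def shared_skill_counts(data):
    """Number of skills each pair of people has in common (pairs with none omitted)."""
    return {
        (a, b): len(data[a] & data[b])
        for a, b in combinations(sorted(data), 2)
        if data[a] & data[b]
    }


print(people_per_skill(skills_by_person).most_common())
print(shared_skill_counts(skills_by_person))
```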

Graph Data Science algorithms like Leiden are used to identify communities of similar-skilled people
Clustering helps segment employees (or entities) into groups for tasks like staffing or performance analysis
Heatmaps can visualize skill prevalence within communities
The value of community detection depends on use case (clustering vs. simple pairwise similarity)
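
The numbers behind a skill-prevalence heatmap can be sketched as below. The community ids are hard-coded sample values standing in for the output of an algorithm such as Leiden; the people, skills, and function name are assumptions for illustration.

```python
from collections import defaultdict

# (person, community id, skills). In practice the community id would be
# written to each node by a community detection run.
people = [
    ("Alice", 0, {"Python", "Cypher"}),
    ("Bob", 0, {"Python", "SQL"}),
    ("Cara", 1, {"Java", "SQL"}),
    ("Dan", 1, {"Java"}),
]


def skill_prevalence(rows):
    """Fraction of each community's members that have each skill."""
    members = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for _, community, skills in rows:
        members[community] += 1
        for skill in skills:
            counts[community][skill] += 1
    return {
        community: {
            skill: n / members[community]
            for skill, n in sorted(skill_counts.items())
        }
        for community, skill_counts in sorted(counts.items())
    }


prevalence = skill_prevalence(people)
for community, row in prevalence.items():
    print(community, row)
```

Each inner dictionary is one row of the heatmap: communities on one axis, skills on the other, and the fraction as the cell value.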

Natural-language-like data models (e.g., person–knows–skill) are effective for dynamic agents
Simpler models generally yield better agent query generation
The number of node labels should be balanced against the use of properties to avoid excessive schema cardinality
Schema and example query patterns can be annotated and exposed to agents
Persisted schema representations (e.g., JSON annotated with descriptions) help language models generate appropriate queries
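
One possible shape for such an annotated schema is sketched below. The field names (`nodes`, `relationships`, `example_queries`), the descriptions, and the prompt wording are assumptions made for illustration; the workshop's actual representation may differ.

```python
import json

# Hypothetical annotated schema for the person-knows-skill model.
SCHEMA = {
    "nodes": [
        {
            "label": "Person",
            "description": "An employee who can be staffed on a team.",
            "properties": {"email": "unique identifier", "name": "display name"},
        },
        {
            "label": "Skill",
            "description": "A named capability, e.g. a language or tool.",
            "properties": {"name": "unique skill name"},
        },
    ],
    "relationships": [
        {
            "type": "KNOWS",
            "from": "Person",
            "to": "Skill",
            "description": "The person has this skill.",
        }
    ],
    "example_queries": [
        {
            "question": "Who knows Python?",
            "cypher": "MATCH (p:Person)-[:KNOWS]->(:Skill {name: 'Python'}) "
                      "RETURN p.name",
        }
    ],
}


def schema_prompt(schema):
    """Render the schema as text that can be placed in a model's context."""
    return (
        "You write Cypher for the graph described by this JSON schema. "
        "Use only the labels, relationship types, and properties it lists.\n"
        + json.dumps(schema, indent=2)
    )


print(schema_prompt(SCHEMA))
```

Keeping the schema as plain JSON means it can be stored alongside the code and reused by every tool that needs to generate queries.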

Skill and description embeddings are loaded and indexed in the graph
Vector search enables finding semantically similar skills even with different descriptions or terminology
Semantic relationships can be materialized as edges and visualized in clusters (e.g., Python, Java, cloud skills)
Custom scoring formulas can blend hard (exact) and semantic (vector) similarities in retrieval patterns
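
A blended score of this kind might look like the sketch below. The weighted sum, the `alpha` parameter, and the choice of Jaccard overlap for the exact part are assumptions; the source only says that exact and semantic similarities can be combined in a custom formula.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Exact-match overlap between two skill sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def blended_score(skills_a, skills_b, vec_a, vec_b, alpha=0.5):
    """alpha weights the exact overlap, (1 - alpha) the semantic similarity."""
    return alpha * jaccard(skills_a, skills_b) + (1 - alpha) * cosine(vec_a, vec_b)


# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
score = blended_score(
    {"Python", "SQL"}, {"Python", "Java"},
    [1.0, 0.0], [1.0, 0.0],
)
print(round(score, 3))
```

Setting `alpha` closer to 1 favors people who share literally the same skills, while values closer to 0 favor people whose skills are merely similar in meaning.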

Both symbolic (exact match) and vector-based (semantic) methods can be combined in queries
Examples include balancing semantic similarity with direct skill overlap when searching for similar people or skills
Graph databases support variable-length path queries, enabling discovery of indirect connections (via 0–2 hops, for example)
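
To show what a 0 to 2 hop pattern returns, the sketch below runs a breadth-first search over a small undirected edge list. The data and function name are hypothetical; in Cypher the same idea is written with a quantifier on the relationship, for example `MATCH (a {name: $start})-[*0..2]-(b) RETURN DISTINCT b.name`.

```python
from collections import deque

# Undirected person-skill edges (sample data).
EDGES = [
    ("Alice", "Python"),
    ("Bob", "Python"),
    ("Bob", "Java"),
    ("Cara", "Java"),
]


def within_hops(start, edges, max_hops=2):
    """Names reachable from `start` in at most `max_hops` hops (0 hops = start itself)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in adjacency.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return sorted(distance)


print(within_hops("Alice", EDGES, max_hops=2))
```

With two hops Alice reaches Bob through the shared Python skill; Cara only appears once the limit is raised to four hops.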

One module demonstrates extracting entities (such as people and skills) from free-text bios and resumes using prompts and Pydantic classes
Extracted structured data can be loaded into the graph using the same merge techniques as for tabular data
More advanced pipelines (document chunking, async processing) are available via dedicated Python packages
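
The extraction step can be sketched without any external packages. To stay dependency-free, the sketch uses standard-library dataclasses as a stand-in for the Pydantic classes mentioned above and a hard-coded JSON string in place of a real model response; the class names, prompt wording, and field names are all assumptions.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str


@dataclass
class Person:
    email: str
    name: str
    skills: list = field(default_factory=list)


# Hypothetical prompt; doubled braces survive str.format as literal braces.
PROMPT_TEMPLATE = (
    "Extract the person and their skills from the bio below. "
    'Reply with JSON only, shaped as {{"email": str, "name": str, "skills": [str]}}.'
    "\n\nBio:\n{bio}"
)


def parse_person(raw_json):
    """Validate a model response and convert it to typed objects."""
    data = json.loads(raw_json)
    missing = {"email", "name"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    skills = [Skill(name=s.strip()) for s in data.get("skills", []) if s.strip()]
    return Person(email=data["email"], name=data["name"], skills=skills)


def to_row(person):
    """Flatten to the same row shape used when loading tabular data."""
    return {
        "email": person.email,
        "name": person.name,
        "skills": [s.name for s in person.skills],
    }


prompt = PROMPT_TEMPLATE.format(bio="Alice (alice@example.com) builds graph apps.")
# Stand-in for what a model might return for the prompt above.
fake_response = (
    '{"email": "alice@example.com", "name": "Alice", '
    '"skills": ["Python", " Cypher "]}'
)
print(to_row(parse_person(fake_response)))
```

Because `to_row` produces the same shape as a tabular record, the extracted entities can go through the same MERGE-based load as structured data.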

Beyond entities, document structure (sections, subsections) can be modeled in the graph
This enables hierarchical retrieval and traversal across document parts (e.g., for clause extraction or legal analysis)
It also facilitates community summaries and search patterns that span metadata and entities within documents
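
As a small illustration of modeling document structure, the sketch below turns headings into parent-child pairs that could be loaded as relationships between section nodes. The `#`-style heading format, the root name `Document`, and the sample contract are assumptions; section titles are treated as unique for simplicity.

```python
def section_edges(lines):
    """Return (parent title, child title) pairs from '#'-style headings."""
    stack = [(0, "Document")]  # (heading depth, title); the root is never popped
    edges = []
    for line in lines:
        if not line.startswith("#"):
            continue  # body text is ignored in this sketch
        depth = len(line) - len(line.lstrip("#"))
        title = line[depth:].strip()
        # Climb back up until the top of the stack is a shallower heading.
        while stack[-1][0] >= depth:
            stack.pop()
        edges.append((stack[-1][1], title))
        stack.append((depth, title))
    return edges


document = [
    "# Contract",
    "## Definitions",
    "Terms used in this agreement.",
    "## Obligations",
    "### Payment",
]

for parent, child in section_edges(document):
    print(f"{parent} -> {child}")
```

Each pair could be merged as a relationship between section nodes (the relationship name is left to the data model), which is what makes retrieval from a clause up to its enclosing section possible.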

Demonstration of building a LangGraph agent with a Neo4j backend
Four main tools are implemented:
- Retrieve skills of a person
- Retrieve similar skills
- Retrieve similar people (via vector and/or skill overlap)
- Retrieve people by skill set
Tools leverage combinations of Cypher patterns, vector searches, and annotated schema for flexible retrieval
Agent responds appropriately to a range of questions, invoking different tools as needed based on natural language prompts

Jupyter server access is temporary; all code and datasets are available in a linked GitHub repository
Slides and further resources are available via Slack
Additional workshops and hands-on demonstrations (including advanced community detection) are offered at the event and at the booth

Intro to GraphRAG — Zach Blumenfeld