Graph Intelligence: Enhance Reasoning and Retrieval Using Graph Analytics - Alison & Andreas, Neo4j

Session Goals: Enhancing RAG with Graph Data Science 03:20

  • The session aims to demonstrate how graph data science can improve RAG applications by providing new ways to understand and access data.
  • The goal is for attendees to become comfortable with basic graph algorithms and identify applications for their projects.
  • AI engineers are concerned with both the "pipes" (infrastructure) and the "water" (data quality) in their systems.

RAG Challenges and Graph Basics 05:28

  • Common RAG challenges include getting the right data, effective chunking strategies, handling temporal data, and ensuring data quality.
  • Understanding relationships among data and managing data volume are significant challenges.
  • Knowledge graphs combine structured and unstructured data, finding implicit knowledge within documents to create relationships.
  • A graph consists of nodes (entities/nouns), relationships (connections between entities), and properties (details on nodes or relationships).
  • Embeddings (vector representations of text semantics) can be stored as properties on graph nodes, allowing direct connection from embedding to all related information.
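A minimal Cypher sketch of this idea (the `Document` and `Entity` labels, the `MENTIONS` relationship, and the `embedding`/`text` property names are illustrative, not the session's exact schema):

```cypher
// Store the chunk text and its embedding on the same node.
CREATE (d:Document {text: 'Graphs connect entities...', embedding: [0.12, -0.07, 0.33]});

// Because the embedding lives on the node, a match on it reaches
// all related information in a single traversal.
MATCH (d:Document)-[:MENTIONS]->(e:Entity)
RETURN d.text, collect(e.name) AS relatedEntities;
```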

Graph for Application Management 13:57

  • The "application graph" or "memory graph" tracks system activity, including messages, user prompts, responses, and context documents.
  • This graph helps manage massive volumes of documents and data by monitoring what's happening in the application.
  • It allows for identifying important documents, understanding what influences outcomes, and managing documents at scale.
  • The overall graph connects the application flow (memory graph) to chunks and then to domain-specific structured or extracted unstructured data.

Scaling and Performance Discussions 18:51

  • For large graphs, it's recommended to start small, iterate on the schema (ontology), and avoid over-specifying relationship types.
  • The structure of the data model should be driven by the types of questions users will ask and evaluation results.
  • Comparing hierarchical tree-based structures (like Raptor indexes) with knowledge graphs depends on the specific use case and evaluation metrics.

Power of Graphs in RAG 23:05

  • Graph RAG and graph analytics greatly expand the number of answerable questions for the same base dataset.
  • Pairwise connections between n chunks create on the order of n-squared possible new pieces of information (n(n-1)/2 pairs).
  • Community detection within the connected chunks identifies meaningful subsets of chunks, pushing the space of answerable questions toward exponential growth.
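In rough terms, the growth the speakers describe: for n chunks, the pairwise space is quadratic, while the subset space that community detection selects meaningful groups from is exponential:

```latex
\underbrace{\binom{n}{2} = \frac{n(n-1)}{2}}_{\text{possible pairwise links}} \sim n^{2},
\qquad
\underbrace{2^{n}}_{\text{possible chunk subsets}}
```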

Practical Setup: Neo4j Aura and Data Loading 25:28

  • Attendees are guided to set up a free Neo4j AuraDB Professional trial instance via console.neo4j.io.
  • A pre-populated Neo4j dump file is provided for quick data loading into the new instance using the backup and restore feature.

Exploring the Database and Data Model 32:21

  • Basic Cypher queries (e.g., `MATCH (n) RETURN n LIMIT 10`) are used to explore the graph data.
  • The data model includes Session (user login), Message (user prompt/assistant response), Prompt, Response, and Context Document (chunks).
  • The graph visualizes conversation flow, showing how prompts connect to responses and which context documents are used.
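Exploration along these lines might look like the following (the `HAS_MESSAGE` and `RETRIEVED` relationship types are illustrative guesses at the model, not confirmed names from the session):

```cypher
// Peek at arbitrary nodes to get a feel for the data.
MATCH (n) RETURN n LIMIT 10;

// Follow one conversation: a session, its messages, and the
// context documents each message retrieved.
MATCH (s:Session)-[:HAS_MESSAGE]->(m:Message)-[:RETRIEVED]->(d:Document)
RETURN s, m, d LIMIT 25;
```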

Q&A on Performance and Multimodal Data 37:06

  • Chunk size generally doesn't impact performance significantly unless chunks are megabytes large.
  • Storing vectors directly as properties on graph nodes eliminates the need for a separate vector database.
  • For very large datasets (hundreds of millions of nodes), scaling considerations become more critical.
  • Multiple vector indices can be created for different use cases, filtering, or governance (e.g., making certain data available only to specific users).
  • For multimodal data (e.g., images), large media files are stored externally (e.g., S3 bucket), and their embeddings are stored as properties on graph nodes, with URLs linking to the external files.
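A sketch of both points in Cypher, assuming an `embedding` property and a 1536-dimension vector (the names, dimension, and S3 path are assumptions for illustration):

```cypher
// A vector index over document embeddings; additional indices can be
// created per label or use case for filtering and governance.
CREATE VECTOR INDEX document_embeddings IF NOT EXISTS
FOR (d:Document) ON (d.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};

// Multimodal pattern: the media file stays in object storage; the node
// holds only the embedding plus a URL pointing at the external file.
CREATE (i:Image {url: 's3://my-bucket/figure-01.png', embedding: $imageEmbedding});
```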

Connecting to Graph and Summary Statistics 51:15

  • The "get to know your graph" Jupyter notebook demonstrates connecting to the Neo4j database using the Python driver.
  • Initial database statistics show 17,000 nodes, 774,000 relationships, 7 labels, and 27 relationship types in the loaded dataset.
  • Nodes can have multiple labels, allowing for flexible querying (e.g., filtering for all messages, or specifically for assistant responses).
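Summary statistics like these can be gathered with plain Cypher (the `Message`/`Response` multi-label combination is an assumption about the dataset's schema):

```cypher
// Node counts per label and relationship counts per type.
MATCH (n) RETURN labels(n) AS labels, count(*) AS nodes ORDER BY nodes DESC;
MATCH ()-[r]->() RETURN type(r) AS relType, count(*) AS rels ORDER BY rels DESC;

// Multiple labels allow filtering at either granularity:
// all messages, or only assistant responses.
MATCH (m:Message:Response) RETURN count(m) AS assistantResponses;
```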

Graph Analytics: Connect, Cluster, Curate 58:31

  • The "Connect, Cluster, Curate" approach helps manage RAG applications at scale and improve outcomes.
  • Connect: Running K-Nearest Neighbors (KNN) on document embeddings to create similarity relationships between documents.
  • Cluster: Using community detection algorithms (like Louvain) to group similar documents into "communities."
  • Curate: Leveraging these techniques to refine and manage the grounding dataset.
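The Connect and Cluster steps map onto the Graph Data Science library roughly as follows (the projection name `docs`, `topK` value, and relationship/property names are assumptions, not the notebook's exact parameters):

```cypher
// Project document nodes and their embeddings into the in-memory catalog.
CALL gds.graph.project('docs', {Document: {properties: 'embedding'}}, '*');

// Connect: add a SIMILAR relationship to each document's nearest neighbors.
CALL gds.knn.mutate('docs', {
  nodeProperties: ['embedding'], topK: 5,
  mutateRelationshipType: 'SIMILAR', mutateProperty: 'score'
});

// Cluster: Louvain community detection over the similarity relationships.
CALL gds.louvain.write('docs', {
  relationshipTypes: ['SIMILAR'], writeProperty: 'community'
});
```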

Understanding Community Detection Algorithms 1:02:01

  • Louvain Algorithm: Based on modularity optimization, it identifies communities whose internal connections are dense while connections between communities are sparse. It iteratively reassigns nodes to communities to maximize modularity.
  • Label Propagation Algorithm: A fast clustering algorithm that seeds each node with a label and then repeatedly propagates labels to neighbors, converging on communities based on connection density.
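Label Propagation can be streamed without writing back, as a quick sanity check on cluster sizes (this assumes an in-memory projection named `docs` with `SIMILAR` relationships already exists):

```cypher
// Stream community assignments and summarize the largest clusters.
CALL gds.labelPropagation.stream('docs', {relationshipTypes: ['SIMILAR']})
YIELD nodeId, communityId
RETURN communityId, count(*) AS members
ORDER BY members DESC LIMIT 10;
```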

Applying KNN and Document Analysis 1:07:01

  • A graph projection is created to run KNN specifically on document nodes, including their embeddings, to build similarity connections.
  • Visualizing these connections often results in a "hairball" graph, but pockets of density (clusters) can be observed.
  • Highly similar document clusters (e.g., an average similarity of 0.98 among 49 documents in community 4702) can indicate redundancy.
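Redundant pockets like community 4702 can be surfaced by aggregating similarity scores per community (the `SIMILAR` relationship, `score`, and `community` properties are assumed to have been written by earlier KNN and Louvain steps):

```cypher
// Average pairwise similarity within each community; values near 1.0
// among many documents suggest near-duplicate content.
MATCH (a:Document)-[s:SIMILAR]->(b:Document)
WHERE a.community = b.community
RETURN a.community AS community,
       count(s) AS links,
       round(avg(s.score), 3) AS avgSimilarity
ORDER BY avgSimilarity DESC LIMIT 10;
```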

Improving RAG with Graph Insights 1:18:21

  • A high-quality grounding dataset for RAG should be relevant, augmenting, reliable, and efficient.
  • High similarity among documents can be inefficient, as a retriever might return many identical or near-identical chunks, limiting context.
  • Reranking strategies can be employed to increase diversity in responses (e.g., weighting for diversity or PageRank) to avoid "echo chambers."
  • Collapsing Similar Documents: The `apoc.nodes.collapse` procedure can merge highly similar document nodes into a single node, maintaining original relationships (lineage) and improving retrieval efficiency without losing source information.
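A hedged sketch of the collapse step for one redundant community (the community id comes from the earlier analysis; the config map shown is one of the documented options, and the exact settings used in the session are not confirmed):

```cypher
// Collapse a community of near-duplicate documents into a single node,
// combining their properties while keeping the original relationships.
MATCH (d:Document {community: 4702})
WITH collect(d) AS dupes
CALL apoc.nodes.collapse(dupes, {properties: 'combine'})
YIELD from, rel, to
RETURN from, rel, to;
```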

Analyzing Conversation Flow and Content Hygiene 1:32:01

  • Tracking how users "travel" through communities of documents in conversations provides insights into human cognition and effective information paths.
  • Graph algorithms (e.g., Betweenness Centrality, PageRank) can identify influential or problematic documents within the graph.
  • Analyzing good/bad ratings by community can pinpoint areas of knowledge that are outdated or problematic, requiring content review.
  • Co-occurrence relationships can be created between documents that are retrieved together for a given question, revealing natural affinities between pieces of information.
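Co-occurrence relationships can be materialized with a query along these lines (the `Prompt` label, `RETRIEVED` relationship, and `CO_OCCURS` name are assumptions about the schema):

```cypher
// Documents retrieved together for the same prompt get a co-occurrence
// link; the count tracks how often the pair appears together.
MATCH (p:Prompt)-[:RETRIEVED]->(d1:Document),
      (p)-[:RETRIEVED]->(d2:Document)
WHERE elementId(d1) < elementId(d2)
MERGE (d1)-[c:CO_OCCURS]-(d2)
ON CREATE SET c.count = 1
ON MATCH SET c.count = c.count + 1;
```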

Broader Implications and Q&A 1:46:01

  • In sensitive domains like DoD and federal government, graph technologies enhance accountability and traceability of AI systems, emphasizing "trust but verify."
  • Automatic Knowledge Graph Creation: Neo4j's KG Builder uses LLMs to extract named entities and relationships from unstructured documents, allowing the ontology to "rise organically" for experimentation or when domain expertise is limited.
  • Graph analytics can help content writers understand how content is consumed by agents and how to produce content that aids autonomous agents.
  • Graph analysis can potentially identify content hygiene issues like contradictions or inaccuracies by examining connected information.
  • Combining text embeddings with node embeddings (representing the structure and relationships of a node) can lead to more nuanced retrieval and analysis.