Scaling Enterprise-Grade RAG: Lessons from Legal Frontier - Calvin Qi (Harvey), Chang She (Lance)

Introduction and Overview 00:03

  • Speakers introduce themselves: Chang She (CEO/co-founder of LanceDB and co-author of pandas) and Calvin Qi (leads a team at Harvey AI focused on RAG for complex legal documents)
  • The talk focuses on challenges, solutions, and learnings from deploying Retrieval-Augmented Generation (RAG) systems at scale in the legal domain

Use Cases and Data Scales at Harvey 01:19

  • Harvey is a legal AI assistant for law firms to help with legal tasks such as drafting, analyzing documents, and workflows
  • Handles varying data volumes: on-demand uploads (1–50 documents), larger project-based vaults (aggregating contracts, litigation docs for a specific project), and expansive data corpuses (knowledge bases including laws and regulations globally)

Core RAG Challenges in Legal Domain 02:22

  • Need to handle extremely large volumes of complex, lengthy, and dense legal documents
  • Retrieval challenges: representing and indexing documents in both sparse (keyword) and dense (semantic) forms
  • Queries are highly complex and domain-specific, often involving specialist legal language, nested criteria, and referencing specific regulations
  • Data security and privacy are critical due to confidentiality requirements in legal work
  • Evaluation of retrieval systems is essential to ensure quality

Retrieval Quality, Query Complexity, and Evaluation Approaches 03:24

  • Example given of a complex, multi-layered legal query that involves semantic understanding, implicit filtering (e.g., by date), references to specialized datasets, and legal jargon
  • Evaluation is more critical than algorithmic novelty in this domain; significant time is invested in system validation
  • Evaluation strategies span:
    • High fidelity but expensive expert reviews
    • Semi-automated expert-labeled datasets
    • Automated metrics (e.g., retrieval precision and recall, deterministic folder/section checks)
  • Tradeoff between evaluation cost and depth; investing in evaluation-driven development is highlighted as key
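The automated metrics mentioned above (retrieval precision and recall) are straightforward to compute once expert-labeled relevance sets exist. A minimal pure-Python sketch, using hypothetical document IDs:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Compute precision@k and recall@k for a single query.

    retrieved: ranked list of document IDs returned by the retriever
    relevant:  set of document IDs labeled relevant by experts
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical expert-labeled example: 3 relevant docs, 2 found in top 4
p, r = precision_recall_at_k(
    retrieved=["d7", "d2", "d9", "d4"],
    relevant={"d2", "d4", "d5"},
    k=4,
)
# p = 0.5 (2 of 4 retrieved are relevant), r ≈ 0.667 (2 of 3 relevant found)
```

Averaging these per-query scores over an expert-labeled query set gives the semi-automated evaluation signal the talk describes, at a fraction of the cost of full expert review.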

Data Integration, Organization, and Performance 06:15

  • Supports massive datasets covering legislation and regulations across multiple countries
  • Involvement of domain experts in classification and filtering, with automation/Large Language Models (LLMs) used where possible
  • Needs high online (query) and offline (ingestion, experiments) performance; corpuses can contain tens of millions of documents, many very large

Infrastructure Needs and Requirements 07:15

  • Infrastructure must be reliable and scalable for large user bases
  • Product teams want to focus on business logic and quality, not low-level database tuning
  • Requires flexibility for customer-specific data privacy, retention policies, database telemetry, and access
  • Variety in query types: exact, semantic, filtered, and dynamic queries all must be supported with high performance
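The query variety above can be made concrete with a toy in-memory index. This is an illustrative sketch only (2-d stand-in vectors and invented documents, not Harvey's or LanceDB's implementation), showing exact, semantic, and metadata-filtered lookups:

```python
import math

# Toy corpus: text, a metadata year, and a tiny stand-in "embedding"
DOCS = [
    {"id": "a", "text": "merger agreement", "year": 2021, "vec": (1.0, 0.0)},
    {"id": "b", "text": "employment contract", "year": 2023, "vec": (0.0, 1.0)},
    {"id": "c", "text": "merger side letter", "year": 2023, "vec": (0.9, 0.1)},
]

def exact(term):
    """Exact (keyword) query: substring match over the text field."""
    return [d["id"] for d in DOCS if term in d["text"]]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def semantic(qvec, k=2):
    """Semantic query: rank by cosine similarity to the query vector."""
    ranked = sorted(DOCS, key=lambda d: cosine(qvec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k]]

def filtered(qvec, year, k=2):
    """Filtered query: metadata predicate applied before vector ranking."""
    pool = [d for d in DOCS if d["year"] == year]
    ranked = sorted(pool, key=lambda d: cosine(qvec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k]]
```

A production system answers all of these against the same store with indexes for each access pattern, which is exactly why the talk stresses supporting them with uniformly high performance.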

The LanceDB Approach and Technical Architecture 09:12

  • LanceDB is presented as an AI-native multimodal lakehouse, not just a vector database
  • Supports storing all AI data (images, audio, embeddings, text, tabular, time-series) in one place, on object storage
  • Enables batch and online use cases via a distributed architecture: serves from cloud object storage, separates compute and storage, and offers simple APIs in Python/TypeScript
  • Supports sophisticated retrieval tasks, including combining vectors with full-text search and reranking
  • For large tables, supports GPU indexing (billions of vectors indexed within 2–3 hours)
  • Claims significant cost reductions thanks to compute-storage separation and object store usage
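Combining vector results with full-text results, as in the hybrid retrieval described above, is commonly done with reciprocal rank fusion (RRF). The sketch below is a generic pure-Python illustration of RRF over two hypothetical ranked lists, not LanceDB's internal implementation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank).

    rankings: list of ranked document-ID lists (best first)
    k: smoothing constant; 60 is the conventional default
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results where the two retrievers disagree on ordering
vector_hits = ["d3", "d1", "d7"]  # dense (semantic) search
fts_hits = ["d1", "d9"]           # sparse (full-text) search
fused = reciprocal_rank_fusion([vector_hits, fts_hits])
# → ["d1", "d3", "d9", "d7"]: d1 wins because both retrievers rank it highly
```

Rank-based fusion needs no score calibration between the two retrievers, which is one reason it is a popular default before an optional reranking stage.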

Core Data Storage Innovations and Compatibility 12:03

  • Presented as the only database that allows storing images, videos, audio, embeddings, and tabular data together in one table
  • All analytical, training, and search workloads can use the same data without duplication or fragmentation
  • Built on the open-source Lance format, which supports features like fast random access, efficient scans, and storage of large and small data ("blob" and scalar)
  • Format is compatible with Apache Arrow and integrates with tools like Spark, Ray, PyTorch, pandas, and Polars

Key Takeaways for Building Domain-Specific RAG Systems 15:08

  • Domain-specific RAG systems present unique challenges requiring deep understanding of data structure and domain use case patterns
  • Iteration speed and flexibility are crucial—systems must adapt to fast-changing tools and paradigms
  • Strong evaluation processes (expert or automated) enable faster, more reliable iteration and quality improvements
  • New data infrastructure must accommodate multimodal data, heavy use of embeddings, and ever-growing data scales

Conclusion 16:32

  • Thanks and closing remarks