Speakers introduce themselves: Chang She (CEO and co-founder of LanceDB, co-author of pandas) and Calvin Qi (leads a team at Harvey AI focused on RAG for complex legal documents)
The talk focuses on challenges, solutions, and learnings from deploying Retrieval-Augmented Generation (RAG) systems at scale in the legal domain
Harvey is a legal AI assistant that helps law firms with legal tasks such as drafting, document analysis, and workflows
Handles varying data volumes: on-demand uploads (1–50 documents), larger project-based vaults (aggregating contracts, litigation docs for a specific project), and expansive data corpuses (knowledge bases including laws and regulations globally)
Need to handle extremely large volumes of complex, lengthy, and dense legal documents
Retrieval issues: representing and indexing both sparse (keyword) and dense (semantic) data
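The sparse/dense distinction above can be sketched in plain Python; the whitespace tokenizer and hand-made vectors below are toy stand-ins for a real analyzer and embedding model:

```python
from collections import Counter
import math

def sparse_index(text: str) -> Counter:
    """Sparse (keyword) representation: term -> frequency."""
    return Counter(text.lower().split())

def keyword_score(query: str, doc_index: Counter) -> int:
    """Overlap of query terms with the document's term counts."""
    return sum(doc_index[t] for t in query.lower().split())

def cosine(a, b):
    """Similarity between two dense (embedding) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Sparse scoring rewards exact keyword matches; dense scoring rewards semantic closeness even when wording differs, which is why production systems index both.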
Queries are highly complex and domain-specific, often involving specialist legal language, nested criteria, and referencing specific regulations
Data security and privacy are critical due to confidentiality requirements in legal work
Evaluation of retrieval systems is essential to ensure quality
Retrieval Quality, Query Complexity, and Evaluation Approaches 03:24
Example given of a complex, multi-layered legal query that involves semantic understanding, implicit filtering (e.g., by date), references to specialized datasets, and legal jargon
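One way to picture such a query is as a semantic core plus implicit structured filters. The sketch below is illustrative (the field names are assumptions, not Harvey's internals): metadata filters narrow the candidate set before any semantic ranking runs.

```python
from dataclasses import dataclass, field

@dataclass
class ParsedQuery:
    semantic_text: str                            # free-text part for embedding search
    filters: dict = field(default_factory=dict)   # implicit constraints, e.g. dates

def apply_filters(docs, filters):
    """Narrow candidate documents by metadata before semantic ranking."""
    keep = []
    for doc in docs:
        if "date_after" in filters and doc["date"] <= filters["date_after"]:
            continue
        if "jurisdiction" in filters and doc["jurisdiction"] != filters["jurisdiction"]:
            continue
        keep.append(doc)
    return keep
```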
Evaluation is more critical than algorithmic novelty in this domain; significant time is invested in system validation
Evaluation strategies span:
High fidelity but expensive expert reviews
Semi-automated expert-labeled datasets
Automated metrics (e.g., retrieval precision and recall, deterministic folder/section checks)
Tradeoff between evaluation cost and depth; investing in evaluation-driven development is highlighted as key
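The automated retrieval metrics mentioned above reduce to standard precision and recall at a cutoff k; a minimal sketch:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)
```

Run against expert-labeled relevance sets, these give cheap regression signals between the expensive expert reviews.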
Data Integration, Organization, and Performance 06:15
Supports massive datasets covering legislation and regulations across multiple countries
Involvement of domain experts in classification and filtering, with automation/Large Language Models (LLMs) used where possible
Needs high online (query) and offline (ingestion, experiments) performance; corpuses can contain tens of millions of documents, many very large
Infrastructure must be reliable and scalable for large user bases
Product teams want to focus on business logic and quality, not low-level database tuning
Requires flexibility for customer-specific data privacy, retention policies, database telemetry, and access
Variety in query types: exact, semantic, filtered, and dynamic queries all must be supported with high performance
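A toy in-memory store can illustrate how exact, filtered, and semantic queries differ; this is a hypothetical structure (real systems use inverted and vector indexes rather than linear scans):

```python
import math

class MiniStore:
    """Toy store serving exact, filtered, and semantic queries over the same rows."""

    def __init__(self):
        self.rows = []  # each row: {"id": ..., "vec": [...], "meta": {...}}

    def add(self, row):
        self.rows.append(row)

    def exact(self, field, value):
        """Exact metadata match (e.g. a clause type or document ID)."""
        return [r for r in self.rows if r["meta"].get(field) == value]

    def filtered_semantic(self, qvec, predicate, k=3):
        """Vector ranking restricted to rows passing a metadata predicate."""
        cands = [r for r in self.rows if predicate(r["meta"])]
        cands.sort(key=lambda r: -self._cosine(qvec, r["vec"]))
        return cands[:k]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0
```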
The LanceDB Approach and Technical Architecture 09:12
LanceDB is presented as an AI-native multimodal lakehouse, not just a vector database
Supports storing all AI data (images, audio, embeddings, text, tabular, time-series) in one place, on object storage
Enables batch and online use cases via a distributed architecture: serves from the cloud, separates compute and storage, and offers simple APIs in Python and TypeScript
Supports sophisticated retrieval tasks, including combining vectors with full-text search and reranking
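One common way to merge vector and full-text result lists is Reciprocal Rank Fusion (RRF); the sketch below shows the generic technique, not necessarily LanceDB's built-in reranker:

```python
def rrf(rankings, k=60):
    """Fuse ranked ID lists: docs ranked high by either list float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant k damps the influence of any single list's top ranks, which makes RRF robust without score normalization across the two retrievers.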
For large tables, supports GPU indexing (billions of vectors indexed within 2–3 hours)
Claims significant cost reductions thanks to compute-storage separation and object store usage
Core Data Storage Innovations and Compatibility 12:03
LanceDB is presented as the only database that stores images, videos, audio, embeddings, and tabular data together in one table
All analytical, training, and search workloads can use the same data without duplication or fragmentation
Built on the open-source Lance format, which supports features like fast random access, efficient scans, and storage of large and small data ("blob" and scalar)
Format is compatible with Apache Arrow; can be integrated with tools like Spark, Ray, PyTorch, pandas, and polars
Key Takeaways for Building Domain-Specific RAG Systems 15:08
Domain-specific RAG systems present unique challenges requiring deep understanding of data structure and domain use case patterns
Iteration speed and flexibility are crucial—systems must adapt to fast-changing tools and paradigms
Strong evaluation processes (expert or automated) enable faster, more reliable iteration and quality improvements
New data infrastructure must accommodate multimodal data, heavy use of embeddings, and ever-growing data scales