Parsing
Converts non-text resources (PDFs, videos, images) to text (markdown), making them LLM-legible
Vendors chosen for parsing: LlamaParse for documents/images, Firecrawl for websites, Cloudglue for audio/video
Selection criteria prioritized support for the needed resource types, markdown output, and webhooks; accuracy, comprehensiveness, and cost were initially deprioritized
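Webhooks matter here because parsing a large PDF or video is asynchronous: the vendor accepts the job, works on it, and calls back when markdown is ready. A minimal sketch of such a callback handler; the payload fields and handler name are hypothetical, not any vendor's actual schema:

```python
# Hypothetical webhook handler: a parsing vendor POSTs a payload when an
# async parse job finishes, and we file the resulting markdown for chunking.
# The payload shape below is illustrative, not any vendor's real schema.

def handle_parse_webhook(payload: dict, store: dict) -> str:
    """Record a finished parse job's markdown in our document store."""
    job_id = payload["job_id"]
    status = payload["status"]
    if status != "completed":
        # Failed or still-running jobs are recorded for retry/monitoring.
        store[job_id] = {"status": status, "markdown": None}
        return status
    store[job_id] = {
        "status": "completed",
        "resource_type": payload.get("resource_type", "document"),
        "markdown": payload["markdown"],  # text ready for the chunking stage
    }
    return "completed"
```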
Chunking
Breaks down markdown blobs into semantic entities for embedding and retrieval
Uses a combination of splitting by markdown headers, sentences, and tokens to preserve logical structure and prevent overly long chunks
Storage
Chose Pinecone for the vector database due to its ease of use, bundled embedding models, and strong customer support
Explored other storage options before settling on a vector DB for efficient similarity search
Retrieval
Adopted evolving RAG (retrieval-augmented generation) practices, moving from traditional to agentic to deep research RAG
Uses Letta (cloud agent provider) to create a deep research agent that plans, retrieves necessary context, and synthesizes responses for Alice
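The plan, retrieve, synthesize loop can be sketched as below. The planner, retriever, and synthesizer here are stubs; in the real system the agent is hosted on the cloud agent platform, an LLM fills each role, and retrieval hits the vector DB rather than a keyword match.

```python
# Hedged sketch of a deep-research RAG loop: decompose the question,
# gather context per sub-question, then synthesize one answer.
# All three stages are stubs standing in for LLM calls and vector search.

def plan(question: str) -> list[str]:
    # Stub planner: a real agent asks an LLM to decompose the question.
    return [f"background on: {question}", f"specifics of: {question}"]

def retrieve(subquery: str, store: dict) -> list[str]:
    # Stub retriever: keyword match stands in for vector similarity search.
    return [text for key, text in store.items() if key in subquery]

def synthesize(question: str, evidence: list[str]) -> str:
    # Stub synthesizer: a real agent prompts an LLM with the evidence.
    return f"Answer to '{question}' based on {len(evidence)} retrieved chunks."

def deep_research(question: str, store: dict) -> str:
    evidence: list[str] = []
    for subquery in plan(question):                 # 1) plan sub-questions
        evidence.extend(retrieve(subquery, store))  # 2) gather context
    return synthesize(question, evidence)           # 3) synthesize a response
```

The difference from traditional RAG is the planning step: instead of one embed-and-retrieve pass over the raw question, the agent decides what it needs to look up before answering.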
Visualization
Introduced interactive 3D visualization of the knowledge base, showing how context is stored and retrieved
Allows users to examine the AI's "brain", clicking on vectors to view associated content, increasing transparency and trust
Integrated with UI to let users upload resources, query Alice, and inspect campaign knowledge in Q&A form
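To render high-dimensional embeddings as clickable points in 3D, they must first be projected down to three coordinates. The source doesn't say which projection the team used; PCA via SVD is one common choice, sketched here with NumPy:

```python
import numpy as np

def project_to_3d(embeddings: np.ndarray) -> np.ndarray:
    """Project (n_chunks, dim) embedding vectors to (n_chunks, 3) via PCA,
    so each chunk can be drawn as a point in the 3D knowledge-base view.
    PCA is an assumption here; t-SNE or UMAP are common alternatives."""
    centered = embeddings - embeddings.mean(axis=0)
    # Top 3 principal directions from the SVD of the centered matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:3].T
```

Each projected point keeps its record id, so a click in the UI can look up the chunk's text and metadata in the vector store.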
Takeaways
RAG implementation is more complex than expected, with many micro-decisions and technical challenges
Recommend delivering a working production version before benchmarking and optimizing
Leverage vendor expertise and support during development
Upcoming focus areas: tracking hallucinations in emails, evaluating parsing vendors for accuracy/completeness, experimenting with hybrid RAG (graph + vector DB), and cost reduction across the pipeline