Assessing retrieval quality involves various strategies: guessing, using LLMs as judges (which can be expensive and slow), or relying on public benchmarks.
The speaker advocates for "fast evals": quick, inexpensive, empirical evaluations that use query-document pairs ("golden datasets") to check whether the expected documents are retrieved.
Fast evals enable rapid experimentation and systematic improvement.
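A minimal sketch of such a fast eval, assuming a `retrieve(query, k)` callable that wraps your retriever; the queries and document IDs below are illustrative placeholders, not the speaker's actual dataset:

```python
# Minimal "fast eval": hit rate over a hand-built golden dataset.
# `retrieve` stands in for your retriever (vector store, BM25, hybrid, etc.).

from typing import Callable

# Golden dataset: each query maps to the document ID we expect back (illustrative).
golden_pairs = [
    {"query": "How do I log a confusion matrix?", "expected_doc_id": "docs/plots/confusion-matrix"},
    {"query": "How do I resume a crashed run?", "expected_doc_id": "docs/runs/resuming"},
]

def hit_rate_at_k(retrieve: Callable[[str, int], list[str]], k: int = 5) -> float:
    """Fraction of golden queries whose expected document appears in the top-k results."""
    hits = 0
    for pair in golden_pairs:
        retrieved_ids = retrieve(pair["query"], k)
        if pair["expected_doc_id"] in retrieved_ids:
            hits += 1
    return hits / len(golden_pairs)

# Example usage: print(f"hit@5 = {hit_rate_at_k(my_retriever, k=5):.2%}")
```

Because the check is a plain loop over a small fixed dataset, it runs in seconds and can be re-run after every retriever change.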
Synthetic queries can be generated by LLMs, but require careful design to avoid unrealistic scenarios.
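One possible way to generate such queries, sketched with the OpenAI Python SDK; the model name and prompt are illustrative choices, and the prompt deliberately steers toward short, realistic user questions rather than paraphrases of the chunk:

```python
# Sketch: generate one synthetic query per document chunk with an LLM.
# Model and prompt are assumptions; tune the prompt so queries resemble
# what real users would actually type (short, task-oriented, no copied phrases).

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthetic_query(chunk: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Write one short, realistic user question "
             "that this documentation chunk would answer. Do not copy exact phrases."},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content.strip()

# Each (synthetic_query(chunk), chunk_id) pair becomes a row in the golden dataset.
```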
Comparisons with public benchmarks (like MTEB) may show new embedding models excelling overall, but that does not guarantee they perform well on your own data.
Empirical testing on your dataset can reveal which embedding model actually performs best for your specific application.
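A sketch of that comparison: run the same golden dataset through each candidate embedding model and keep whichever scores highest on your data. The `EMBEDDERS` mapping and embedding functions are placeholders for your own models:

```python
# Sketch: score several embedding models on the same golden dataset.
# Each embedder is any function mapping a list of texts to a matrix of vectors.

import numpy as np

def recall_at_k(embed, queries, expected_idx, corpus, k=5):
    """expected_idx[i] is the corpus index that should be retrieved for queries[i]."""
    q_vecs = np.asarray(embed(queries), dtype=float)
    d_vecs = np.asarray(embed(corpus), dtype=float)
    # Cosine similarity via normalized dot products.
    q_vecs /= np.linalg.norm(q_vecs, axis=1, keepdims=True)
    d_vecs /= np.linalg.norm(d_vecs, axis=1, keepdims=True)
    sims = q_vecs @ d_vecs.T
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = [expected_idx[i] in top_k[i] for i in range(len(queries))]
    return sum(hits) / len(hits)

# EMBEDDERS = {"model-a": embed_a, "model-b": embed_b}  # your candidate models
# for name, embed in EMBEDDERS.items():
#     print(name, recall_at_k(embed, queries, expected_idx, corpus))
```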
Case study: for the Weights & Biases chatbot, empirical tests showed that Voyage 3 large performed best, contrary to what public benchmark rankings suggested.
Full report, open source code, and further resources are made available for independent experimentation.
Output analysis is essential for understanding how users interact with systems, particularly when manual review becomes impractical due to scale.
Conversation logs contain rich feedback: user frustrations, retry attempts, and other implicit signals can be extracted directly from the logs.
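A rough sketch of extracting such signals, assuming a simple log schema (a `messages` list with `role` and `content` fields) and illustrative keyword heuristics; real pipelines would tune these rules or use an LLM classifier:

```python
# Sketch: pull implicit feedback signals out of conversation logs with heuristics.
# Field names and the frustration keyword list are assumptions about your schema.

import re

FRUSTRATION_PATTERNS = re.compile(
    r"(that's not what I asked|didn't work|wrong|try again|still broken)", re.I
)

def implicit_signals(conversation: dict) -> dict:
    user_turns = [m["content"] for m in conversation["messages"] if m["role"] == "user"]
    retries = sum(
        1 for prev, cur in zip(user_turns, user_turns[1:])
        if cur.strip().lower() == prev.strip().lower()  # user repeated themselves verbatim
    )
    frustrated = any(FRUSTRATION_PATTERNS.search(turn) for turn in user_turns)
    return {"num_user_turns": len(user_turns), "retries": retries, "frustrated": frustrated}
```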
Effective analysis requires segmenting data by characteristics (e.g., user demographics, query types), similar to marketing strategies for targeting audiences.
Standard data analysis techniques (summarization, clustering, aggregation) are used to extract themes and identify areas for improvement.
Example workflow: extract summaries and errors from conversations, cluster data, detect patterns (e.g., high demand for data visualization tools).
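A simplified sketch of that workflow; TF-IDF plus KMeans stand in for whatever summarization and clustering pipeline is actually used, and the summaries are illustrative:

```python
# Sketch of the workflow: take per-conversation summaries, cluster them,
# then count cluster sizes to surface recurring themes.

from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

summaries = [
    "User asks how to plot training curves",          # illustrative data
    "User reports an error when exporting a chart",
    "User wants to compare runs in a dashboard",
]

vectors = TfidfVectorizer().fit_transform(summaries)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Aggregate: cluster sizes hint at where demand (e.g. data visualization) concentrates.
print(Counter(labels))
```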
The cura library is introduced for automating these steps: summarizing, clustering, and aggregating conversational data.
By segmenting and analyzing clusters, teams can target specific user needs (e.g., high volume of SEO-related queries may justify new integrations or prompt changes).
Impact-weighted analysis aligns development priorities with observed user behavior (e.g., investing in features that affect the largest or most active user groups).
Many improvements come from infrastructure tweaks (e.g., adding filters or better extraction steps), not solely from AI model enhancements.
Evaluating clusters against key performance indicators (KPIs) reveals what should be fixed, built, or deprioritized based on usage and performance.
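A small sketch of impact-weighted prioritization against such KPIs; the cluster names, counts, and the volume-times-failure-rate score are illustrative assumptions, not a prescribed metric:

```python
# Sketch: rank clusters by an impact score so fixes target the largest,
# worst-performing segments first.

import pandas as pd

clusters = pd.DataFrame({
    "cluster": ["data visualization", "SEO queries", "auth errors"],
    "conversations": [1200, 450, 90],      # usage volume per cluster
    "failure_rate": [0.18, 0.35, 0.60],    # share of conversations flagged as failed
})

clusters["impact"] = clusters["conversations"] * clusters["failure_rate"]
print(clusters.sort_values("impact", ascending=False))
```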
Continuous Evaluation and Product Development Strategy 16:07
Systematic data-driven evaluation informs the product roadmap and supports rapid hypothesis testing and iteration.
Recommendations include: focus on retrieval quality first, use custom evals rather than public benchmarks, and consider synthetic data until user data is available.
Once real user data exists, extract structure from conversations and analyze for usage patterns, errors, and opportunities.
Population-level cluster analysis guides prioritization of development work and justifies investments with quantitative evidence.
Resources (Jupyter notebooks, reports) are available to help practitioners implement these methods on their own datasets.