How to look at your data — Jeff Huber (Chroma) + Jason Liu (567)

Introduction and Overview 00:00

  • Jeff Huber (Chroma) and Jason Liu (567) present on best practices for analyzing data as AI practitioners.
  • All content, tools, and code discussed are open source and available for download via QR codes.
  • The focus is on practical methods to systematically improve AI systems by "looking at your data" from both input and output perspectives.

How to Look at Your Inputs 01:32

  • Assessing retrieval quality involves various strategies: guessing, using LLMs as judges (which can be expensive and slow), or relying on public benchmarks.
  • The talk advocates "fast evals": quick, inexpensive, empirical evaluations that use query-document pairs ("golden datasets") to check whether the expected documents are retrieved.
  • Fast evals enable rapid experimentation and systematic improvement.
  • Synthetic queries can be generated by LLMs, but require careful design to avoid unrealistic scenarios.
  • Public benchmarks (like MTEB) may show a new embedding model excelling in general, but that does not guarantee it performs well on your own data.
  • Empirical testing on your dataset can reveal which embedding model actually performs best for your specific application (see the sketch after this list).
  • Case study: For the Weights & Biases chatbot, empirical tests showed Voyage 3 Large performed best, even though public benchmarks suggested otherwise.
  • Full report, open source code, and further resources are made available for independent experimentation.
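
A minimal sketch of the fast-eval idea: measure recall@k over a small hand-built golden dataset and compare candidate embedding models on your own data. The model names, documents, and query pairs below are illustrative placeholders, not the ones used in the talk.

```python
# Fast eval sketch: recall@k over a golden dataset of (query, expected doc) pairs.
import numpy as np
from sentence_transformers import SentenceTransformer

# Golden dataset: each query maps to the id of the document it should retrieve.
documents = {
    "doc1": "How to reset your API key from the dashboard.",
    "doc2": "Billing is charged monthly based on seat count.",
    "doc3": "Export experiment metrics as CSV from the runs page.",
}
golden_pairs = [
    ("where do I rotate my API key", "doc1"),
    ("how am I billed", "doc2"),
    ("download run metrics as a spreadsheet", "doc3"),
]

def recall_at_k(model_name: str, k: int = 1) -> float:
    """Embed docs and queries, retrieve top-k by cosine similarity,
    and report the fraction of queries whose expected doc is in the top k."""
    model = SentenceTransformer(model_name)
    doc_ids = list(documents)
    doc_vecs = model.encode([documents[d] for d in doc_ids], normalize_embeddings=True)
    hits = 0
    for query, expected_id in golden_pairs:
        q_vec = model.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q_vec                      # cosine similarity (vectors are normalized)
        top_k = [doc_ids[i] for i in np.argsort(-scores)[:k]]
        hits += expected_id in top_k
    return hits / len(golden_pairs)

# Compare candidate models on *your* data rather than a public leaderboard.
for name in ["all-MiniLM-L6-v2", "multi-qa-mpnet-base-dot-v1"]:   # placeholder models
    print(name, recall_at_k(name, k=1))
```

Because the whole loop is cheap and local, it can be rerun on every change to chunking, embeddings, or retrieval parameters.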

How to Look at Your Outputs 07:12

  • Output analysis is essential for understanding how users interact with systems, particularly when manual review becomes impractical due to scale.
  • Conversation logs contain rich feedback—user frustrations, retry attempts, and implicit feedback can be extracted directly from these logs.
  • Effective analysis requires segmenting data by characteristics (e.g., user demographics, query types), similar to marketing strategies for targeting audiences.
  • Standard data analysis techniques (summarization, clustering, aggregation) are used to extract themes and identify areas for improvement.
  • Example workflow: extract summaries and errors from conversations, cluster data, detect patterns (e.g., high demand for data visualization tools).
  • The Kura library is introduced for automating these steps: summarizing, clustering, and aggregating conversational data (a generic sketch of the workflow follows this list).
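
A minimal, generic sketch of the summarize → embed → cluster workflow described above. It does not use Kura's actual API; the summarize() stub stands in for an LLM call, and the model name and sample conversations are placeholders.

```python
# Summarize conversations, embed the summaries, cluster them to surface themes.
from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

conversations = [
    "User asked how to plot a confusion matrix, retried twice after errors.",
    "User wanted a bar chart of weekly sales, gave up after a timeout.",
    "User asked about pricing tiers for the team plan.",
    "User requested a line chart comparing two experiments.",
]

def summarize(conversation: str) -> str:
    # Placeholder: in practice this would be an LLM call that extracts the
    # user's intent, any errors, and whether the request succeeded.
    return conversation

summaries = [summarize(c) for c in conversations]

# Embed the summaries and cluster them to find recurring themes.
model = SentenceTransformer("all-MiniLM-L6-v2")            # placeholder model
vectors = model.encode(summaries, normalize_embeddings=True)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Aggregate: cluster sizes hint at where user demand concentrates.
print(Counter(labels))
for summary, label in zip(summaries, labels):
    print(label, summary)
```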

Making Decisions with Data Segmentation 13:07

  • By segmenting and analyzing clusters, teams can target specific user needs (e.g., high volume of SEO-related queries may justify new integrations or prompt changes).
  • Impact-weighted analysis aligns development priorities with observed user behavior, e.g., investing in features that affect the largest or most active user groups (see the sketch after this list).
  • Many improvements come from infrastructure tweaks (e.g., adding filters or better extraction steps), not solely from AI model enhancements.
  • Evaluating clusters against key performance indicators (KPIs) reveals what should be fixed, built, or deprioritized based on usage and performance.
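
A short sketch of impact-weighted prioritization: join cluster volume with a KPI (here, success rate) and rank clusters by how much unsuccessful traffic they account for. The cluster names and numbers are made up for illustration.

```python
# Rank clusters by "impact" = volume of conversations that currently fail.
import pandas as pd

clusters = pd.DataFrame({
    "cluster": ["data visualization", "SEO queries", "account setup"],
    "conversations": [1200, 800, 150],        # volume observed in the logs
    "success_rate": [0.55, 0.80, 0.90],       # KPI measured per cluster
})

clusters["impact"] = clusters["conversations"] * (1 - clusters["success_rate"])
print(clusters.sort_values("impact", ascending=False))
```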

Continuous Evaluation and Product Development Strategy 16:07

  • Systematic data-driven evaluation informs the product roadmap and supports rapid hypothesis testing and iteration.
  • Recommendations include: focus on retrieval quality first, rely on custom evals rather than public benchmarks, and use synthetic data until real user data is available (a generation sketch follows this list).
  • Once real user data exists, extract structure from conversations and analyze for usage patterns, errors, and opportunities.
  • Population-level cluster analysis guides prioritization of development work and justifies investments with quantitative evidence.
  • Resources (Jupyter notebooks, reports) are available to help practitioners implement these methods on their own datasets.
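
One way to bootstrap a golden dataset before real user data exists is to ask an LLM to write a realistic question for each document chunk, then reuse the (question, chunk) pairs as fast-eval golden pairs. The prompt, model name, and chunks below are illustrative assumptions, not taken from the talk.

```python
# Generate synthetic queries per document chunk to seed a golden dataset.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

chunks = {
    "doc1": "How to reset your API key from the dashboard.",
    "doc2": "Export experiment metrics as CSV from the runs page.",
}

def synthetic_query(chunk: str) -> str:
    """Ask the model for one question a real user might type that this chunk answers."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": (
                "Write one short, realistic question a user might type into a "
                "support chatbot that is answered by this passage. Avoid copying "
                f"its exact wording.\n\nPassage: {chunk}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()

golden_pairs = [(synthetic_query(text), doc_id) for doc_id, text in chunks.items()]
print(golden_pairs)
```

Keeping the generation prompt grounded in real passages, and spot-checking the output, helps avoid the unrealistic synthetic queries cautioned against earlier.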

Q&A and Closing Thoughts 18:33

  • Additional resources and notebooks for replicating the discussed analysis are provided via QR codes.
  • Final Q&A includes a "spicy take": agent businesses should price based on successful work delivered, not on tokens used.
  • Session concludes with an invitation for further questions and discussion outside the recorded talk.