Evals Are Not Unit Tests — Ido Pesok, Vercel v0

Introduction to Vercel v0 and the Fruit Letter Counter App 00:01

  • Vercel v0 is a full-stack, no-code platform designed for quick prototyping and web app development.
  • New features include GitHub sync for pushing and pulling code, branch switching, and collaboration via PRs.
  • v0 has passed the milestone of 100 million messages sent.
  • The talk focuses on application-layer evals: evaluating AI models in real product use cases rather than in controlled research-lab settings.

Challenges of Reliability in AI Apps 02:13

  • Even simple AI apps like the Fruit Letter Counter expose LLM unreliability after initial tests pass.
  • After deployment, unexpected user queries (e.g., "how many Rs in strawberry" or counting letters across several fruits at once) can surface failures that limited pre-launch tests never caught.
  • Traditional unit and end-to-end tests cover only deterministic code, leaving critical gaps in AI-driven components.
  • Most of the app behaves reliably, but a small, critical slice of queries fails often enough to hurt the user experience.

Visualizing and Building Evals 05:06

  • Evals are presented via a basketball court analogy: shots (data points) plotted as made or missed attempts.
  • Proximity to the basket represents task difficulty and match to the app’s core domain; out-of-bounds shots represent irrelevant or out-of-scope queries.
  • Building robust evals requires collecting real user data, understanding the app's boundaries, and ensuring diverse coverage across the "court" (a case schema is sketched after this list).
  • Prompts outside the core domain should not be prioritized for eval development.
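
One way to make the court analogy concrete is a small case schema that records where each "shot" lands. This is a sketch only; the field names (category, difficulty, inScope) and the sample cases are illustrative, not from the talk.

```ts
// Illustrative schema for mapping eval cases onto the "court".
// Field names and example values here are assumptions for the sketch.
type EvalCase = {
  input: string;          // the user prompt (a "shot attempt")
  expected: string;       // what a correct answer looks like
  category: string;       // area of the court, e.g. "single-fruit"
  difficulty: 1 | 2 | 3;  // distance from the basket: 1 = core, 3 = edge of the domain
  inScope: boolean;       // false = out of bounds, not worth prioritizing
};

const court: EvalCase[] = [
  { input: "How many r's are in strawberry?", expected: "3", category: "single-fruit", difficulty: 1, inScope: true },
  { input: "How many a's are in banana and papaya?", expected: "6", category: "multi-fruit", difficulty: 2, inScope: true },
  { input: "Write me a poem about fruit", expected: "", category: "off-topic", difficulty: 3, inScope: false },
];
```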

Data Collection and Understanding Your "Court" 07:54

  • Gather as much real user data as possible through thumbs up/down feedback, random sampling from logs (a sampling sketch follows this list), community forums, and social media reports.
  • No shortcuts exist; thorough, regular data review is essential.
  • A balanced and well-mapped eval dataset exposes both strong and weak app areas, helping teams target future improvements.
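
A minimal sketch of assembling an eval dataset from real usage, assuming a hypothetical fetchLogs accessor and a thumbsDown flag on logged messages; adapt it to whatever logging and feedback store the app actually uses.

```ts
// Sketch: build an eval dataset from real traffic. `fetchLogs` and the
// `thumbsDown` field are hypothetical stand-ins for your own logging store.
type LoggedMessage = { id: string; input: string; thumbsDown?: boolean };

function shuffle<T>(items: T[]): T[] {
  // Fisher-Yates shuffle so the random sample is unbiased.
  const arr = [...items];
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}

async function buildEvalDataset(
  fetchLogs: () => Promise<LoggedMessage[]>,
  sampleSize: number,
): Promise<LoggedMessage[]> {
  const logs = await fetchLogs();

  // Keep every message with explicit negative feedback -- known misses.
  const flagged = logs.filter((m) => m.thumbsDown);

  // Add a random sample of everything else for balanced court coverage.
  const sampled = shuffle(logs.filter((m) => !m.thumbsDown)).slice(0, sampleSize);

  return [...flagged, ...sampled];
}
```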

Structuring Evals: Constants and Tasks 09:15

  • Distinguish constants (the real user data you evaluate against) from variables (the parts of the task you experiment with, such as system prompts, pre-processing, and RAG).
  • This separation promotes clarity, reuse, and easier updates when experimenting with new configurations.
  • Use middleware (e.g., in the AI SDK) to share logic between API routes and evals, so "practice" matches "real game" conditions (see the sketch below).
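
As a rough illustration of the constants/variables split, the "task" below is a single function that both the API route and the eval runner import, so the same prompting code runs in practice and in the real game. It uses generateText from the AI SDK; the model choice, system prompt, and file layout are assumptions for the sketch, not the speaker's actual setup.

```ts
// task.ts -- the "variable" part you experiment with (prompt, model, RAG, ...).
// Import this from both the API route and the eval runner so they stay in sync.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function countLetters(userInput: string): Promise<string> {
  const { text } = await generateText({
    model: openai("gpt-4o-mini"), // illustrative model choice
    system: "You count letters in fruit names. Put the final count inside <answer> tags.",
    prompt: userInput,
  });
  return text;
}
```

The eval dataset (the "constants") lives in its own module and does not change when the task does, so runs before and after a prompt change stay comparable.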

Designing Effective Scoring for Evals 10:40

  • Scoring strategies must fit the domain; deterministic pass/fail checks are preferred for simplicity and clarity, aiding debugging and collaboration.
  • Complex tasks may require human review to collect accurate evaluation signals.
  • Enhancing prompts for easier automated scoring (e.g., using answer tags) is encouraged during evaluation, even if the tags are not used in production (a scorer sketch follows this list).
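
A deterministic scorer along these lines is straightforward once answers are wrapped in tags; the regex and function names are illustrative.

```ts
// Sketch of a deterministic pass/fail scorer built on the <answer> tag convention.
function extractAnswer(output: string): string | null {
  const match = output.match(/<answer>([\s\S]*?)<\/answer>/);
  return match ? match[1].trim() : null;
}

function scoreCase(output: string, expected: string): boolean {
  const answer = extractAnswer(output);
  return answer !== null && answer === expected.trim();
}
```

Because the check is a plain string comparison, a failing case is easy to reproduce and debug, which is the collaboration benefit the talk points to.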

Continuous Integration and Reporting with Evals 12:38

  • Integrate evals into CI pipelines to monitor and report improvements or regressions with each code or prompt change (a CI runner sketch follows this list).
  • Visualizations (like the basketball court) help identify which app areas improve or break with each iteration.
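
One way to wire this into CI is a small runner that executes every case, prints a per-category report (a textual stand-in for the court visualization), and fails the build when the overall pass rate drops below a threshold. The threshold, case shape, and function names are assumptions for the sketch.

```ts
// Sketch of a CI eval runner. `runTask` is the shared task function and
// `score` the deterministic scorer; both are passed in to keep this self-contained.
async function runEvalsInCI(
  cases: { input: string; expected: string; category: string }[],
  runTask: (input: string) => Promise<string>,
  score: (output: string, expected: string) => boolean,
  threshold = 0.9, // illustrative minimum pass rate
): Promise<void> {
  const results: { category: string; passed: boolean }[] = [];
  for (const c of cases) {
    const output = await runTask(c.input);
    results.push({ category: c.category, passed: score(output, c.expected) });
  }

  // Per-category report: shows which areas of the "court" improved or broke.
  const byCategory = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const entry = byCategory.get(r.category) ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (r.passed) entry.passed += 1;
    byCategory.set(r.category, entry);
  }
  for (const [category, { passed, total }] of byCategory) {
    console.log(`${category}: ${passed}/${total} passed`);
  }

  const overall = results.filter((r) => r.passed).length / results.length;
  if (overall < threshold) {
    console.error(`Overall pass rate ${overall.toFixed(2)} is below ${threshold}`);
    process.exit(1); // fail the CI job
  }
}
```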

Summary and Outcomes of Good Evals Practice 13:10

  • Evals should be treated as core to the application's data and model improvement loop.
  • Systematically applying and tracking evals leads to greater app reliability, better user experience, higher retention, and reduced operational overhead.
  • Continuous measurement and visual reporting enable precise and targeted improvements.

Q&A and Additional Insights 14:18

  • Running evals frequently (daily or pre-deployment) helps track performance regressions.
  • Running the same questions multiple times measures consistency and identifies where reliability drops on harder or less common inputs (see the sketch below).
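
Consistency can be measured with a loop as simple as the one below: run one input several times and report the fraction of passes. Names and the default repeat count are illustrative.

```ts
// Sketch: repeat the same question to estimate how consistently it passes.
async function consistencyRate(
  input: string,
  expected: string,
  runTask: (input: string) => Promise<string>,
  score: (output: string, expected: string) => boolean,
  runs = 5, // illustrative repeat count
): Promise<number> {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    const output = await runTask(input);
    if (score(output, expected)) passes++;
  }
  return passes / runs; // 1.0 = always correct; lower values flag flaky areas
}
```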