Evals Are Not Unit Tests — Ido Pesok, Vercel v0

Introduction to Vercel v0 and the Fruit Letter Counter App 00:01

  • Vercel v0 is a full-stack, no-code platform designed for quick prototyping and web app development.
  • New features include GitHub sync for pushing and pulling code, branch switching, and collaboration via PRs.
  • v0 has passed the milestone of 100 million messages sent.
  • The talk focuses on application-layer evals: evaluating AI models in real product use cases rather than in controlled research-lab settings.

Challenges of Reliability in AI Apps 02:13

  • Even simple AI apps like the Fruit Letter Counter expose LLM unreliability after initial tests pass.
  • After deployment, unexpected user queries (e.g., "how many Rs in strawberry" or counting letters across several fruits at once) can surface failures that limited pre-launch tests never caught.
  • Traditional unit and end-to-end tests cover only deterministic code, leaving critical gaps in AI-driven components.
  • Most of the app behaves reliably, but a small, critical slice of queries fails often enough to hurt the user experience.

Visualizing and Building Evals 05:06

  • Evals are presented via a basketball court analogy: shots (data points) plotted as made or missed attempts.
  • Proximity to the basket represents task difficulty and match to the app’s core domain; out-of-bounds shots represent irrelevant or out-of-scope queries.
  • Building robust evals requires collecting real user data, understanding the app's boundaries, and ensuring diverse coverage across the "court" (a case schema is sketched after this list).
  • Prompts outside the core domain should not be prioritized for eval development.
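
One way to make the court analogy concrete is a small case schema that records where each "shot" lands. This is a sketch only; the field names (category, difficulty, inScope) and the sample cases are illustrative, not from the talk.

```ts
// Illustrative schema for mapping eval cases onto the "court".
// Field names and example values here are assumptions for the sketch.
type EvalCase = {
  input: string;          // the user prompt (a "shot attempt")
  expected: string;       // what a correct answer looks like
  category: string;       // area of the court, e.g. "single-fruit"
  difficulty: 1 | 2 | 3;  // distance from the basket: 1 = core, 3 = edge of the domain
  inScope: boolean;       // false = out of bounds, not worth prioritizing
};

const court: EvalCase[] = [
  { input: "How many r's are in strawberry?", expected: "3", category: "single-fruit", difficulty: 1, inScope: true },
  { input: "How many a's are in banana and papaya?", expected: "6", category: "multi-fruit", difficulty: 2, inScope: true },
  { input: "Write me a poem about fruit", expected: "", category: "off-topic", difficulty: 3, inScope: false },
];
```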

Data Collection and Understanding Your "Court" 07:54

  • Gather as much real user data as possible through thumbs up/down feedback, random sampling from logs (a sampling sketch follows this list), community forums, and social media reports.
  • No shortcuts exist; thorough, regular data review is essential.
  • A balanced and well-mapped eval dataset exposes both strong and weak app areas, helping teams target future improvements.
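
A minimal sketch of assembling an eval dataset from real usage, assuming a hypothetical fetchLogs accessor and a thumbsDown flag on logged messages; adapt it to whatever logging and feedback store the app actually uses.

```ts
// Sketch: build an eval dataset from real traffic. `fetchLogs` and the
// `thumbsDown` field are hypothetical stand-ins for your own logging store.
type LoggedMessage = { id: string; input: string; thumbsDown?: boolean };

function shuffle<T>(items: T[]): T[] {
  // Fisher-Yates shuffle so the random sample is unbiased.
  const arr = [...items];
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}

async function buildEvalDataset(
  fetchLogs: () => Promise<LoggedMessage[]>,
  sampleSize: number,
): Promise<LoggedMessage[]> {
  const logs = await fetchLogs();

  // Keep every message with explicit negative feedback -- known misses.
  const flagged = logs.filter((m) => m.thumbsDown);

  // Add a random sample of everything else for balanced court coverage.
  const sampled = shuffle(logs.filter((m) => !m.thumbsDown)).slice(0, sampleSize);

  return [...flagged, ...sampled];
}
```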

Structuring Evals: Constants and Tasks 09:15

  • Distinguish constants (the real user data you evaluate against) from variables (the parts of the task you experiment with, such as system prompts, pre-processing, and RAG).
  • This separation promotes clarity, reuse, and easier updates when experimenting with new configurations.
  • Use middleware (e.g., in the AI SDK) to share logic between API routes and evals, so "practice" matches "real game" conditions (see the sketch below).
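
As a rough illustration of the constants/variables split, the "task" below is a single function that both the API route and the eval runner import, so the same prompting code runs in practice and in the real game. It uses generateText from the AI SDK; the model choice, system prompt, and file layout are assumptions for the sketch, not the speaker's actual setup.

```ts
// task.ts -- the "variable" part you experiment with (prompt, model, RAG, ...).
// Import this from both the API route and the eval runner so they stay in sync.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function countLetters(userInput: string): Promise<string> {
  const { text } = await generateText({
    model: openai("gpt-4o-mini"), // illustrative model choice
    system: "You count letters in fruit names. Put the final count inside <answer> tags.",
    prompt: userInput,
  });
  return text;
}
```

The eval dataset (the "constants") lives in its own module and does not change when the task does, so runs before and after a prompt change stay comparable.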

Designing Effective Scoring for Evals 10:40

  • Scoring strategies must fit the domain; deterministic pass/fail checks are preferred for simplicity and clarity, aiding debugging and collaboration.
  • Complex tasks may require human review to collect accurate evaluation signals.
  • Enhancing prompts for easier automated scoring (e.g., using answer tags) is encouraged during evaluation, even if the tags are not used in production (a scorer sketch follows this list).
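
A deterministic scorer along these lines is straightforward once answers are wrapped in tags; the regex and function names are illustrative.

```ts
// Sketch of a deterministic pass/fail scorer built on the <answer> tag convention.
function extractAnswer(output: string): string | null {
  const match = output.match(/<answer>([\s\S]*?)<\/answer>/);
  return match ? match[1].trim() : null;
}

function scoreCase(output: string, expected: string): boolean {
  const answer = extractAnswer(output);
  return answer !== null && answer === expected.trim();
}
```

Because the check is a plain string comparison, a failing case is easy to reproduce and debug, which is the collaboration benefit the talk points to.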

Continuous Integration and Reporting with Evals 12:38

  • Integrate evals into CI pipelines to monitor and report improvements or regressions with each code or prompt change (a CI runner sketch follows this list).
  • Visualizations (like the basketball court) help identify which app areas improve or break with each iteration.
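
One way to wire this into CI is a small runner that executes every case, prints a per-category report (a textual stand-in for the court visualization), and fails the build when the overall pass rate drops below a threshold. The threshold, case shape, and function names are assumptions for the sketch.

```ts
// Sketch of a CI eval runner. `runTask` is the shared task function and
// `score` the deterministic scorer; both are passed in to keep this self-contained.
async function runEvalsInCI(
  cases: { input: string; expected: string; category: string }[],
  runTask: (input: string) => Promise<string>,
  score: (output: string, expected: string) => boolean,
  threshold = 0.9, // illustrative minimum pass rate
): Promise<void> {
  const results: { category: string; passed: boolean }[] = [];
  for (const c of cases) {
    const output = await runTask(c.input);
    results.push({ category: c.category, passed: score(output, c.expected) });
  }

  // Per-category report: shows which areas of the "court" improved or broke.
  const byCategory = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const entry = byCategory.get(r.category) ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (r.passed) entry.passed += 1;
    byCategory.set(r.category, entry);
  }
  for (const [category, { passed, total }] of byCategory) {
    console.log(`${category}: ${passed}/${total} passed`);
  }

  const overall = results.filter((r) => r.passed).length / results.length;
  if (overall < threshold) {
    console.error(`Overall pass rate ${overall.toFixed(2)} is below ${threshold}`);
    process.exit(1); // fail the CI job
  }
}
```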

Summary and Outcomes of Good Evals Practice 13:10

  • Evals should be treated as core to the application's data and model improvement loop.
  • Systematically applying and tracking evals leads to greater app reliability, better user experience, higher retention, and reduced operational overhead.
  • Continuous measurement and visual reporting enable precise and targeted improvements.

Q&A and Additional Insights 14:18

  • Running evals frequently (daily or pre-deployment) helps track performance regressions.
  • Running the same questions multiple times measures consistency and identifies where reliability drops on harder or less common inputs (see the sketch below).
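
Consistency can be measured with a loop as simple as the one below: run one input several times and report the fraction of passes. Names and the default repeat count are illustrative.

```ts
// Sketch: repeat the same question to estimate how consistently it passes.
async function consistencyRate(
  input: string,
  expected: string,
  runTask: (input: string) => Promise<string>,
  score: (output: string, expected: string) => boolean,
  runs = 5, // illustrative repeat count
): Promise<number> {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    const output = await runTask(input);
    if (score(output, expected)) passes++;
  }
  return passes / runs; // 1.0 = always correct; lower values flag flaky areas
}
```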