Introduction to Vercel v0 and the Fruit Letter Counter App 00:01
Vercel v0 is a full-stack, no-code platform designed for quick prototyping and web app development.
New features include GitHub sync for pushing and pulling code, branch switching, and collaboration via PRs.
v0 has reached a milestone of 100 million messages sent.
The talk focuses on application-layer evals, highlighting the need to evaluate AI models in real-world use cases rather than in controlled research-lab settings.
AI apps, even simple ones like the Fruit Letter Counter, expose LLM unreliability despite passing initial tests.
After deployment, unexpected user queries (e.g., "how many Rs in strawberry", or counting letters across multiple fruits) expose failures that limited pre-launch testing never caught.
Traditional unit and end-to-end tests cover only deterministic code, leaving critical gaps in AI-driven components.
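To make that gap concrete, here is a minimal sketch (not from the talk) contrasting the unit-testable deterministic code with eval cases over a hypothetical LLM-backed answerFruitQuestion call; the function name, stub, and cases are assumptions for illustration:

```typescript
// Deterministic helper: trivially covered by an ordinary unit test.
function countLetter(word: string, letter: string): number {
  return [...word.toLowerCase()].filter((c) => c === letter.toLowerCase()).length;
}

// Hypothetical LLM-backed call; the real app would hit a model provider here.
async function answerFruitQuestion(prompt: string): Promise<string> {
  return `Answer for: ${prompt}`; // stubbed so the sketch is self-contained
}

// Eval cases: real user-style prompts with expected answers.
// Unit tests on countLetter() alone never exercise these paths.
const evalCases = [
  { prompt: "how many Rs in strawberry", expected: "3" },
  { prompt: "total number of a's in banana and papaya", expected: "6" },
];

async function runEvals(): Promise<void> {
  console.assert(countLetter("strawberry", "r") === 3); // deterministic part passes
  for (const { prompt, expected } of evalCases) {
    const answer = await answerFruitQuestion(prompt);
    const pass = answer.includes(expected);
    console.log(`${pass ? "PASS" : "FAIL"} | ${prompt} -> ${answer}`);
  }
}

runEvals();
```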
Most of the app's behavior will work reliably, but a small, crucial share of AI-driven functionality can fail often enough to hurt the user experience.
Evals are presented via a basketball court analogy: shots (data points) plotted as made or missed attempts.
Proximity to the basket represents task difficulty and match to the app’s core domain; out-of-bounds shots represent irrelevant or out-of-scope queries.
Building robust evals requires collecting real user data, understanding boundaries, and ensuring diverse coverage across the "court".
Prompts far outside the core domain (the out-of-bounds shots) should not be prioritized for eval development.
Data Collection and Understanding Your "Court" 07:54
Gather as much real user data as possible through thumbs up/down feedback, random sampling from logs, community forums, and social media reports.
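A minimal sketch of two of those collection paths, feedback capture and random sampling from logs; the types and in-memory store are assumptions, not v0's actual logging API:

```typescript
// Hypothetical interaction record; a real app would persist this in a database.
type Interaction = { id: string; prompt: string; response: string; feedback?: "up" | "down" };

const interactionLog: Interaction[] = [];

// Path 1: record explicit thumbs up/down feedback alongside the interaction.
function recordFeedback(id: string, feedback: "up" | "down"): void {
  const entry = interactionLog.find((i) => i.id === id);
  if (entry) entry.feedback = feedback;
}

// Path 2: randomly sample logged interactions for manual review,
// so the eval set covers more than just the complained-about cases.
function sampleForReview(n: number): Interaction[] {
  const shuffled = [...interactionLog].sort(() => Math.random() - 0.5);
  return shuffled.slice(0, n);
}
```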
No shortcuts exist; thorough, regular data review is essential.
A balanced and well-mapped eval dataset exposes both strong and weak app areas, helping teams target future improvements.
Evals should be treated as core to the application's data and model improvement loop.
Systematically applying and tracking evals leads to greater app reliability, better user experience, higher retention, and reduced operational overhead.
Continuous measurement and visual reporting enable precise and targeted improvements.
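One way such measurement might look in practice (a sketch, assuming each eval result is tagged with a category like "single-fruit" or "multi-fruit"):

```typescript
type EvalResult = { category: string; pass: boolean };

// Roll results up into a per-category pass rate so weak areas stand out.
function passRateByCategory(results: EvalResult[]): Record<string, number> {
  const totals: Record<string, { pass: number; total: number }> = {};
  for (const r of results) {
    const t = (totals[r.category] ??= { pass: 0, total: 0 });
    t.total += 1;
    if (r.pass) t.pass += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([cat, t]) => [cat, t.pass / t.total]),
  );
}

// Example output such as { "single-fruit": 0.95, "multi-fruit": 0.6 }
// points directly at where the next round of prompt or model work should go.
```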
Running evals frequently (daily or pre-deployment) helps track performance regressions.
By running the same questions multiple times, teams can measure consistency and identify areas where reliability drops with more challenging or less common inputs.
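A sketch of that consistency check, reusing the hypothetical answerFruitQuestion stub from the earlier sketch; the run count is an assumption:

```typescript
// Hypothetical LLM-backed call (same assumption as the earlier sketch).
async function answerFruitQuestion(prompt: string): Promise<string> {
  return `Answer for: ${prompt}`; // stub so the sketch stands alone
}

// Run one prompt N times and report how often the expected answer appears,
// e.g. nightly or before each deployment to catch regressions.
async function consistency(prompt: string, expected: string, runs = 10): Promise<number> {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    const answer = await answerFruitQuestion(prompt);
    if (answer.includes(expected)) passes++;
  }
  return passes / runs; // 1.0 = always right; lower values flag unreliable areas
}
```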