The speaker, Manu, from Braintrust, shares his personal "eval journey," which began with a childhood disappointment in rule-based technology.
He dedicated his career to software engineering in the AI industry, including working on self-driving cars.
In self-driving cars, tuning models or adjusting loss functions was not enough to ship to production; the team also needed to know whether the model actually worked in real-world scenarios, such as avoiding pedestrians and obeying traffic laws.
This experience highlighted the necessity of evals to contextualize and validate AI models for real-world applications.
Evals are not just unit tests for AI, nor are they solely a way to catch regressions.
Relying on shipping to production for signal on changes is expensive, slow, and risky.
Investing in good evals creates a "laboratory" for running experiments, allowing 90% of the product iteration loop to occur before production.
This process enables much quicker and more confident shipping of AI products.
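A minimal sketch of what such a "laboratory" might look like, assuming a tiny hand-written dataset and an exact-match scorer. The names `DATASET`, `exact_match`, `run_eval`, `baseline`, and `candidate` are hypothetical illustrations, not something the talk or any particular eval library defines; the two task functions stand in for two versions of an AI product.

```python
from statistics import mean
from typing import Callable

# Hypothetical eval dataset: (input, expected output) pairs.
DATASET = [
    ("2 + 2", "4"),
    ("3 * 3", "9"),
    ("10 - 7", "3"),
]

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected answer, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(task: Callable[[str], str], dataset, scorer=exact_match) -> float:
    """Run the task on every example and return the mean score."""
    return mean(scorer(task(x), expected) for x, expected in dataset)

def baseline(prompt: str) -> str:
    # Stand-in for the current product version: only handles addition.
    a, op, b = prompt.split()
    return str(int(a) + int(b)) if op == "+" else "?"

def candidate(prompt: str) -> str:
    # Stand-in for a proposed change: handles +, - and *.
    a, op, b = prompt.split()
    ops = {
        "+": lambda x, y: x + y,
        "-": lambda x, y: x - y,
        "*": lambda x, y: x * y,
    }
    return str(ops[op](int(a), int(b)))

# Compare the two versions offline, before anything ships.
baseline_score = run_eval(baseline, DATASET)
candidate_score = run_eval(candidate, DATASET)
print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f}")
```

The point of the sketch is the loop, not the scorer: each proposed change gets a score on the same dataset, so the comparison happens in seconds rather than after a production rollout.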
Applying the same metrics used offline to online production data gives data-driven insight into which real-world examples are most useful for the next iteration loop.
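One way this could work in practice, sketched under stated assumptions: the metric is reference-free (production traffic has no expected answers), and the lowest-scoring logged examples are pulled out as candidates for the next round of offline evals. `LogRecord`, `within_length_budget`, and `worst_examples` are hypothetical names, and the length-budget metric is only a placeholder for whatever metric is already used offline.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    """Hypothetical shape of one logged production interaction."""
    input: str
    output: str

def within_length_budget(output: str, budget: int = 20) -> float:
    """Placeholder reference-free metric: 1.0 within the word budget,
    decaying toward 0 as the output grows past it."""
    n_words = len(output.split())
    return 1.0 if n_words <= budget else budget / n_words

def worst_examples(logs, scorer, k):
    """Return the k lowest-scoring records, worst first."""
    scored = [(scorer(record.output), record) for record in logs]
    scored.sort(key=lambda pair: pair[0])
    return [record for _, record in scored[:k]]

# Made-up production logs for illustration only.
production_logs = [
    LogRecord("reset password", "Open Settings and choose Reset."),
    LogRecord("refund policy", " ".join(["word"] * 40)),
    LogRecord("shipping time", " ".join(["word"] * 100)),
]

# The same scorer that runs offline picks the examples worth studying next.
next_iteration = worst_examples(production_logs, within_length_budget, k=2)
for record in next_iteration:
    print(record.input, within_length_budget(record.output))
```

Because the scorer is the same function in both settings, the low-scoring production examples can be added to the offline dataset without translating between two different definitions of quality.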