Discusses practical tactics to build reliable AI applications and why these methods are not widely adopted yet
Presenter has 15 years of experience as a startup co-founder, CTO, and enterprise executive, with recent hands-on work building generative AI projects
Highlights that building reliable AI solutions is difficult due to model nondeterminism and unexpected solution impacts from changes in code, logic, prompts, or models
Many teams initially apply data science metrics (groundedness, factuality, bias), but these do not effectively measure real-world solution success
Example: A customer support bot’s actual best metric is the rate at which users escalate from AI support to human support, not just factual correctness
Emphasizes the need to focus on product experience and business outcome metrics specific to the use case
Proposes reverse engineering evaluations from the end goal, making them highly relevant to user requirements (e.g., for a support bot, every specific type of question should have clearly defined checklist criteria)
Demonstrates generating multiple variations of a key question (like password reset) and checking if answers consistently cover all required information
Recommends including different user personas in evaluation to ensure diverse phrasing and expectations are accounted for
Practical Implementation and Iterative Process 08:22
Evaluations (“evals”) should be created at the start of development, not after deployment
Initial product and evaluations are built together, then repeatedly tested; detailed results, not averages, are crucial for understanding failures
Continuous experimentation: Changing a model or prompt may fix one test but can cause regressions elsewhere—ongoing, granular testing catches unexpected issues
Achieving Reliable Benchmarks and Experimentation Freedom 10:19
The process leads to a reliable baseline (benchmark) for the application, crucial for safe future experimentation (e.g., swapping models or architectures)
Each solution type demands tailored evaluation techniques (e.g., LLM-as-judge for bots, mock database for text-to-SQL, rubric matching for classifiers)
The same thorough approach applies to building guardrails—explicitly testing for undesirable or unsupported questions/responses
AI applications should be evaluated the way users will actually interact with them, avoiding abstract or generic metrics
Frequent, use-case-specific evaluations enable rapid progress and minimize regressions
Properly defined evaluations produce explainable, transparent AI solutions
The Multinear open-source platform is mentioned as a tool for managing these evaluations, but the core methodology does not depend on any specific platform