Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear

Introduction and Problem Statement 00:00

  • Discusses practical tactics for building reliable AI applications and why these methods are not yet widely adopted
  • Presenter has 15 years of experience as a startup co-founder, CTO, and enterprise executive, with recent hands-on work building generative AI projects
  • Highlights that building reliable AI solutions is difficult because of model nondeterminism and the hard-to-predict impact that changes to code, logic, prompts, or models can have on the overall solution

Flaws in Current Evaluation Approaches 02:28

  • Many teams initially apply data science metrics (groundedness, factuality, bias), but these do not effectively measure real-world solution success
  • Example: the best metric for a customer support bot is often the rate at which users escalate from AI support to human support, not just factual correctness
  • Emphasizes the need to focus on product-experience and business-outcome metrics specific to the use case (a sketch of such a metric follows this list)
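To make the business-outcome point concrete, here is a minimal sketch of an escalation-rate metric computed from support logs. The SupportSession record, its fields, and the sample data are hypothetical illustrations, not something taken from the talk.

```python
from dataclasses import dataclass

@dataclass
class SupportSession:
    """One completed support conversation (hypothetical log record)."""
    question_category: str      # e.g. "password_reset", "billing"
    escalated_to_human: bool    # True if the user ended up with a human agent

def escalation_rate(sessions: list[SupportSession]) -> float:
    """Share of sessions the AI could not resolve on its own."""
    if not sessions:
        return 0.0
    return sum(s.escalated_to_human for s in sessions) / len(sessions)

def escalation_by_category(sessions: list[SupportSession]) -> dict[str, float]:
    """Break the metric down per question type to see where the bot struggles."""
    by_cat: dict[str, list[SupportSession]] = {}
    for s in sessions:
        by_cat.setdefault(s.question_category, []).append(s)
    return {cat: escalation_rate(group) for cat, group in by_cat.items()}

if __name__ == "__main__":
    sessions = [
        SupportSession("password_reset", escalated_to_human=False),
        SupportSession("password_reset", escalated_to_human=True),
        SupportSession("billing", escalated_to_human=True),
    ]
    print(f"overall escalation rate: {escalation_rate(sessions):.0%}")
    print(escalation_by_category(sessions))
```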

Designing Real-World, Use-Case-Specific Evaluations 04:26

  • Proposes reverse engineering evaluations from the end goal, making them highly relevant to user requirements (e.g., for a support bot, every specific type of question should have clearly defined checklist criteria)
  • Demonstrates generating multiple variations of a key question (like password reset) and checking if answers consistently cover all required information
  • Recommends including different user personas in the evaluation so that diverse phrasing and expectations are accounted for (see the sketch after this list)
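A minimal sketch of such a checklist-style eval for the password-reset question. The persona phrasings, checklist items, and simple substring check are illustrative stand-ins; in practice the talk suggests generating variations with an LLM and judging coverage with one as well.

```python
# Checklist items every password-reset answer is expected to cover
# (hypothetical criteria for illustration).
CHECKLIST = [
    "reset link",          # answer must say a reset link is sent
    "expires",             # ...and that the link expires
    "contact support",     # ...and how to reach a human if it fails
]

# The same question phrased by different personas (hypothetical examples).
VARIATIONS = {
    "new user":      "How do I reset my password?",
    "frustrated":    "I'm locked out AGAIN, password reset isn't working!!",
    "non-technical": "I forgot the word I use to log in, what do I do?",
}

def missing_items(answer: str, checklist: list[str]) -> list[str]:
    """Return the checklist items the answer fails to mention."""
    return [item for item in checklist if item.lower() not in answer.lower()]

def run_eval(bot) -> None:
    """Ask every persona's phrasing and report which criteria are missing."""
    for persona, question in VARIATIONS.items():
        answer = bot(question)
        missing = missing_items(answer, CHECKLIST)
        status = "PASS" if not missing else f"FAIL (missing: {missing})"
        print(f"[{persona}] {status}")

if __name__ == "__main__":
    # A stub bot so the sketch runs end to end; replace with the real app.
    def stub_bot(question: str) -> str:
        return ("We email you a reset link; it expires in 24 hours. "
                "If it doesn't arrive, contact support.")
    run_eval(stub_bot)
```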

Practical Implementation and Iterative Process 08:22

  • Evaluations (“evals”) should be created at the start of development, not after deployment
  • The initial product and its evaluations are built together and then tested repeatedly; detailed per-test results, not averages, are crucial for understanding failures
  • Continuous experimentation: changing a model or prompt may fix one test while causing regressions elsewhere; ongoing, granular testing catches these unexpected issues (a diff of two runs is sketched after this list)
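A minimal sketch of diffing two eval runs test by test, using hypothetical test names and pass/fail results, to show how an unchanged average score can hide a regression introduced by a prompt or model change.

```python
# Per-test results before and after a prompt change (hypothetical data).
baseline     = {"password_reset": True, "billing_refund": True,  "2fa_setup": False}
after_change = {"password_reset": True, "billing_refund": False, "2fa_setup": True}

def diff_runs(old: dict[str, bool], new: dict[str, bool]) -> None:
    """Report fixed and regressed tests alongside the (misleading) averages."""
    fixed = [t for t in new if new[t] and not old.get(t, False)]
    regressed = [t for t in new if not new[t] and old.get(t, False)]
    print(f"average: {sum(old.values())/len(old):.0%} -> "
          f"{sum(new.values())/len(new):.0%}")   # identical averages here
    print(f"fixed:     {fixed}")
    print(f"regressed: {regressed}")             # ...yet a test regressed

if __name__ == "__main__":
    diff_runs(baseline, after_change)
```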

Achieving Reliable Benchmarks and Experimentation Freedom 10:19

  • The process leads to a reliable baseline (benchmark) for the application, crucial for safe future experimentation (e.g., swapping models or architectures)
  • Each solution type demands tailored evaluation techniques (e.g., LLM-as-judge for bots, a mock database for text-to-SQL, rubric matching for classifiers); a text-to-SQL check is sketched after this list
  • The same thorough approach applies to building guardrails—explicitly testing for undesirable or unsupported questions/responses
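One way the mock-database idea for text-to-SQL could look in practice: execute the generated SQL against a small in-memory fixture and compare its result with a hand-written reference query. The schema, fixture data, and queries below are hypothetical examples, not part of the talk.

```python
import sqlite3

def setup_fixture() -> sqlite3.Connection:
    """Build a tiny in-memory database standing in for the real one."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 10.0, "paid"), (2, 25.0, "refunded"), (3, 5.0, "paid")])
    return conn

def results_match(conn: sqlite3.Connection, generated_sql: str, reference_sql: str) -> bool:
    """Pass if the generated query returns the same rows as the reference query."""
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error as exc:            # invalid SQL is an automatic failure
        print(f"generated SQL failed: {exc}")
        return False
    expected = conn.execute(reference_sql).fetchall()
    return sorted(got) == sorted(expected)

if __name__ == "__main__":
    conn = setup_fixture()
    generated = "SELECT SUM(amount) FROM orders WHERE status = 'paid'"  # model output
    reference = "SELECT SUM(amount) FROM orders WHERE status = 'paid'"  # ground truth
    print("PASS" if results_match(conn, generated, reference) else "FAIL")
```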

Key Takeaways and Closing Thoughts 13:20

  • AI applications should be evaluated the way users will actually interact with them, avoiding abstract or generic metrics
  • Frequent, use-case-specific evaluations enable rapid progress and minimize regressions
  • Properly defined evaluations produce explainable, transparent AI solutions
  • The Multinear open-source platform is mentioned as a tool for managing these evaluations, but the core methodology does not depend on any specific platform