Building AI Products That Actually Work — Ben Hylak (Raindrop), Sid Bendre (Oleve)

Introduction & Importance of Iteration in AI Products 00:01

  • Ben Hylak introduces himself as CTO of Raindrop, a company that helps identify and fix issues in AI products
  • Sid Bendre, co-founder of Oleve, joins to share his experience scaling viral AI products
  • The focus is on iterating AI products rather than discussing evaluation (eval) methods
  • Recent advancements have shown that small, focused models can excel at specialized tasks

Real-World AI Product Issues & Challenges 02:15

  • Even top companies like OpenAI face challenges shipping reliable products (e.g., issues with Codex and chatbots)
  • Examples of AI failures: Virgin Money’s chatbot misinterpreting "virgin" as inappropriate, Google Cloud confusing credits, Grok responding inappropriately to queries
  • Such mistakes usually reach public awareness only because the products involved are highly visible

Raindrop’s Experience & Approach 04:46

  • Raindrop works with a diverse range of companies and products, giving it firsthand, customer-level insight into AI deployments
  • Continual analysis of large-scale event data helps identify what works and what doesn't in AI deployments

Will Building AI Products Get Easier? 06:02

  • Some aspects get easier (e.g., getting structured responses from model APIs is far simpler than it was a year ago; see the sketch after this list)
  • Communication with AI remains inherently difficult, as clear intent is hard to specify—even for humans
  • As models become more capable, they reveal more undefined behavior and edge cases
  • You can’t fully define a product’s scope upfront; ongoing iteration is required
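
A minimal sketch of the "structured responses are easier now" point: validating a model's JSON reply against a schema before acting on it. The SupportReply schema, its fields, and the parse_reply helper are illustrative assumptions, not anything shown in the talk.

```python
# Minimal sketch: validate a model's JSON reply against a schema before
# acting on it. The schema and field names are illustrative assumptions.
from pydantic import BaseModel, ValidationError


class SupportReply(BaseModel):
    answer: str
    confidence: float            # 0.0-1.0, as requested in the prompt
    needs_human_review: bool


def parse_reply(raw_json: str) -> SupportReply | None:
    """Return a validated reply, or None if the model's output is malformed."""
    try:
        return SupportReply.model_validate_json(raw_json)
    except ValidationError:
        return None              # fall back to a retry or a safe default


# Usage: run the raw model output through the validator before trusting it.
reply = parse_reply('{"answer": "Reset your password.", "confidence": 0.9, "needs_human_review": false}')
print(reply)
```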

Misconceptions and Realities of Evals 08:31

  • Evals don’t reliably indicate overall product quality; they only test known situations and are easily saturated
  • Language models (LMs) making subjective judgments, such as rating jokes, are unreliable as evaluators
  • The best companies use curated datasets and autogradable evals rather than generic LM judgments (a minimal autograder sketch follows this list)
  • Moving offline evals to production is costly, hard to set up, and doesn’t catch unforeseen issues
  • Real-world user signals are more valuable for identifying emerging problems than static evals
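
A minimal sketch of an autogradable eval over a curated dataset, assuming exact-match grading; the CASES entries and the run_model callable are placeholders, not the speakers' actual eval suite.

```python
# Minimal sketch of an autogradable eval: a curated set of cases whose
# expected answers can be graded programmatically (exact match here).
# CASES and run_model are placeholders for illustration.
from typing import Callable

CASES = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]


def run_eval(run_model: Callable[[str], str]) -> float:
    """Return the pass rate over the curated cases."""
    passed = 0
    for case in CASES:
        output = run_model(case["input"]).strip()
        if output == case["expected"]:
            passed += 1
    return passed / len(CASES)


# Example: a stub "model" that always answers "4" passes half the cases.
print(run_eval(lambda prompt: "4"))  # 0.5
```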

Using Signals to Build Reliable AI Apps 11:04

  • Unlike traditional apps with concrete errors, AI apps require careful monitoring of user signals (explicit and implicit)
  • Explicit signals: direct feedback like thumbs up/down, portion of response copied, user preferences, errors
  • Implicit signals: inferred data such as refusals, task failures, or user frustration
  • Signals should be combined with user intent to define, discover, and refine AI issues over time (a minimal event-logging sketch follows this list)
  • Staying close to data and user feedback is essential for improving AI products
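
A minimal sketch of recording explicit and implicit signals alongside inferred user intent so issues can be defined and tracked over time; the SignalEvent fields and the in-memory sink are assumptions for illustration, not Raindrop's actual API.

```python
# Minimal sketch: log explicit and implicit signals next to the user's
# inferred intent. Field names and the in-memory "sink" are assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal

SignalKind = Literal["explicit", "implicit"]


@dataclass
class SignalEvent:
    conversation_id: str
    intent: str                   # e.g. "refund_request", inferred upstream
    kind: SignalKind
    name: str                     # e.g. "thumbs_down", "refusal", "task_failure"
    detail: str = ""
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


EVENTS: list[SignalEvent] = []    # stand-in for a real analytics sink


def log_signal(event: SignalEvent) -> None:
    EVENTS.append(event)


# Explicit signal: the user clicked thumbs-down on a refund answer.
log_signal(SignalEvent("conv-123", "refund_request", "explicit", "thumbs_down"))
# Implicit signal: the model refused a task it should have handled.
log_signal(SignalEvent("conv-124", "refund_request", "implicit", "refusal",
                       detail="I'm sorry, I can't help with that."))
```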

The Trellis Framework for Scaling AI Products 15:16

  • Sid introduces Trellis, Oleve’s systematic approach to refining and scaling viral AI products
  • Trellis has three core principles: discretization (breaking the output space into discrete focus areas), prioritization (ranking those areas by business impact), and recursive refinement (continually improving within each area)
  • The six steps of Trellis: launch an MVP to gather data, classify user intents, convert intents into workflows, prioritize workflows based on metrics, analyze failures, and refine further
  • Prioritization should combine workflow volume with negative sentiment and estimated achievable improvement, rather than usage volume alone (a priority-score sketch follows this list)
  • Structured, self-contained workflows allow faster iteration and more reliable improvements
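
A minimal sketch of the prioritization idea: score each workflow by combining volume, negative-sentiment rate, and estimated achievable improvement. The multiplicative scoring and the example workflows are assumptions; the talk does not give an exact formula.

```python
# Minimal sketch of Trellis-style prioritization: rank workflows by a score
# combining volume, negative sentiment, and estimated achievable improvement.
# The multiplicative form and the example numbers are assumptions.
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    weekly_volume: int        # sessions hitting this workflow per week
    negative_rate: float      # fraction of sessions with negative signals
    est_improvement: float    # 0-1 estimate of how much we can realistically fix


def priority(w: Workflow) -> float:
    # Roughly: sessions per week we could plausibly turn from bad to good.
    return w.weekly_volume * w.negative_rate * w.est_improvement


workflows = [
    Workflow("summarize_notes", weekly_volume=12000, negative_rate=0.08, est_improvement=0.5),
    Workflow("solve_math_problem", weekly_volume=3000, negative_rate=0.30, est_improvement=0.7),
]

for w in sorted(workflows, key=priority, reverse=True):
    print(f"{w.name}: {priority(w):.0f}")
```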

Conclusion & Key Takeaways 18:14

  • Systematic refinement makes AI “magic” repeatable, attributable, and engineered rather than accidental
  • Readers can learn more about the Trellis framework via an online blog post linked by QR code