Building AI Products That Actually Work — Ben Hylak (Raindrop), Sid Bendre (Oleve)

Introduction & Importance of Iteration in AI Products 00:01

  • Ben Hylak introduces himself as CTO of Raindrop, a company that helps identify and fix issues in AI products
  • Sid Bendre, co-founder of Oleve, joins to share his experience scaling viral AI products
  • The focus is on iterating AI products rather than discussing evaluation (eval) methods
  • Recent advancements have shown that small, focused models can excel at specialized tasks

Real-World AI Product Issues & Challenges 02:15

  • Even top companies like OpenAI face challenges shipping reliable products (e.g., issues with Codex and chatbots)
  • Examples of AI failures: Virgin Money’s chatbot misinterpreting "virgin" as inappropriate, Google Cloud confusing credits, Grok responding inappropriately to queries
  • Such mistakes usually reach public awareness only because the products involved are highly visible

Raindrop’s Experience & Approach 04:46

  • Raindrop works with a diverse range of companies and products, giving it firsthand, customer-level insight into AI deployments
  • Continual analysis of large-scale event data helps identify what works and what doesn't in AI deployments

Will Building AI Products Get Easier? 06:02

  • Some aspects get easier (e.g., getting structured responses from model APIs is far simpler than it was a year ago; see the sketch after this list)
  • Communication with AI remains inherently difficult, as clear intent is hard to specify—even for humans
  • As models become more capable, they reveal more undefined behavior and edge cases
  • You can’t fully define a product’s scope upfront; ongoing iteration is required
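
A minimal sketch of the "structured responses are easier now" point: validating a model's JSON reply against a schema before acting on it. The SupportReply schema, its fields, and the parse_reply helper are illustrative assumptions, not anything shown in the talk.

```python
# Minimal sketch: validate a model's JSON reply against a schema before
# acting on it. The schema and field names are illustrative assumptions.
from pydantic import BaseModel, ValidationError


class SupportReply(BaseModel):
    answer: str
    confidence: float            # 0.0-1.0, as requested in the prompt
    needs_human_review: bool


def parse_reply(raw_json: str) -> SupportReply | None:
    """Return a validated reply, or None if the model's output is malformed."""
    try:
        return SupportReply.model_validate_json(raw_json)
    except ValidationError:
        return None              # fall back to a retry or a safe default


# Usage: run the raw model output through the validator before trusting it.
reply = parse_reply('{"answer": "Reset your password.", "confidence": 0.9, "needs_human_review": false}')
print(reply)
```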

Misconceptions and Realities of Evals 08:31

  • Evals don’t reliably indicate overall product quality; they only test known situations and are easily saturated
  • Language models (LMs) making subjective judgments, such as rating jokes, are unreliable as evaluators
  • The best companies use curated datasets and autogradable evals rather than generic LM judgments (a minimal autograder sketch follows this list)
  • Moving offline evals to production is costly, hard to set up, and doesn’t catch unforeseen issues
  • Real-world user signals are more valuable for identifying emerging problems than static evals
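
A minimal sketch of an autogradable eval over a curated dataset, assuming exact-match grading; the CASES entries and the run_model callable are placeholders, not the speakers' actual eval suite.

```python
# Minimal sketch of an autogradable eval: a curated set of cases whose
# expected answers can be graded programmatically (exact match here).
# CASES and run_model are placeholders for illustration.
from typing import Callable

CASES = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]


def run_eval(run_model: Callable[[str], str]) -> float:
    """Return the pass rate over the curated cases."""
    passed = 0
    for case in CASES:
        output = run_model(case["input"]).strip()
        if output == case["expected"]:
            passed += 1
    return passed / len(CASES)


# Example: a stub "model" that always answers "4" passes half the cases.
print(run_eval(lambda prompt: "4"))  # 0.5
```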

Using Signals to Build Reliable AI Apps 11:04

  • Unlike traditional apps with concrete errors, AI apps require careful monitoring of user signals (explicit and implicit)
  • Explicit signals: direct feedback like thumbs up/down, portion of response copied, user preferences, errors
  • Implicit signals: inferred data such as refusals, task failures, or user frustration
  • Signals should be combined with user intent to define, discover, and refine AI issues over time (a minimal event-logging sketch follows this list)
  • Staying close to data and user feedback is essential for improving AI products
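
A minimal sketch of recording explicit and implicit signals alongside inferred user intent so issues can be defined and tracked over time; the SignalEvent fields and the in-memory sink are assumptions for illustration, not Raindrop's actual API.

```python
# Minimal sketch: log explicit and implicit signals next to the user's
# inferred intent. Field names and the in-memory "sink" are assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal

SignalKind = Literal["explicit", "implicit"]


@dataclass
class SignalEvent:
    conversation_id: str
    intent: str                   # e.g. "refund_request", inferred upstream
    kind: SignalKind
    name: str                     # e.g. "thumbs_down", "refusal", "task_failure"
    detail: str = ""
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


EVENTS: list[SignalEvent] = []    # stand-in for a real analytics sink


def log_signal(event: SignalEvent) -> None:
    EVENTS.append(event)


# Explicit signal: the user clicked thumbs-down on a refund answer.
log_signal(SignalEvent("conv-123", "refund_request", "explicit", "thumbs_down"))
# Implicit signal: the model refused a task it should have handled.
log_signal(SignalEvent("conv-124", "refund_request", "implicit", "refusal",
                       detail="I'm sorry, I can't help with that."))
```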

The Trellis Framework for Scaling AI Products 15:16

  • Sid introduces Trellis, Oleve’s systematic approach to refining and scaling viral AI products
  • Trellis has three core principles: discretization (breaking the output space into discrete focus areas), prioritization (ranking those areas by business impact), and recursive refinement (continually improving within each area)
  • The six steps of Trellis: launch an MVP to gather data, classify user intents, convert intents into workflows, prioritize workflows based on metrics, analyze failures, and refine further
  • Prioritization should combine workflow volume with negative sentiment and estimated achievable improvement, rather than usage volume alone (a priority-score sketch follows this list)
  • Structured, self-contained workflows allow faster iteration and more reliable improvements
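
A minimal sketch of the prioritization idea: score each workflow by combining volume, negative-sentiment rate, and estimated achievable improvement. The multiplicative scoring and the example workflows are assumptions; the talk does not give an exact formula.

```python
# Minimal sketch of Trellis-style prioritization: rank workflows by a score
# combining volume, negative sentiment, and estimated achievable improvement.
# The multiplicative form and the example numbers are assumptions.
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    weekly_volume: int        # sessions hitting this workflow per week
    negative_rate: float      # fraction of sessions with negative signals
    est_improvement: float    # 0-1 estimate of how much we can realistically fix


def priority(w: Workflow) -> float:
    # Roughly: sessions per week we could plausibly turn from bad to good.
    return w.weekly_volume * w.negative_rate * w.est_improvement


workflows = [
    Workflow("summarize_notes", weekly_volume=12000, negative_rate=0.08, est_improvement=0.5),
    Workflow("solve_math_problem", weekly_volume=3000, negative_rate=0.30, est_improvement=0.7),
]

for w in sorted(workflows, key=priority, reverse=True):
    print(f"{w.name}: {priority(w):.0f}")
```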

Conclusion & Key Takeaways 18:14

  • Systematic refinement makes AI “magic” repeatable, attributable, and engineered rather than accidental
  • Readers can learn more about the Trellis framework via an online blog post linked by QR code