How to build world-class AI products — Sarah Sachs (AI lead @ Notion) & Carlos Esteban (Braintrust)
Introduction and Notion AI's Evaluation Philosophy 00:00
Sarah Sachs, Lead of Notion AI, emphasizes that the rigor and excellence of building great AI products stem from observability and robust evaluations (evals).
Notion AI dedicates approximately 10% of its time to prompting and 90% to analyzing evals and usage data to ensure consistent product performance for users.
Notion is a connected workspace, integrating workplace management, documents, and third-party tools like Slack and Jira, serving over 100 million users.
Notion AI offers free trials for most of its products, which means its features must support massive scale, including concurrent usage well beyond the paid plans.
The company prioritizes design and polish, a challenge for GenAI experiences, but aims to add this quality while maintaining rapid development.
Notion partners with various foundation model providers, fine-tunes its own models, and can deploy new models to production in less than a day.
Recent launches include AI meeting notes (speech-to-text transcription and AI summaries), an enterprise search product, and a deep research tool that marks a shift from fixed workflows to agentic capabilities with reasoning.
Notion AI's first product, the AI writer, launched before ChatGPT, leveraging early access to generative models for inline content generation.
The "autofill" feature, an AI agent living in database properties, expanded AI capabilities across databases, leading to unexpected usage patterns like multi-language column translation.
Notion gradually built a core AI team and developed a RAG (Retrieval-Augmented Generation) solution, offering Q&A to free users, which required universal embeddings and multilingual workspace support.
The company began collaborating with Braintrust when launching cross-app search and attachment search, emphasizing an incremental development approach based on model capabilities.
Evaluating Notion AI is challenging due to exceptionally large datasets; Notion leverages dogfooding (internal use) to generate training and evaluation data.
Manual evaluation processes using Google Sheets overwhelmed human evaluators, highlighting the need for scalable and efficient solutions.
Notion found that quality of insights, rather than quantity, is crucial for fine-tuning and iteration, especially when utilizing human labelers.
The iteration cycle involves curating targeted datasets, often by "data specialists" (a mix of PM, data analyst, annotator), who create small, well-formatted handcrafted datasets.
Scoring functions are product-specific, relying heavily on "LLM as a judge" processes and heuristic-based deterministic functions (e.g., ensuring "Jira" is in a query for a Jira connector).
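As a hypothetical illustration of such a heuristic, deterministic scorer (not Notion's actual code), a simple check of the kind described might look like this, with the argument shape assumed for the example:

```typescript
// Hypothetical deterministic scorer: checks that a query routed to a Jira connector
// actually mentions Jira. The ScorerArgs shape is an assumption for illustration.
interface ScorerArgs {
  input: { query: string; connector: string };
  output: string;
}

function jiraConnectorScorer({ input }: ScorerArgs): { name: string; score: number } {
  const routedToJira = input.connector === "jira";
  const mentionsJira = /\bjira\b/i.test(input.query);
  // Score 1 when routing and query agree; 0 when the Jira connector was chosen
  // even though the query never mentions Jira.
  return {
    name: "jira_connector_heuristic",
    score: routedToJira && !mentionsJira ? 0 : 1,
  };
}
```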
Multilingual evaluation is critical, with dedicated datasets to prevent regressions in language context switching.
Product managers and designers actively use Braintrust to understand model performance and user needs, effectively serving as a form of user experience research (UXR).
User feedback, particularly "thumbs down" data from internal development usage, is propagated to Braintrust to inform product development, such as discovering users wanted report drafting, not just research, from a "deep research" tool.
Notion employs two main approaches for "LLM as a judge": a single prompt for general qualities (e.g., conciseness, faithfulness) and, more insightfully, a specific prompt for each data element (e.g., "answer in Japanese, bulleted, XYZ").
The latter approach allows for precise rule-based evaluation, making golden datasets more adaptable to changing information (e.g., new documents about an offsite).
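A rough sketch of the per-item rubric approach follows, assuming each dataset row stores its own rubric string and using a plain OpenAI call as the judge; the model, prompt wording, and PASS/FAIL convention are illustrative assumptions rather than Notion's implementation.

```typescript
import OpenAI from "openai";

// Each dataset item carries its own rubric (e.g., "answer in Japanese, bulleted,
// mentions the offsite doc"); the judge checks the output against that rubric
// instead of a single generic quality prompt.
const openai = new OpenAI();

async function rubricJudge(output: string, rubric: string): Promise<number> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "You are a strict grader. Given an answer and a rubric, respond with PASS " +
          "if the answer satisfies every rule in the rubric, otherwise FAIL.",
      },
      { role: "user", content: `Rubric:\n${rubric}\n\nAnswer:\n${output}` },
    ],
  });
  const verdict = response.choices[0].message.content ?? "";
  // Map the judge's verdict to a binary score.
  return verdict.trim().toUpperCase().startsWith("PASS") ? 1 : 0;
}
```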
This system outputs reliable scores, quickly identifying regressions in new models and enabling rapid iteration by allowing Notion to test and switch between models (e.g., using a cheaper, faster model like Nano for high-frequency, low-reasoning tasks).
Braintrust, or similar software, is considered critical to Notion's iteration flow and part of its intellectual property, driving improvements in AI product quality through enhanced observability.
The rigorous evaluation metrics are crucial for serving Notion's large non-English speaking user base (60% of enterprise users) with an English-speaking engineering team.
Notion uses multiple, smaller LLM judges for specific tasks (e.g., one for a single thumbs-down output), as well as broader dataset evaluations.
They have experimented with automated prompt optimization but haven't fully cracked it for their specific problems.
Thumbs-down user feedback is used to identify functional gaps but doesn't directly align with LLM-as-a-judge scores, which evaluate specific experiments.
LLM-as-a-judge scoring is on a scale, but Notion often treats scores below a certain threshold as equal failures, summarizing these failures with another LLM for engineers.
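A minimal sketch of that thresholding step, with an assumed 0.6 cutoff and record shape:

```typescript
// Below the cutoff, the exact judge score no longer matters: every row is treated
// as an equal failure and batched for an LLM-written summary that engineers can skim.
interface JudgedRow {
  input: string;
  output: string;
  judgeScore: number; // e.g., 0..1 from an LLM-as-a-judge scorer
}

function collectFailures(rows: JudgedRow[], cutoff = 0.6): JudgedRow[] {
  return rows.filter((row) => row.judgeScore < cutoff);
}

function buildSummaryPrompt(failures: JudgedRow[]): string {
  const cases = failures
    .map((f, i) => `Case ${i + 1}\nInput: ${f.input}\nOutput: ${f.output}`)
    .join("\n\n");
  return `Summarize the common failure patterns in these cases for the engineering team:\n\n${cases}`;
}
```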
A key pitfall is over-relying on a single LLM judge for all regressions; it's better to have specific judges for particular tasks (e.g., markdown formatting, language following) and to diligently inspect the losses.
Isolating RAG (Retrieval-Augmented Generation) from pure generation for evaluation is complex due to changing data indexes. Notion freezes retrieval results to evaluate retrieval separately, then evaluates generation assuming the frozen results, discarding samples where the answer is no longer in the index.
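A hypothetical sketch of the frozen-retrieval filter, with assumed field names and a simple substring check standing in for however answer presence is actually verified:

```typescript
// Retrieval is evaluated against a snapshot captured at collection time; generation
// is then evaluated assuming those frozen results, and samples whose expected answer
// is no longer present in the frozen set are discarded.
interface FrozenSample {
  question: string;
  expectedAnswer: string;
  frozenResults: { id: string; text: string }[]; // retrieval snapshot
}

function answerStillRetrievable(sample: FrozenSample): boolean {
  return sample.frozenResults.some((doc) =>
    doc.text.includes(sample.expectedAnswer)
  );
}

function generationEvalSet(samples: FrozenSample[]): FrozenSample[] {
  // Keep only samples where the frozen retrieval still contains the answer,
  // so generation quality is measured independently of index drift.
  return samples.filter(answerStillRetrievable);
}
```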
Notion manages hundreds of prompts, with dependencies handled through evals and variations. A major challenge is managing outages of model providers, which led to implementing a system for prompt ownership and config-based fallback mechanisms.
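A hypothetical sketch of what a config-based fallback might look like; the config shape, owner field, and model names are assumptions, not Notion's actual configuration format:

```typescript
// Each prompt declares an owner plus an ordered list of provider/model pairs,
// and the caller walks the list when a provider is down.
interface PromptConfig {
  owner: string; // team or engineer responsible for this prompt
  fallbacks: { provider: "anthropic" | "openai"; model: string }[];
}

const summarizePrompt: PromptConfig = {
  owner: "ai-notes-team",
  fallbacks: [
    { provider: "anthropic", model: "claude-3-5-sonnet" },
    { provider: "openai", model: "gpt-4o" },
  ],
};

async function completeWithFallback(
  config: PromptConfig,
  call: (provider: string, model: string) => Promise<string>
): Promise<string> {
  let lastError: unknown;
  for (const { provider, model } of config.fallbacks) {
    try {
      return await call(provider, model); // first healthy provider wins
    } catch (err) {
      lastError = err; // provider outage or timeout: try the next entry
    }
  }
  throw lastError;
}
```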
Braintrust helps AI teams answer critical questions, such as detecting performance regressions, selecting optimal models (cheapest, best), ensuring brand consistency, and debugging underperforming responses.
The platform aims to provide a system for statistical analysis to proactively and reactively catch mistakes, helping businesses move faster, reduce costs, and scale teams by enabling collaboration between technical and non-technical staff.
Core concepts include prompt engineering (tracking versions, rapid iteration), automated evals (SDK-triggered, 0-1 score), and observability (monitoring production, capturing user feedback).
An "eval" is defined as a structured test measuring AI system quality, reliability, and correctness across scenarios, with user-defined criteria.
An eval comprises three components: a "task" (the code/prompt being evaluated, from simple LLM calls to complex agents), a "data set" (real-world examples with input, optional expected output, and metadata), and a "score" (the scoring logic, either LLM as a judge or heuristic functions, outputting a value between 0 and 1, often shown as 0-100%).
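A minimal sketch of those three components using the Braintrust TypeScript SDK's Eval entry point; the project name, dataset rows, stub task, and the off-the-shelf Levenshtein scorer from autoevals are placeholders rather than anything shown in the workshop.

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

// Dataset of input/expected pairs, a task that produces an output for each input,
// and one or more scorers returning values between 0 and 1.
Eval("Workshop Demo", {
  data: () => [
    { input: "What is Notion?", expected: "A connected workspace." },
    { input: "What does Braintrust provide?", expected: "Evals and observability." },
  ],
  // The task can be anything from a single LLM call to a full agent; here it is a stub.
  task: async (input: string) => {
    return `Echo: ${input}`;
  },
  scores: [Levenshtein],
});
```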
Braintrust supports both "offline evals" (structured tests on predefined datasets for iteration) and "online evals" (real-time tracing and scoring of live production traffic for monitoring and feedback).
The Braintrust UI offers "Playgrounds" for quick, ephemeral iteration and "Experiments" for saving snapshots and tracking historical performance metrics like cost and duration.
A workshop demonstrates Braintrust setup, including Node.js version requirements and OpenAI API key configuration.
Users clone a repository that, upon installation, automatically creates prompts, scores, and a dataset within their Braintrust project, linking code to the platform.
The demo showcases features like mustache syntax in prompts for injecting data from datasets, and specialized scores for completeness and formatting.
Users run evals in the Braintrust Playground, allowing parallel execution against the dataset and generating scores for comparison.
The Playground can be used to create "Experiments," which track scores over time, enabling comparison of model performance and cost.
As a bonus, automated prompt optimization is demonstrated, where AI suggests prompt changes based on evaluation results and can even augment datasets.
Braintrust allows adding custom AI models and managing unbounded scores by normalizing them to a 0-1 range.
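A tiny sketch of normalizing an unbounded or differently scaled metric into that 0-1 range, assuming a 1-5 rubric scale for illustration:

```typescript
// Clamp a raw metric to its expected range, then rescale it into 0..1.
function normalizeRubricScore(raw: number, min = 1, max = 5): number {
  const clamped = Math.min(Math.max(raw, min), max);
  return (clamped - min) / (max - min);
}

// e.g., a judge that returns 4 on a 1-5 scale becomes 0.75
const score = normalizeRubricScore(4); // 0.75
```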
The platform supports evaluating multi-turn conversations and agentic workflows, allowing scoring of individual "spans" (intermediate LLM calls) within a complex application trace.
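A rough sketch of tracing an agentic flow so each intermediate step becomes its own span that can be scored individually, based on the traced/log pattern in the Braintrust SDK; the step names, placeholder retrieval, and scores are illustrative assumptions.

```typescript
import { initLogger } from "braintrust";

const logger = initLogger({ projectName: "Workshop Demo" });

async function runAgent(question: string): Promise<string> {
  return logger.traced(
    async (rootSpan) => {
      // Intermediate step 1: retrieval, logged as its own span with its own score.
      const docs = await rootSpan.traced(
        async (span) => {
          const results = ["doc-1", "doc-2"]; // stand-in for a real retrieval call
          span.log({ input: question, output: results, scores: { recall_proxy: 1 } });
          return results;
        },
        { name: "retrieve" }
      );

      // Intermediate step 2: generation, also its own span.
      const answer = await rootSpan.traced(
        async (span) => {
          const output = `Answer drafted from ${docs.length} documents`;
          span.log({ input: { question, docs }, output });
          return output;
        },
        { name: "generate" }
      );

      rootSpan.log({ input: question, output: answer });
      return answer;
    },
    { name: "runAgent" }
  );
}
```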
Braintrust offers Python and TypeScript SDKs, with other language wrappers and a Go SDK in development.
Compared to LangSmith, Braintrust highlights its intuitive UI/UX and superior performance and scale due to its custom-built underlying infrastructure.
Setting up logging in Braintrust measures the quality of live traffic, enabling alerts for performance dips and faster debugging through visibility into LLM calls and function execution.
Logging is simplified by wrapping LLM clients (e.g., OpenAI, Vercel AI SDK) or arbitrary functions, capturing metrics like tokens, latency, and cost.
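A minimal sketch of that client-wrapping pattern with the Braintrust TypeScript SDK and the OpenAI client; the project name and model are placeholders.

```typescript
import OpenAI from "openai";
import { initLogger, wrapOpenAI } from "braintrust";

// Initializing a logger routes logs to the named Braintrust project.
const logger = initLogger({ projectName: "Workshop Demo" });

// Wrapping the OpenAI client makes each completion call emit a log with inputs,
// outputs, token counts, and latency.
const client = wrapOpenAI(new OpenAI());

async function ask(question: string) {
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: question }],
  });
  return response.choices[0].message.content;
}
```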
Online scoring applies pre-configured scores from offline evals to live production traffic, providing real-time quality assurance and enabling filtering of logs to identify underperforming use cases for further iteration.
Braintrust allows A/B testing by tagging logs and comparing online scores, and users can define custom "views" in the UI to quickly filter and sort logs based on relevant criteria.
The demonstration shows spinning up a local application, generating logs in Braintrust, configuring online scoring rules, and viewing scored logs with the ability to drill down into individual spans and add them to datasets for continuous improvement.
Braintrust offers out-of-the-box scorers (the autoevals package) in addition to custom ones.
Human in the loop is crucial for industries where AI mistakes have severe consequences (e.g., healthcare, finance), helping catch hallucinations, establish ground truth, and incorporate user feedback.
Two types of human in the loop are supported:
Human Review: Annotators or subject matter experts manually label datasets or logs, scoring outputs or auditing edge cases, which is vital for establishing ground truth and for evaluating LLM-as-a-judge scorers.
User Feedback: Direct user input (e.g., thumbs up/down, comments) is captured and integrated into the iteration process, influencing prompt changes and data set augmentation.
Braintrust's logFeedback function allows applications to send user feedback (score, comment, metadata) to the platform, where it appears in logs and can be used to create custom views or feed into human review workflows.
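A sketch of what that call might look like with the TypeScript SDK; the span id is assumed to come from the originally logged request, and the score name, comment, and metadata fields are placeholders.

```typescript
import { initLogger } from "braintrust";

const logger = initLogger({ projectName: "Workshop Demo" });

// Send a thumbs-down from the application back to Braintrust, attached to the
// span of the original logged LLM call.
function recordThumbsDown(spanId: string, userComment: string, userId: string) {
  logger.logFeedback({
    id: spanId, // ties the feedback to the logged request
    scores: { user_rating: 0 }, // thumbs down mapped to a 0 score
    comment: userComment,
    metadata: { user_id: userId },
  });
}
```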
Human evaluators access a tailored interface to review logs and apply configured scores (options, sliders, free-form text).
Braintrust supports versioning for prompts and scores, and it is believed to support tool versioning as well.
Remote Evals is a new feature addressing the limitations of the Playground for complex tasks involving intermediate code steps or external systems in a user's VPC.
It allows users to expose their local development environment to the Braintrust Playground, enabling evaluation of complex tasks and scores without pushing all tooling to Braintrust.
This bridges the gap between highly technical teams building complex tasks in the SDK and non-technical teams who want to iterate on prompts and parameters directly in the Playground.
The process involves starting a remote eval server locally (e.g., with braintrust eval --dev); the Braintrust Playground then sends requests to it, the task and score logic run locally, and results are returned to the Playground.
This feature allows for isolating and re-running specific test cases more efficiently, saving costs and accelerating development.
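A minimal sketch of what such a remote eval file might look like, assuming it is served to the Playground with a command along the lines of npx braintrust eval --dev remote.eval.ts; the task body and inline scorer are placeholders standing in for code that talks to systems inside the VPC.

```typescript
// remote.eval.ts
import { Eval } from "braintrust";

Eval("Workshop Demo", {
  data: () => [{ input: "status of project Apollo", expected: "on track" }],
  task: async (input: string) => {
    // In a real remote eval this might hit internal services or databases that
    // cannot be pushed to Braintrust; here it is stubbed out.
    return `Looked up locally: ${input}`;
  },
  scores: [
    // A simple local scorer; both task and score run on the developer's machine.
    ({ output, expected }) => ({
      name: "contains_expected",
      score: expected && output.includes(expected) ? 1 : 0,
    }),
  ],
});
```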