Ambience is an AI assistant for doctors, aimed at helping with note-taking and administrative tasks to save clinicians time and free up more of it for patient care
The main customers are large health systems (e.g., Cleveland Clinic, UCSF), while the end users are clinicians (doctors and nurse practitioners)
Doctors use a mobile app in the exam room, which records and transcribes audio, then uses fine-tuned language models to generate structured and unstructured medical documentation
This documentation is automatically written back into the Electronic Health Record (EHR) systems
Typical usage can save clinicians up to two hours per day
The initial use case is “ambient scribing”—recording patient-doctor conversations and generating notes—but this is only part of the value
Given audio transcripts and EHR access, Ambience is exploring more AI-powered functions: placing orders, coding, outbound patient calls, and other generative use cases
Technology stack includes prompting, chaining, retrieval-augmented generation (RAG), supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT)
RFT is an RL-based method for fine-tuning language models, teaching them reasoned and objective decision-making; it is well suited to STEM tasks such as math, coding, and medicine
Unlike supervised approaches, RFT uses programmable graders (scoring functions) instead of labeled datasets; models learn by maximizing these scores through iterative optimization
Graders can include string match, regex, fuzzy match, unit tests, LLM-based evaluation, or combinations
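These grader types can be combined into a single scoring function. A minimal sketch, assuming illustrative weights and helper names (not Ambience's actual grader):

```python
import re
from difflib import SequenceMatcher

# Hypothetical composite grader: combines exact match, regex match,
# and fuzzy similarity into one score in [0, 1]. Weights are illustrative.

def exact_match(pred: str, ref: str) -> float:
    return 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0

def regex_match(pred: str, pattern: str) -> float:
    return 1.0 if re.search(pattern, pred) else 0.0

def fuzzy_match(pred: str, ref: str) -> float:
    # Similarity ratio between 0 and 1
    return SequenceMatcher(None, pred.lower(), ref.lower()).ratio()

def grade(pred: str, ref: str, pattern: str) -> float:
    # Weighted combination; RFT maximizes this reward signal
    return (0.5 * exact_match(pred, ref)
            + 0.2 * regex_match(pred, pattern)
            + 0.3 * fuzzy_match(pred, ref))

score = grade("Type 2 diabetes mellitus", "type 2 diabetes mellitus", r"diabetes")
```

An LLM-based grader would slot in as another scoring term, at the cost of a model call per sample.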
RFT is attractive in healthcare for its objectivity and sample efficiency (one example yields many training signals)
Caution: Data efficiency can lead to overfitting or “reward hacking” where models game the grader in unintended ways
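The "maximize the grader's score" loop can be illustrated with a toy: sample outputs, score them, and shift the policy toward higher-scoring ones. A real setup updates LLM weights with an RL algorithm; this toy uses a categorical policy over candidate answers, with a hypothetical grader that rewards one ICD-10 code:

```python
import random

# Toy RFT-style loop: sample an answer, score it, reinforce high scorers.
# The candidates and grader are illustrative, not from the talk.

def grader(answer: str) -> float:
    return 1.0 if answer == "E11.9" else 0.0  # reward the "correct" code

candidates = ["E11.9", "I10", "J45.909"]
weights = {c: 1.0 for c in candidates}  # uniform starting policy

random.seed(0)
for _ in range(200):
    # Sample proportional to current weights (the "policy")
    pick = random.choices(candidates, [weights[c] for c in candidates])[0]
    # Reinforce: grow the weight by the grader's reward
    weights[pick] += grader(pick)

best = max(weights, key=weights.get)  # policy concentrates on the rewarded answer
```

The same dynamic explains reward hacking: the loop concentrates on whatever the grader rewards, whether or not that matches the true intent.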
Practical Challenges and Reward Hacking in RFT 08:35
Using LLM graders for long-form generative tasks surfaced issues: models inflated findings to score better on precision and slipped into unprofessional language
Solutions involved updating graders with constraints, weighting both content accuracy and stylistic correctness
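One way to express such a constraint is to multiply a content score by a style score, so stylistic violations cap the reward even when content scores well. A toy sketch; the banned-phrase list and weights are illustrative, not the actual grader:

```python
# Hypothetical constrained grader weighting content accuracy and style,
# to discourage reward hacking via unprofessional language.

BANNED_PHRASES = {"super bad", "tons of", "crazy"}

def style_score(text: str) -> float:
    # Each violation halves the style credit, floored at zero
    violations = sum(phrase in text.lower() for phrase in BANNED_PHRASES)
    return max(0.0, 1.0 - 0.5 * violations)

def constrained_grade(content_score: float, text: str) -> float:
    # Multiplying (rather than adding) means poor style caps the reward
    return content_score * style_score(text)
```

Multiplicative gating is one design choice; additive penalties are another, but they let strong content "buy back" style violations.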
Evaluating Model Performance in ICD-10 Coding 10:45
ICD-10 coding is the process of mapping diagnoses/conditions to one of roughly 70,000 standardized codes, a task doctors often find taxing and error-prone
Ambience ran an evaluation with 18 physicians; scored against expert annotations, clinicians reached an F1 of around 40%
RFT tuning improved a small model’s F1 to approximately 57%, a meaningful gain, though more progress is still needed
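Since each encounter yields a set of expert ("gold") codes and a set of predicted codes, the F1 can be computed by pooling true/false positives across encounters. A sketch assuming micro-averaging (the talk does not specify the averaging scheme); codes are illustrative:

```python
# Micro-averaged F1 over per-encounter ICD-10 code sets.
# Gold and predicted codes below are made-up examples.

def micro_f1(gold_sets, pred_sets):
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # codes both assigned
        fp += len(pred - gold)   # predicted but not in gold
        fn += len(gold - pred)   # gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [{"E11.9", "I10"}, {"J45.909"}]
pred = [{"E11.9", "I10", "Z79.4"}, {"J45.40"}]
f1 = micro_f1(gold, pred)
```

Set-based scoring like this gives no partial credit for near-miss codes (e.g., J45.40 vs J45.909), which is one reason even expert coders score well below 100%.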
There is strong unmet demand for AI in healthcare due to limited prior innovation and regulatory obstacles
Prototypes include patient-facing agents to handle medication reminders and administrative outreach after visits, reducing manual staff work
Other products like “pre-charting” summarize a patient’s medical history for the doctor before a visit, saving time and making consultations more productive
RFT can be costly, especially with expensive graders (e.g., burning $25K on one experiment with only 100 examples)
Observability tools like Braintrust are used for tracking LLM spend, ensuring quality, and supporting on-prem deployment; much custom infrastructure is still needed for monitoring and automation
For evaluation, objective tasks suit automated scripts, but ergonomic tooling for domain expert review (like side-by-side comparisons) is still in early development
Collaboration Between ML Engineers and Domain Experts 17:01
Domain experts excel at debugging and annotating outputs but may lack intuitions about which ML methods to try
Machine learning engineers are needed to guide, automate, and multiply domain expert efforts
Ideal teams combine both groups for effective product development and deployment
Handling Hallucinations and Model Robustness in Medical AI 18:34
Out-of-the-box language models still struggle with reliability and can make unsafe inferences (e.g., “diagnosing” issues doctors haven’t confirmed)
Real-world clinical data is often out of distribution for these models because of privacy restrictions and missing “residency” (real-world) knowledge
Fine-tuning on actual clinical data is key to improvement, but access and data quality are persistent challenges
Data Sharing and the Need for Clinical Taste 21:07
Open-source efforts like HealthBench offer more realistic medical benchmarks, but a gap remains due to the lack of real clinical data
Sharing anonymized data is a potential path forward but conflicts with privacy needs
Clinical “taste” refers to judgment about which sources and data are trustworthy—models need to be trained on more than just textbooks to develop this intuition
Different types of reasoning may not transfer between domains (e.g., math vs. medical reasoning); medical models need domain-specific training
Ambience envisions AI as day-to-day research and practice assistants, similar to coding copilots, running long tasks, surfacing insights, and aiding annotation
There's a strong need for top-tier ML research talent and for clinician-researchers who combine domain expertise with experimental and startup mindsets
The ideal hire is someone at the intersection of startup drive, domain knowledge, and ML skillset