Ambience is an AI assistant for doctors, aimed at helping with note-taking and administrative tasks to save clinicians time and free up more of it for patient care
The main customers are large health systems (e.g., Cleveland Clinic, UCSF), while the end users are clinicians (doctors and nurse practitioners)
Doctors use a mobile app in the exam room, which records and transcribes audio, then uses fine-tuned language models to generate structured and unstructured medical documentation
This documentation is automatically written back into the Electronic Health Record (EHR) systems
Typical usage can save clinicians up to two hours per day
The initial use case is “ambient scribing”—recording patient-doctor conversations and generating notes—but this is only part of the value
Given audio transcripts and EHR access, Ambience is exploring more AI-powered functions: placing orders, coding, outbound patient calls, and other generative use cases
Technology stack includes prompting, chaining, retrieval-augmented generation (RAG), supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT)
RFT is an RL-based method for fine-tuning language models, teaching them reasoned and objective decision-making; it is well suited to STEM tasks such as math, coding, and medicine
Unlike supervised approaches, RFT uses programmable graders (scoring functions) instead of labeled datasets; models learn by maximizing these scores through iterative optimization
Graders can include string match, regex, fuzzy match, unit tests, LLM-based evaluation, or combinations
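These grader types can be combined into a single scoring function. A minimal sketch, assuming illustrative weights and helper names (not Ambience's actual grader):

```python
import re
from difflib import SequenceMatcher

# Hypothetical composite grader: combines exact match, regex match,
# and fuzzy similarity into one score in [0, 1]. Weights are illustrative.

def exact_match(pred: str, ref: str) -> float:
    return 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0

def regex_match(pred: str, pattern: str) -> float:
    return 1.0 if re.search(pattern, pred) else 0.0

def fuzzy_match(pred: str, ref: str) -> float:
    # Similarity ratio between 0 and 1
    return SequenceMatcher(None, pred.lower(), ref.lower()).ratio()

def grade(pred: str, ref: str, pattern: str) -> float:
    # Weighted combination; RFT maximizes this reward signal
    return (0.5 * exact_match(pred, ref)
            + 0.2 * regex_match(pred, pattern)
            + 0.3 * fuzzy_match(pred, ref))

score = grade("Type 2 diabetes mellitus", "type 2 diabetes mellitus", r"diabetes")
```

An LLM-based grader would slot in as another scoring term, at the cost of a model call per sample.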
RFT is attractive in healthcare for its objectivity and sample efficiency (one example yields many training signals)
Caution: Data efficiency can lead to overfitting or “reward hacking” where models game the grader in unintended ways
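The "maximize the grader's score" loop can be illustrated with a toy: sample outputs, score them, and shift the policy toward higher-scoring ones. A real setup updates LLM weights with an RL algorithm; this toy uses a categorical policy over candidate answers, with a hypothetical grader that rewards one ICD-10 code:

```python
import random

# Toy RFT-style loop: sample an answer, score it, reinforce high scorers.
# The candidates and grader are illustrative, not from the talk.

def grader(answer: str) -> float:
    return 1.0 if answer == "E11.9" else 0.0  # reward the "correct" code

candidates = ["E11.9", "I10", "J45.909"]
weights = {c: 1.0 for c in candidates}  # uniform starting policy

random.seed(0)
for _ in range(200):
    # Sample proportional to current weights (the "policy")
    pick = random.choices(candidates, [weights[c] for c in candidates])[0]
    # Reinforce: grow the weight by the grader's reward
    weights[pick] += grader(pick)

best = max(weights, key=weights.get)  # policy concentrates on the rewarded answer
```

The same dynamic explains reward hacking: the loop concentrates on whatever the grader rewards, whether or not that matches the true intent.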
Practical Challenges and Reward Hacking in RFT 08:35
Using LLM graders for long-form generative tasks surfaced issues: models inflated findings to score better on precision and slipped into unprofessional language
Solutions involved updating graders with constraints, weighting both content accuracy and stylistic correctness
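One way to express such a constraint is to multiply a content score by a style score, so stylistic violations cap the reward even when content scores well. A toy sketch; the banned-phrase list and weights are illustrative, not the actual grader:

```python
# Hypothetical constrained grader weighting content accuracy and style,
# to discourage reward hacking via unprofessional language.

BANNED_PHRASES = {"super bad", "tons of", "crazy"}

def style_score(text: str) -> float:
    # Each violation halves the style credit, floored at zero
    violations = sum(phrase in text.lower() for phrase in BANNED_PHRASES)
    return max(0.0, 1.0 - 0.5 * violations)

def constrained_grade(content_score: float, text: str) -> float:
    # Multiplying (rather than adding) means poor style caps the reward
    return content_score * style_score(text)
```

Multiplicative gating is one design choice; additive penalties are another, but they let strong content "buy back" style violations.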
Evaluating Model Performance in ICD-10 Coding 10:45
ICD-10 coding is the process of mapping diagnoses/conditions to one of roughly 70,000 standardized codes, a task doctors often find taxing and error-prone
Ambience ran an evaluation with 18 physicians; scored against expert annotations, clinicians reached an F1 of around 40%
RFT tuning improved a small model’s F1 to approximately 57%, a meaningful gain, though more progress is still needed
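Since each encounter yields a set of expert ("gold") codes and a set of predicted codes, the F1 can be computed by pooling true/false positives across encounters. A sketch assuming micro-averaging (the talk does not specify the averaging scheme); codes are illustrative:

```python
# Micro-averaged F1 over per-encounter ICD-10 code sets.
# Gold and predicted codes below are made-up examples.

def micro_f1(gold_sets, pred_sets):
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # codes both assigned
        fp += len(pred - gold)   # predicted but not in gold
        fn += len(gold - pred)   # gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [{"E11.9", "I10"}, {"J45.909"}]
pred = [{"E11.9", "I10", "Z79.4"}, {"J45.40"}]
f1 = micro_f1(gold, pred)
```

Set-based scoring like this gives no partial credit for near-miss codes (e.g., J45.40 vs J45.909), which is one reason even expert coders score well below 100%.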
There is strong unmet demand for AI in healthcare due to limited prior innovation and regulatory obstacles
Prototypes include patient-facing agents to handle medication reminders and administrative outreach after visits, reducing manual staff work
Other products like “pre-charting” summarize a patient’s medical history for the doctor before a visit, saving time and making consultations more productive
RFT can be costly, especially with expensive graders (e.g., burning $25K on one experiment with only 100 examples)
Observability tools like Braintrust are used for tracking LLM spend, ensuring quality, and supporting on-prem deployment; much custom infrastructure is still needed for monitoring and automation
For evaluation, objective tasks suit automated scripts, but ergonomic tooling for domain expert review (like side-by-side comparisons) is still in early development
Collaboration Between ML Engineers and Domain Experts 17:01
Domain experts excel at debugging and annotating outputs but may lack intuitions about which ML methods to try
Machine learning engineers are needed to guide, automate, and multiply domain expert efforts
Ideal teams combine both groups for effective product development and deployment
Handling Hallucinations and Model Robustness in Medical AI 18:34
Out-of-the-box language models still struggle with reliability and can make unsafe inferences (e.g., “diagnosing” issues doctors haven’t confirmed)
Real-world clinical data is often out of distribution for these models because of privacy restrictions and missing “residency” (real-world) knowledge
Fine-tuning on actual clinical data is key to improvement, but access and data quality are persistent challenges
Data Sharing and the Need for Clinical Taste 21:07
Open-source efforts like HealthBench offer more realistic medical benchmarks, but a gap remains due to the lack of real clinical data
Sharing anonymized data is a potential path forward but conflicts with privacy needs
Clinical “taste” refers to judgment about which sources and data are trustworthy—models need to be trained on more than just textbooks to develop this intuition
Different types of reasoning may not transfer between domains (e.g., math vs. medical reasoning); medical models need domain-specific training
Ambience envisions AI as day-to-day research and practice assistants, similar to coding copilots, running long tasks, surfacing insights, and aiding annotation
There's a strong need for top-tier ML research talent and for clinician-researchers who combine domain expertise with experimental and startup mindsets
The ideal hire is someone at the intersection of startup drive, domain knowledge, and ML skillset