Using RFT to Build Clinical Superintelligence

Introduction and Ambience Overview 00:03

  • Ambience is an AI assistant for doctors that handles note-taking and administrative tasks, saving clinicians time so they can spend more of it on patient care
  • The main customers are large health systems (e.g., Cleveland Clinic, UCSF), while the end users are clinicians (doctors and nurse practitioners)
  • Doctors use a mobile app in the exam room, which records and transcribes audio, then uses fine-tuned language models to generate structured and unstructured medical documentation
  • This documentation is automatically written back into the Electronic Health Record (EHR) systems
  • Typical usage can save clinicians up to two hours per day

EHR Integration and Technical Background 01:48

  • EHRs are essential but fragmented, with major players like Epic (over 50% market share), Cerner (about 27%), athenahealth, and MEDITECH
  • Ambience integrates with these EHRs using open and private APIs for seamless clinician experiences
  • Brendan Fortuna’s background includes work on self-driving technology at Cruise, bringing ML and data engine concepts to healthcare AI

ML Evolution and Data Challenges in Healthcare 03:04

  • Recent years have seen a shift from manual dataset labeling and small models to massive general models and more efficient fine-tuning
  • Core ML principles remain data-driven: understanding data distribution, annotation quality, and success metrics
  • Data is especially challenging in healthcare due to privacy, regulatory, and quality issues

Ambient Scribing and Expanding AI Use Cases 04:08

  • The initial use case is “ambient scribing”—recording patient-doctor conversations and generating notes—but this is only part of the value
  • Given audio transcripts and EHR access, Ambience is exploring more AI-powered functions: placing orders, coding, outbound patient calls, and other generative use cases
  • Technology stack includes prompting, chaining, retrieval-augmented generation (RAG), supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT)

Reinforcement Fine-Tuning (RFT) Approach 05:11

  • RFT is an RL-based method for fine-tuning language models that teaches reasoned, objective decision-making, making it well suited to STEM, math, coding, and medical tasks
  • Unlike supervised approaches, RFT uses programmable graders (scoring functions) instead of labeled datasets; models learn by maximizing these scores through iterative optimization
  • Graders can include string match, regex, fuzzy match, unit tests, LLM-based evaluation, or combinations of these (a sketch follows this list)
  • RFT is attractive in healthcare for its objectivity and sample efficiency (one example yields many training signals)
  • Caution: Data efficiency can lead to overfitting or “reward hacking” where models game the grader in unintended ways
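
As an illustration of the grader types listed above, here is a minimal Python sketch of a programmable grader, assuming simple string-based checks; the function names, weights, and the ICD-10-style regex are illustrative, not Ambience's or any vendor's actual grader API.

```python
# Hypothetical sketch of a programmable grader for RFT-style training.
# Grader types (exact match, regex, fuzzy match) mirror those mentioned above;
# names, signatures, and weights are illustrative assumptions.
import re
from difflib import SequenceMatcher


def exact_match_grader(output: str, reference: str) -> float:
    """Return 1.0 only when the model output matches the reference exactly."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def regex_grader(output: str, pattern: str) -> float:
    """Reward outputs that satisfy a required format."""
    return 1.0 if re.fullmatch(pattern, output.strip()) else 0.0


def fuzzy_match_grader(output: str, reference: str) -> float:
    """Give partial credit proportional to string similarity."""
    return SequenceMatcher(None, output.strip(), reference.strip()).ratio()


def combined_grader(output: str, reference: str) -> float:
    """Blend several signals into one scalar reward for the RL loop to maximize."""
    return (
        0.5 * exact_match_grader(output, reference)
        + 0.3 * fuzzy_match_grader(output, reference)
        + 0.2 * regex_grader(output, r"[A-TV-Z]\d{2}(\.\d{1,4})?")  # crude ICD-10-style pattern
    )
```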

Practical Challenges and Reward Hacking in RFT 08:35

  • Using LLM graders for long-form generative tasks surfaced reward hacking: models inflated their findings to score higher and drifted into non-professional language
  • The fix was to update the graders with additional constraints, weighing both content accuracy and stylistic correctness (sketched below)
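
One hedged sketch of how a grader might be tightened against this kind of reward hacking, combining an LLM-judge content score (stubbed as a callable) with programmatic grounding and style checks; the specific phrases, weights, and helper names are assumptions for illustration.

```python
# Illustrative sketch only: weight content accuracy against stylistic constraints so a
# model cannot win by inflating findings or using casual language. The banned phrases,
# weights, and LLM-judge stub are assumptions, not Ambience's actual grader.
from typing import Callable

BANNED_CASUAL_PHRASES = ("super", "a ton of", "basically")


def style_score(note_text: str) -> float:
    """Deduct reward for non-professional language an unconstrained grader let slide."""
    violations = sum(phrase in note_text.lower() for phrase in BANNED_CASUAL_PHRASES)
    return max(0.0, 1.0 - 0.25 * violations)


def grounding_score(note_text: str, transcript: str) -> float:
    """Crude check that each sentence of the note shares vocabulary with the transcript,
    discouraging findings that were never discussed in the visit."""
    transcript_words = set(transcript.lower().split())
    sentences = [s for s in note_text.split(".") if s.strip()]
    if not sentences:
        return 0.0
    grounded = sum(bool(set(s.lower().split()) & transcript_words) for s in sentences)
    return grounded / len(sentences)


def reward(note_text: str, transcript: str,
           llm_content_grader: Callable[[str, str], float]) -> float:
    """Scalar reward: content accuracy dominates; grounding and style act as constraints."""
    content = llm_content_grader(note_text, transcript)  # assumed to return a score in [0, 1]
    return 0.6 * content + 0.2 * grounding_score(note_text, transcript) + 0.2 * style_score(note_text)
```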

Evaluating Model Performance in ICD-10 Coding 10:45

  • ICD-10 coding maps diagnoses and conditions onto roughly 70,000 standardized codes, a task doctors often find taxing and error-prone
  • Ambience ran an evaluation with 18 physicians; scored against expert annotations, the clinicians reached an F1 of around 40%
  • RFT improved a small model’s F1 to approximately 57%, a meaningful gain over the clinician baseline, though more progress is still needed (a scoring sketch follows this list)
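
To make the F1 numbers concrete, a minimal sketch of micro-averaged F1 over predicted versus reference ICD-10 code sets; the talk does not specify the averaging scheme, so micro-averaging here is an assumption.

```python
# Minimal sketch: micro-averaged F1 over ICD-10 code sets, one set per encounter.
# The averaging scheme used in Ambience's evaluation is not stated; this is illustrative.
def micro_f1(predictions: list[set[str]], references: list[set[str]]) -> float:
    """Aggregate true/false positives and false negatives across all encounters."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        tp += len(pred & ref)
        fp += len(pred - ref)
        fn += len(ref - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


# Example: the model predicts two codes for one encounter and gets one of them right.
print(micro_f1([{"E11.9", "I10"}], [{"E11.9", "J45.909"}]))  # 0.5
```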

AI Product Demand and Real-World Use Cases 12:28

  • There is strong unmet demand for AI in healthcare due to limited prior innovation and regulatory obstacles
  • Prototypes include patient-facing agents to handle medication reminders and administrative outreach after visits, reducing manual staff work
  • Other products like “pre-charting” summarize a patient’s medical history for the doctor before a visit, saving time and making consultations more productive

Costs, Observability, and Eval Tooling 14:16

  • RFT can be costly, especially with expensive graders (e.g., burning $25K on one experiment with only 100 examples)
  • Observability tools like Braintrust are used for tracking LLM spend, ensuring quality, and supporting on-prem deployment; much custom infrastructure is still needed for monitoring and automation
  • For evaluation, objective tasks suit automated scripts (a minimal harness is sketched below), but ergonomic tooling for domain-expert review (like side-by-side comparisons) is still in early development
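
As a companion to the point that objective tasks suit automated scripts, a hypothetical minimal eval loop that runs any grader over a batch of cases and reports the mean score plus a rough cost estimate; it is generic and not tied to Braintrust or Ambience's internal tooling.

```python
# Hypothetical minimal eval harness for objective tasks: score a batch of cases with a
# grader and track aggregate quality and rough spend. Generic sketch, illustrative only.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    model_output: str
    reference: str


def run_eval(cases: list[EvalCase],
             grader: Callable[[str, str], float],
             cost_per_case_usd: float = 0.0) -> dict:
    """Score every case, then report case count, mean score, and estimated grader spend."""
    scores = [grader(c.model_output, c.reference) for c in cases]
    return {
        "n_cases": len(cases),
        "mean_score": sum(scores) / len(scores) if scores else 0.0,
        "estimated_cost_usd": cost_per_case_usd * len(cases),
    }
```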

Collaboration Between ML Engineers and Domain Experts 17:01

  • Domain experts excel at debugging and annotating outputs but may lack intuitions about which ML methods to try
  • Machine learning engineers are needed to guide, automate, and multiply domain expert efforts
  • Ideal teams combine both groups for effective product development and deployment

Handling Hallucinations and Model Robustness in Medical AI 18:34

  • Out-of-the-box language models still struggle with reliability and can make unsafe inferences (e.g., “diagnosing” issues doctors haven’t confirmed)
  • Real-world clinical data is often out of distribution for these models, partly because privacy restrictions keep it out of training corpora and partly because models lack the “residency” knowledge clinicians gain in practice
  • Fine-tuning on actual clinical data is key to improvement, but access and data quality are persistent challenges

Data Sharing and the Need for Clinical Taste 21:07

  • Open-source efforts like HealthBench offer more realistic medical benchmarks, but a gap remains because real clinical data is scarce
  • Sharing anonymized data is a potential path forward but conflicts with privacy needs
  • Clinical “taste” refers to judgment about which sources and data are trustworthy; models need more than textbook training to develop this intuition

Specialization and Future of Clinical AI 23:59

  • Different types of reasoning may not transfer between domains (e.g., math vs. medical reasoning); medical models need domain-specific training
  • Ambience envisions AI as day-to-day research and practice assistants, similar to coding copilots, running long tasks, surfacing insights, and aiding annotation

Hiring and Team Structure 25:34

  • There's a strong need for top-tier ML research talent and for clinician-researchers who combine domain expertise with experimental and startup mindsets
  • The ideal hire is someone at the intersection of startup drive, domain knowledge, and ML skillset