How Intuit uses LLMs to explain taxes to millions of taxpayers - Jaspreet Singh, Intuit

Introduction and Scale 00:03

  • Jaspreet Singh introduces himself as a Senior Staff Engineer at Intuit working on generative AI for TurboTax
  • TurboTax processed 44 million tax returns for tax year 2023, highlighting the scale of their operations
  • Key goal is to ensure users have high confidence in their tax filings and understand deductions and outcomes

Intuit's Generative AI Platform (GENOS) and Architecture 01:40

  • Intuit built its own generative AI operating system platform, GENOS, designed for scalability, safety, and regulatory compliance
  • GENOS includes components like GenUX (UI), Orchestrator, and multiple LLM (large language model) solutions
  • Experience is delivered as "Intuit Assist," which powers question answering and guidance in TurboTax

LLM Usage and Query Types 03:04

  • First iteration focused on prompt-based solutions for explaining tax outcomes (e.g., tax refunds)
  • Uses Anthropic’s Claude in production under a multi-million-dollar contract; OpenAI models are used for other, dynamic Q&A use cases
  • Differentiates between static queries (prepared, summary-type prompts) and dynamic queries (user-generated questions about tax scenarios)
  • Incorporates RAG (Retrieval-Augmented Generation), GraphRAG, and fine-tuned LLMs for improved accuracy and personalization
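The split between static and dynamic queries can be sketched roughly as follows. This is a minimal illustration, not Intuit's implementation: the template text, the function names, and the word-overlap retriever are all assumptions made for the example, and a production RAG system would use a real index and embedding search.

```python
# Hypothetical sketch: static queries fill a prepared template,
# dynamic queries retrieve context for a user-written question.

STATIC_TEMPLATE = (
    "Explain to the taxpayer why their refund is {refund}. "
    "Use only these facts: {facts}"
)


def build_static_prompt(refund: str, facts: list[str]) -> str:
    """Static query: a prepared, summary-type prompt filled with known values."""
    return STATIC_TEMPLATE.format(refund=refund, facts="; ".join(facts))


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank passages by the number of words shared with the question."""
    q_tokens = set(question.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(q_tokens & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_dynamic_prompt(question: str, passages: list[str]) -> str:
    """Dynamic query: ground a user-generated question in retrieved passages."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Placeholder passages, for illustration only.
PASSAGES = [
    "Tuition payments may qualify for education credits.",
    "The standard deduction depends on filing status.",
    "Home office expenses require exclusive use.",
]
```

Calling build_dynamic_prompt("can I deduct tuition payments", PASSAGES) places the tuition passage first in the context, while build_static_prompt needs no retrieval at all because its inputs are known in advance.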

Evaluation Process and Role of Human Experts 05:54

  • Human tax analysts play a key role in prompt engineering, keeping models up-to-date with IRS changes and accuracy requirements
  • Evaluation follows a phased approach: starts with manual/human expert evaluation, then progresses to automated (LLM-as-a-judge) evaluations
  • Automated evaluation and monitoring tools are built in-house; AWS Ground Truth is used for manual sample data sets
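An automated LLM-as-a-judge pass might look something like the sketch below. The talk does not describe Intuit's in-house tooling, so every name here is hypothetical, and the judge is a stub that checks for required phrases; in practice that function would send the question, answer, and rubric to a judge model and parse its verdict.

```python
from typing import Callable, NamedTuple


class Sample(NamedTuple):
    question: str
    answer: str


# A judge takes a sample plus a rubric and returns pass/fail.
Judge = Callable[[Sample, list[str]], bool]


def stub_judge(sample: Sample, required_phrases: list[str]) -> bool:
    """Stand-in for an LLM judge (assumption): passes if every phrase appears."""
    answer = sample.answer.lower()
    return all(phrase.lower() in answer for phrase in required_phrases)


def pass_rate(samples: list[Sample], judge: Judge, rubric: list[str]) -> float:
    """Fraction of samples the judge accepts."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(judge(s, rubric) for s in samples) / len(samples)


SAMPLES = [
    Sample("Why did my refund change?", "You took the standard deduction this year."),
    Sample("Why did my refund change?", "Your refund changed for several reasons."),
]
```

With the stub judge and the rubric ["standard deduction"], pass_rate(SAMPLES, stub_judge, ["standard deduction"]) accepts only the first sample. The manually labeled sample sets mentioned above would serve as the baseline against which such a judge is itself checked.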

Fine-Tuning, Model Selection, and Data Security 07:22

  • Fine-tuning of models (Claude 3 Haiku on AWS Bedrock) helps reduce prompt size, improve quality, and manage latency
  • Switches between Anthropic models (from Claude Instant to Haiku) are based on evaluation outcomes and require clear testing procedures
  • User data used for model training is strictly consented, adhering to relevant regulations
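One way to make "switch models based on evaluation outcomes" concrete is a promotion gate that compares a candidate model's eval scores with the current baseline. The metric names, scores, and tolerance below are invented for illustration; the talk does not specify the actual criteria.

```python
def should_promote(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> bool:
    """Promote only if the candidate is within `tolerance` of the baseline
    on every baseline metric. A metric missing from the candidate counts as 0."""
    return all(
        candidate.get(metric, 0.0) >= score - tolerance
        for metric, score in baseline.items()
    )


# Hypothetical eval results for the current and candidate models.
BASELINE = {"accuracy": 0.92, "tone": 0.88}
CANDIDATE_OK = {"accuracy": 0.95, "tone": 0.875}
CANDIDATE_REGRESSED = {"accuracy": 0.95, "tone": 0.80}
```

Here should_promote(BASELINE, CANDIDATE_OK) passes because the small drop in tone is inside the tolerance, while CANDIDATE_REGRESSED is rejected despite its higher accuracy.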

Core Learnings and Operational Challenges 10:09

  • Vendor contracts are expensive and can lead to vendor/model lock-in, making upgrades challenging even within the same vendor
  • LLM latency is a challenge: responses take several seconds, and complexity grows with the amount of detailed tax information a user has, especially near tax deadlines
  • Product and fallback designs are implemented to ensure a seamless and user-friendly experience even under high latency
  • Rigorous evaluation (eval) processes are critical for launching and maintaining quality and regulatory compliance
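A fallback design for slow responses can be sketched with a timeout around the model call. The fallback text, the timeout values, and the stub model coroutines are assumptions for the example; the talk does not say how Intuit's fallbacks are implemented.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical message shown when the model does not answer in time.
FALLBACK = "We're still preparing your explanation. Here is your summary in the meantime."


async def explain_with_fallback(
    generate: Callable[[str], Awaitable[str]],
    prompt: str,
    timeout_s: float,
) -> str:
    """Return the model's answer, or a fallback if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(generate(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the pending call before raising.
        return FALLBACK


async def slow_model(prompt: str) -> str:
    """Stub standing in for a high-latency LLM call."""
    await asyncio.sleep(0.3)
    return f"explanation for: {prompt}"


async def fast_model(prompt: str) -> str:
    """Stub standing in for a low-latency LLM call."""
    return f"explanation for: {prompt}"
```

For instance, asyncio.run(explain_with_fallback(slow_model, "refund", 0.05)) yields the fallback message, while the same call with fast_model returns the generated explanation.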

Q&A: Evaluation Methods and Workflow Integration 12:20

  • Evaluation types vary by development phase: manual evaluation by tax experts for initial baselining, automated evaluation (LLM-as-a-judge) for ongoing prompt tweaks, and manual review again for major changes
  • User LLM interactions include product questions (how to use TurboTax features) and tax-specific questions (e.g., claiming tuition payments)
  • Intuit uses different system components to route/plan the proper solution for each question type
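Routing by question type can be illustrated with a small dispatcher. The keyword classifier below is a deliberately simple stand-in (an assumption, not the component Intuit uses), and the handler names and hint list are hypothetical.

```python
from typing import Callable

# Hypothetical hints that a question is about using the product.
PRODUCT_HINTS = ("turbotax", "upload", "import", "log in", "screen")


def classify(question: str) -> str:
    """Toy classifier: 'product' if any product hint appears, else 'tax'."""
    lowered = question.lower()
    return "product" if any(hint in lowered for hint in PRODUCT_HINTS) else "tax"


def answer_product_question(question: str) -> str:
    """Placeholder for the product-help path."""
    return f"[product help] {question}"


def answer_tax_question(question: str) -> str:
    """Placeholder for the tax-guidance path."""
    return f"[tax guidance] {question}"


HANDLERS: dict[str, Callable[[str], str]] = {
    "product": answer_product_question,
    "tax": answer_tax_question,
}


def route(question: str) -> str:
    """Send the question to the handler for its type."""
    return HANDLERS[classify(question)](question)
```

A question such as "How do I upload a form in TurboTax?" goes to the product handler, and "Can I claim tuition payments?" goes to the tax handler.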

Q&A: Data Integrity, Personalization, and Safety 15:12

  • All numeric tax data comes from Intuit’s proprietary tax engine; LLMs do not perform calculations, ensuring ground-truth accuracy
  • Security systems and guardrails prevent hallucinated numbers from being included in user explanations
  • ML models are used to verify the accuracy of numbers in the final user-facing output
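The guardrail idea, that every number shown to the user must trace back to the tax engine, can be sketched as a set comparison. The talk says ML models perform this verification; the version below substitutes a plain regular expression to keep the example small, and the function names and sample values are hypothetical.

```python
import re
from decimal import Decimal

# Matches amounts such as "$1,200", "500", or "1,350.75".
NUMBER_RE = re.compile(r"\$?\d[\d,]*(?:\.\d+)?")


def extract_amounts(text: str) -> set[Decimal]:
    """Collect every numeric amount mentioned in the text."""
    return {
        Decimal(match.replace("$", "").replace(",", ""))
        for match in NUMBER_RE.findall(text)
    }


def unverified_amounts(text: str, engine_values: list[str]) -> set[Decimal]:
    """Return amounts in the text that the tax engine did not produce.

    Note: any number counts, so years or form numbers in the text must
    also be supplied in engine_values or they will be flagged.
    """
    allowed = {Decimal(v) for v in engine_values}
    return extract_amounts(text) - allowed


ENGINE_VALUES = ["1200.00", "500"]
GOOD = "Your refund is $1,200 because of a $500 credit."
BAD = "Your refund is $1,350 because of a $500 credit."
```

An explanation passes only when unverified_amounts returns an empty set: GOOD passes, while BAD is flagged for the amount 1350, which the engine never produced.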

Q&A: RAG Approaches, Personalization, and Future Models 16:42

  • Hybrid use of standard RAG and GraphRAG, with GraphRAG providing better, more personalized answers for users
  • New LLMs and custom in-house models are under ongoing evaluation; adoption decisions have not yet been made

Q&A: Legal, Privacy, and Explanation Traceability 17:53

  • All complex tax answers are based on data from the tax engine, with prompts crafted and tested by tax experts
  • Legal and privacy controls are strictly enforced to prevent regulatory errors or legal issues
  • Explanations are constructed using validated systems to ensure traceability and correctness