SUMMARY

Jaspreet Singh introduces himself as a Senior Staff Engineer at Intuit working on generative AI for TurboTax
TurboTax processed 44 million tax returns for tax year 2023, highlighting the scale of their operations
Key goal is to ensure users have high confidence in their tax filings and understand deductions and outcomes

Intuit built its own generative AI operating system, GenOS, designed for scalability, safety, and regulatory compliance
GenOS includes components such as GenUX (UI), an Orchestrator, and multiple LLM (large language model) solutions
Experience is delivered as "Intuit Assist," which powers question answering and guidance in TurboTax

First iteration focused on prompt-based solutions for explaining tax outcomes (e.g., tax refunds)
Production uses Anthropic’s Claude under a multi-million-dollar contract; OpenAI models handle other dynamic Q&A use cases
Differentiates between static queries (prepared, summary-type prompts) and dynamic queries (user-generated questions about tax scenarios)
Incorporates RAG (Retrieval-Augmented Generation), GraphRAG, and fine-tuned LLMs for improved accuracy and personalization
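
The static/dynamic split above can be illustrated with a minimal sketch. It assumes a static query is a prepared template filled with values from the tax engine, and a dynamic query is a user question augmented with retrieved context. The function names and the word-overlap retriever are hypothetical stand-ins, not Intuit's implementation; a production retriever would use embeddings or a graph index.

```python
from typing import Dict, List


def build_static_prompt(engine_facts: Dict[str, object]) -> str:
    """Static query: a prepared template filled with tax-engine values."""
    lines = [f"- {name}: {value}" for name, value in engine_facts.items()]
    return "Explain this tax outcome using only the facts below.\n" + "\n".join(lines)


def retrieve(question: str, documents: List[str], k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank documents by words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_dynamic_prompt(question: str, documents: List[str], k: int = 2) -> str:
    """Dynamic query: the user's question plus the top-k retrieved passages."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The two builders would feed the same model; only the way the prompt is assembled differs.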

Human tax analysts play a key role in prompt engineering, keeping models up-to-date with IRS changes and accuracy requirements
Evaluation follows a phased approach: starts with manual/human expert evaluation, then progresses to automated (LLM-as-a-judge) evaluations
Automated evaluation and monitoring tools are built in-house; AWS Ground Truth is used for manual sample data sets
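
A rough sketch of the automated phase: each case pairs a model answer with an expert-written reference, a judge scores it, and low scores are sent back for human review. The `keyword_overlap_judge` below is a hypothetical stand-in for a real LLM-as-a-judge call, and the 0.7 threshold is an arbitrary assumption; the talk does not describe the in-house tooling at this level of detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    question: str
    answer: str      # model output under test
    reference: str   # expert-written reference answer


# A judge maps (question, answer, reference) to a score in [0, 1].
Judge = Callable[[str, str, str], float]


def keyword_overlap_judge(question: str, answer: str, reference: str) -> float:
    """Stand-in for an LLM judge: fraction of reference words found in the answer."""
    ref_words = set(reference.lower().split())
    if not ref_words:
        return 0.0
    ans_words = set(answer.lower().split())
    return len(ref_words & ans_words) / len(ref_words)


def run_eval(cases: List[EvalCase], judge: Judge, pass_threshold: float = 0.7) -> Dict:
    """Score every case; collect those below the threshold for manual review."""
    scores = [judge(c.question, c.answer, c.reference) for c in cases]
    failures = [c for c, s in zip(cases, scores) if s < pass_threshold]
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"mean_score": mean, "needs_human_review": failures}
```

Swapping in a real judge only requires a function with the same three-argument signature.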

Fine-tuning of models (Claude 3 Haiku on AWS Bedrock) helps reduce prompt size, improve quality, and manage latency
Switches between Anthropic models (from Claude Instant to Haiku) are based on evaluation outcomes and require clear testing procedures
User data used for model training is strictly consented, adhering to relevant regulations

Vendor contracts are expensive and can lead to vendor/model lock-in, making upgrades challenging even within the same vendor
LLM latency is a challenge: responses take several seconds, and complexity grows with the detail in a user’s tax information, especially near tax deadlines
Product and fallback designs are implemented to ensure a seamless and user-friendly experience even under high latency
Rigorous evaluation (eval) processes are critical for launching and maintaining quality and regulatory compliance
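
One way to picture a fallback design under high latency is a time budget around the model call, with a prepared message returned when the budget is exceeded. This is a minimal sketch under that assumption; `generate`, the 5-second default, and the fallback text are all hypothetical, and the talk does not specify how Intuit's fallbacks are built.

```python
import concurrent.futures
from typing import Callable, Tuple

FALLBACK_MESSAGE = "We're still preparing your explanation. Please check back shortly."


def explain_with_fallback(
    generate: Callable[[str], str],
    prompt: str,
    timeout_s: float = 5.0,
    fallback: str = FALLBACK_MESSAGE,
) -> Tuple[str, bool]:
    """Return (text, used_fallback). Falls back on timeout or model error."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(generate, prompt)
    try:
        return future.result(timeout=timeout_s), False
    except concurrent.futures.TimeoutError:
        # The call exceeded its latency budget; show the prepared message.
        return fallback, True
    except Exception:
        # The model call itself failed; degrade the same way.
        return fallback, True
    finally:
        # Do not block the caller on a slow call still running in the worker.
        pool.shutdown(wait=False)
```

Returning the `used_fallback` flag lets the caller log how often the budget is missed, which matters most near filing deadlines.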

Evaluation types vary by development phase: manual evaluation by tax experts for initial baselining, automated evaluation (LLM-as-a-judge) for ongoing prompt tweaks, and a return to manual review for major changes
User interactions with the LLM include product questions (how to use TurboTax features) and tax-specific questions (e.g., claiming tuition payments)
Intuit uses different system components to route/plan the proper solution for each question type
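
The routing idea can be sketched as a classifier followed by a dispatch table. The keyword list below is a deliberately naive assumption standing in for whatever planner or model does the real classification, and the handler names are hypothetical.

```python
from enum import Enum
from typing import Callable, Dict


class QuestionType(Enum):
    PRODUCT = "product"   # how to use TurboTax features
    TAX = "tax"           # questions about the user's tax situation


# Hypothetical hints; a production router would likely use a trained classifier.
PRODUCT_HINTS = ("turbotax", "import", "upload", "screen", "button", "log in")


def classify(question: str) -> QuestionType:
    """Label a question as PRODUCT if it mentions a product hint, else TAX."""
    q = question.lower()
    if any(hint in q for hint in PRODUCT_HINTS):
        return QuestionType.PRODUCT
    return QuestionType.TAX


def route(question: str, handlers: Dict[QuestionType, Callable[[str], str]]) -> str:
    """Send the question to the handler registered for its type."""
    return handlers[classify(question)](question)
```

Keeping classification separate from dispatch means each question type can be backed by a different solution (for example, help-content search versus a tax-engine-grounded explanation).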

All numeric tax data comes from Intuit’s proprietary tax engine; LLMs do not perform calculations, ensuring ground-truth accuracy
Security systems and guardrails prevent hallucinated numbers from being included in user explanations
ML models are used to verify the accuracy of numbers in the final user-facing output
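
A simple rule-based version of the number check: extract every amount from the generated explanation and flag any that the tax engine did not supply. The talk says ML models do this verification; the regex approach here is a swapped-in simplification, and the function names are hypothetical.

```python
import re
from decimal import Decimal
from typing import Iterable, Set

# Matches amounts such as 1200, 1,200, $1,200 or 1200.50.
NUMBER_RE = re.compile(r"\$?\d[\d,]*(?:\.\d+)?")


def extract_amounts(text: str) -> Set[Decimal]:
    """Collect every numeric amount in the text as a Decimal."""
    amounts = set()
    for match in NUMBER_RE.findall(text):
        amounts.add(Decimal(match.replace("$", "").replace(",", "")))
    return amounts


def unverified_amounts(explanation: str, engine_values: Iterable) -> Set[Decimal]:
    """Return amounts in the explanation that the tax engine did not provide."""
    allowed = {Decimal(str(v)) for v in engine_values}
    return extract_amounts(explanation) - allowed
```

An empty result means every number in the explanation traces back to the engine. Note that this naive pattern also picks up years and form numbers, so a real check would need to allow or ignore those.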

Hybrid use of standard RAG and GraphRAG, with GraphRAG providing better, more personalized answers for users
New LLMs and custom in-house models are under ongoing evaluation; adoption decisions have not yet been made

All complex tax answers are based on data from the tax engine, with prompts crafted and tested by tax experts
Legal and privacy controls are strictly enforced to prevent regulatory errors or legal issues
Explanations are constructed using validated systems to ensure traceability and correctness

How Intuit uses LLMs to explain taxes to millions of taxpayers - Jaspreet Singh, Intuit