⚡️Mercury: Ultra-Fast Diffusion LLMs — Stefano Ermon, CEO Inception Labs

Origins and Development of Diffusion Language Models 00:03

  • Stefano Ermon has been researching generative models since 2014-2015, initially focusing on image generation using GANs.
  • Dissatisfaction with GANs led to early work on diffusion models, which generate images by iterative refinement rather than in one shot.
  • Ermon’s academic work contributed to foundational diffusion models for image generation, influencing industry adoption for images and video.
  • Attempts to use diffusion models for discrete data (text, code, DNA) began around 2020-2021; this was challenging but ultimately successful at small scales.
  • Inception Labs formed in summer 2023 to develop large-scale, commercial diffusion language models, resulting in Mercury and Mercury Coder offerings.

How Diffusion Models Work for Language 04:03

  • Standard language models are autoregressive, predicting the next token one at a time; diffusion models refine full sequences over multiple steps, modifying several tokens at once.
  • Diffusion models for text are trained by adding noise to sequences (e.g., masking or flipping tokens) and teaching the model to denoise and reconstruct the original (see the sketch after this list).
  • Parallel editing allows diffusion models to generate responses much faster than traditional autoregressive models.
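
The noising-and-denoising recipe above can be made concrete with a short sketch. This is a generic masked-diffusion training step in PyTorch, not Mercury's actual code; `model`, `MASK_ID`, and the uniform noise schedule are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of a special [MASK] token

def diffusion_training_step(model, tokens):
    """One denoising step for a masked diffusion LM (illustrative).

    tokens: LongTensor (batch, seq_len) of clean token ids.
    model:  any bidirectional transformer mapping token ids to logits
            of shape (batch, seq_len, vocab_size).
    """
    batch, seq_len = tokens.shape

    # 1. Sample a noise level t per example: the fraction of positions
    #    to corrupt (the "time" of the diffusion process).
    t = torch.rand(batch, 1)

    # 2. Forward (noising) process: mask each position with probability t.
    corrupt = torch.rand(batch, seq_len) < t
    noised = torch.where(corrupt, torch.full_like(tokens, MASK_ID), tokens)

    # 3. Reverse (denoising) process: predict the original tokens at the
    #    corrupted positions given the full bidirectional context.
    logits = model(noised)                   # (batch, seq_len, vocab)
    loss = F.cross_entropy(logits[corrupt],  # predictions where masked
                           tokens[corrupt])  # original tokens there
    loss.backward()
    return loss
```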

Training and Transferability 06:52

  • Mercury diffusion language models use transformer architectures.
  • It's difficult to adapt existing pre-trained autoregressive LLMs because the training objectives differ radically; diffusion models are non-causal and condition on the full context.
  • Some elements, such as the architecture and training data, can be reused, but the losses and objectives require innovation.
  • Training leverages standard datasets but involves novel losses and objectives distinct from the autoregressive next-token loss (one common form is sketched below).
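
To make "novel losses" concrete, here is one common objective from the masked-diffusion literature; the talk does not specify Mercury's exact loss, so treat this as an illustrative form rather than Inception's recipe. With $x_0$ the clean sequence and $x_t$ a corrupted copy in which each token is independently replaced by [MASK] with probability $t$:

```latex
\mathcal{L}(\theta) \;=\;
\mathbb{E}_{t \sim \mathcal{U}(0,1)}\,
\mathbb{E}_{x_t \sim q(\cdot \mid x_0,\, t)}
\Bigg[\, w(t) \sum_{i \,:\, x_t^i = \texttt{[MASK]}}
  -\log p_\theta\!\big(x_0^i \,\big|\, x_t\big) \Bigg]
```

Here $w(t)$ is a noise-level weighting (some formulations use $1/t$). Unlike the autoregressive loss $\sum_i -\log p_\theta(x_0^i \mid x_0^{<i})$, the model conditions on the whole corrupted sequence in both directions, which is why pretrained causal weights transfer poorly.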

Why Diffusion LLMs Are Gaining Traction Now 09:30

  • New mathematical insights and the development of score-matching objectives for discrete data enabled recent breakthroughs.
  • Proof-of-concept implementations spurred further work and rapid progress in the field, with industry players like Google pursuing similar concepts.
  • The technology is now proving competitive in benchmarks and use cases such as infilling and code completion.

Fine-Tuning, Alignment, and Post-Training 12:39

  • As with autoregressive LLMs, diffusion LLMs can undergo supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).
  • The DPO (Direct Preference Optimization) algorithm, originally developed for autoregressive models, has been adapted to diffusion models (a sketch follows this list).
  • Pipelines for customer fine-tuning on proprietary data are compatible and straightforward.
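
For reference, the standard DPO loss is sketched below in PyTorch. It compares how strongly the trainable policy prefers the winning response over the losing one, relative to a frozen reference model. For diffusion models the sequence log-likelihoods are not exact and would be approximated (e.g., via a denoising ELBO); the talk does not detail Inception's adaptation, so the `*_logp_*` inputs are hypothetical stand-ins:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor of sequence log-likelihoods log p(y | x):
      *_w for the human-preferred ("winning") responses,
      *_l for the rejected ("losing") ones,
    under the trainable policy and a frozen reference model.
    """
    # How much more the policy (vs. the reference) favors the winner.
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    # Logistic loss that pushes the margin up, scaled by beta.
    return -F.logsigmoid(beta * margin).mean()
```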

Inference Efficiency and Deployment 14:12

  • Inference with diffusion language models is significantly more efficient than with autoregressive LLMs: higher throughput at the same latency, or lower latency at the same throughput.
  • This results in lower compute costs and faster response times, making diffusion LLMs attractive for latency-sensitive applications.
  • Diffusion LLMs can enable larger, higher-quality models to fit real-time use cases that previously required smaller, faster (but less capable) autoregressive models (see the back-of-the-envelope comparison below).
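
A back-of-the-envelope comparison of the claim above, with made-up numbers rather than Mercury benchmarks: an autoregressive model needs one forward pass per generated token, while a diffusion model runs a fixed number of full-sequence refinement passes, so its pass count does not grow with output length.

```python
# Illustrative latency arithmetic; every number here is hypothetical.
out_tokens = 512    # length of the generated response
pass_ms = 20        # assumed time for one forward pass of the model

# Autoregressive decoding: one forward pass per emitted token.
ar_ms = out_tokens * pass_ms        # 512 * 20 = 10_240 ms

# Diffusion decoding: a fixed number of refinement passes, each one
# updating many tokens of the sequence in parallel.
refine_steps = 32
diff_ms = refine_steps * pass_ms    # 32 * 20 = 640 ms

print(f"AR: {ar_ms} ms   diffusion: {diff_ms} ms   "
      f"({ar_ms / diff_ms:.0f}x fewer passes)")
```

In practice a full-sequence diffusion pass costs more than one KV-cached autoregressive step, so realized speedups (the 5-10x cited in the next section) are smaller than this naive 16x pass-count ratio.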

Use Cases and Performance Comparison 16:34

  • Mercury Coder’s intelligence score is comparable to other small, speed-optimized closed-source models, but with 5–10x speed improvements.
  • Generalist Mercury models also approach the quality of established models like GPT-4.1 Nano but are much faster.
  • Diffusion LLMs are particularly strong in latency-sensitive settings: real-time voice agents, live-coding tools, or UI applications.
  • The models are not yet suitable for every "frontier" use case, but the main limitation is current model intelligence, not architectural constraints.

Model Release, Ecosystem, and Serving 20:25

  • Inception Labs does not plan to open-source its diffusion models, primarily because its inference engine and serving code are proprietary.
  • The inference engine is custom-built to optimize continuous batching, quantization, and kernel implementations, presenting unique engineering challenges and opportunities (a toy batching loop is sketched after this list).
  • Production workloads for diffusion LLMs present optimization possibilities distinct from those of autoregressive LLMs.
  • Interested engineers and researchers are invited to join the team, especially those excited by novel ML serving problems.
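
To illustrate why serving diffusion LLMs is a distinct engineering problem: in-flight requests sit at different refinement steps rather than at different decode positions, so a continuous-batching loop mixes them step by step. A toy scheduler sketch, assuming a `model_step` function that runs one denoising step over a batch (this is not Inception's engine):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    seq: list        # current, partially denoised token ids
    steps_left: int  # refinement steps remaining for this request

def serve(model_step, incoming: deque, max_batch: int = 8):
    """Toy continuous-batching loop for a diffusion LM."""
    active: list[Request] = []
    while incoming or active:
        # Admit new requests as soon as slots free up, instead of
        # waiting for the whole batch to drain (continuous batching).
        while incoming and len(active) < max_batch:
            active.append(incoming.popleft())

        # One denoising step over every active sequence at once.
        refined = model_step([r.seq for r in active])
        for req, seq in zip(active, refined):
            req.seq, req.steps_left = seq, req.steps_left - 1

        # Retire finished requests; their slots refill next iteration.
        for req in [r for r in active if r.steps_left == 0]:
            yield req.seq
        active = [r for r in active if r.steps_left > 0]
```

A real engine would also pad or pack variable-length sequences and overlap admission with compute; the point is only that scheduling happens per refinement step, not per token.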

The Future of Diffusion LLMs and Industry Trends 23:15

  • Ermon predicts a possible shift in which most LLMs eventually use diffusion because of its superior inference efficiency and scalability.
  • The main driver would be the growing demand for tokens and the need to maximize data center and energy efficiency.
  • Diffusion models can already handle long context lengths (up to 128k tokens) using standard training pipelines, and ongoing research focuses on implementing effective caching strategies.
  • Ongoing R&D is expected to yield significant improvements as the field is still new and suboptimal in many design choices.

Customer Fit and Ideal Problems 26:52

  • Diffusion LLMs are ideal for customers needing higher quality at low latency, especially where current autoregressive solutions force small models that sacrifice performance.
  • Their biggest strengths appear in latency-sensitive applications where end-user feedback must be fast and accurate.
  • Inception Labs works with organizations interested in fine-tuning models on proprietary data to improve their quality and deployment efficiency.