⚡️Mercury: Ultra-Fast Diffusion LLMs — Stefano Ermon, CEO of Inception Labs
Origins and Development of Diffusion Language Models 00:03
Stefano Ermon has been researching generative models since 2014-2015, initially focusing on image generation using GANs.
Dissatisfaction with GANs led to early work on diffusion models, which generate images by iterative refinement rather than in one shot.
Ermon’s academic work contributed to foundational diffusion models for image generation, influencing industry adoption for images and video.
Attempts to use diffusion models for discrete data (text, code, DNA) began around 2020-2021; this was challenging but ultimately successful at small scales.
Inception Labs formed in summer 2023 to develop large-scale, commercial diffusion language models, resulting in Mercury and Mercury Coder offerings.
Standard language models are autoregressive, predicting one next token at a time; diffusion models refine the full sequence over multiple steps, modifying many tokens at once.
Diffusion models for text are trained by adding noise to sequences (e.g., masking, flipping tokens) and teaching the model to denoise and reconstruct the original.
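A minimal sketch of that masking-style objective in PyTorch. The toy denoiser, vocabulary size, and masking schedule below are illustrative assumptions; Mercury's actual architecture and loss are not public.

```python
import torch
import torch.nn.functional as F

# Toy denoiser: a small bidirectional transformer over token embeddings.
# Hypothetical sizes; Mercury's real architecture and loss are not public.
VOCAB_SIZE, MASK_ID, SEQ_LEN, DIM = 1000, 0, 64, 128

embed = torch.nn.Embedding(VOCAB_SIZE, DIM)
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=2,
)
to_logits = torch.nn.Linear(DIM, VOCAB_SIZE)

def masked_denoising_loss(tokens: torch.Tensor) -> torch.Tensor:
    """Corrupt a clean sequence by masking random positions, then train the
    model to reconstruct the original tokens at those positions."""
    mask_ratio = torch.empty(()).uniform_(0.15, 1.0)   # sampled noise level
    is_masked = torch.rand(tokens.shape) < mask_ratio
    corrupted = torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)

    hidden = encoder(embed(corrupted))                  # full, non-causal context
    logits = to_logits(hidden)

    # Cross-entropy only on the positions that were corrupted.
    return F.cross_entropy(logits[is_masked], tokens[is_masked])

batch = torch.randint(1, VOCAB_SIZE, (8, SEQ_LEN))      # stand-in "clean" data
print(masked_denoising_loss(batch))
```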
Parallel editing allows diffusion models to generate responses much faster than traditional autoregressive models.
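And a correspondingly minimal sketch of parallel generation: start from a fully masked sequence and, at each denoising step, commit the most confident predictions in parallel rather than emitting one token at a time. The confidence-based unmasking schedule and the stand-in denoiser are assumptions for illustration, not Mercury's actual sampler.

```python
import torch

VOCAB_SIZE, MASK_ID, SEQ_LEN, STEPS = 1000, 0, 32, 4

@torch.no_grad()
def parallel_generate(denoiser, steps=STEPS):
    """Fill a fully masked sequence over a few denoising steps, committing the
    most confident predictions at each step (confidence-based unmasking)."""
    tokens = torch.full((1, SEQ_LEN), MASK_ID)
    for step in range(steps):
        logits = denoiser(tokens)                 # one pass scores every position
        confidence, candidates = logits.softmax(-1).max(-1)

        still_masked = tokens == MASK_ID
        confidence = confidence.masked_fill(~still_masked, -1.0)

        # Unmask a fraction of the remaining masked positions this step.
        k = max(1, still_masked.sum().item() // (steps - step))
        chosen = confidence.topk(k, dim=-1).indices[0]
        tokens[0, chosen] = candidates[0, chosen]
    return tokens

# Untrained stand-in denoiser so the sketch runs end to end; in practice this
# would be the trained bidirectional transformer from the training sketch.
dummy = torch.nn.Sequential(torch.nn.Embedding(VOCAB_SIZE, 64),
                            torch.nn.Linear(64, VOCAB_SIZE))
print(parallel_generate(dummy))
```

The speedup comes from the step count being small and roughly fixed relative to the number of tokens, so many tokens are decided per forward pass.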
Mercury diffusion language models use transformer architectures.
It is difficult to adapt existing pre-trained autoregressive LLMs because the training objective is radically different: diffusion models are non-causal and attend to the full context.
Some elements, such as the architecture and standard training datasets, can be reused, but the training losses and objectives are novel and distinct from those of autoregressive models.
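One concrete way to see the non-causal, full-context difference is the attention mask: an autoregressive decoder attends only to earlier positions, while a diffusion denoiser attends over the whole sequence. A small PyTorch illustration (tensor shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

# Toy query/key/value tensors: (batch, heads, sequence, head_dim).
q = k = v = torch.randn(1, 4, 8, 16)

# Autoregressive LLM: causal mask, each position sees only its prefix.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Diffusion denoiser: no causal mask, every position attends to the whole sequence.
bidirectional_out = F.scaled_dot_product_attention(q, k, v, is_causal=False)

print(causal_out.shape, bidirectional_out.shape)  # both torch.Size([1, 4, 8, 16])
```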
Inference with diffusion language models is significantly more efficient than with autoregressive LLMs: better throughput at the same latency, or lower latency at the same throughput.
This results in lower compute costs and faster response times, making diffusion LLMs attractive for latency-sensitive applications.
Diffusion LLMs let larger, higher-quality models fit real-time use cases that previously required smaller, faster (but less capable) autoregressive models.
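A back-of-envelope illustration of why fewer, parallel refinement passes create this latency/throughput headroom. All numbers below are made-up placeholders for the shape of the argument, not measured Mercury figures, and real per-pass costs differ between the two approaches.

```python
# Hypothetical numbers purely to illustrate the trade-off, not benchmarks.
tokens_to_generate = 1024
forward_pass_ms = 20           # assumed cost of one full model forward pass

# Autoregressive: one forward pass per generated token (KV caching makes each
# pass cheaper in practice, but the pass count still scales with length).
ar_latency_ms = tokens_to_generate * forward_pass_ms

# Diffusion: a small, roughly fixed number of denoising passes, each of which
# updates many tokens in parallel.
denoising_steps = 32           # assumed
diff_latency_ms = denoising_steps * forward_pass_ms

for name, ms in [("autoregressive", ar_latency_ms), ("diffusion", diff_latency_ms)]:
    print(f"{name:>14}: {ms / 1000:.1f}s  ({tokens_to_generate / (ms / 1000):.0f} tok/s)")
```

The passes saved can instead be spent on a larger model at equal latency, which is the trade-off described above.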
Inception Labs does not plan to open-source its diffusion models, primarily because the inference engine and surrounding code are proprietary.
The inference engine is custom-built to optimize continuous batching, quantization, and kernel implementation, presenting unique engineering challenges and opportunities.
Production workloads for diffusion LLMs present new optimization possibilities distinct from those of autoregressive LLMs.
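For instance, a toy sketch of the continuous-batching idea mentioned above: finished requests leave the batch after every step and waiting ones are admitted, keeping the accelerator busy. The request structure, step counts, and scheduling policy are invented for illustration and say nothing about Inception's actual engine.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    steps_left: int          # denoising steps remaining for this request

def continuous_batching(requests, max_batch: int):
    """Toy continuous-batching loop: after every model step, finished requests
    leave the batch and waiting ones are admitted, so the batch stays full.
    A real engine also manages memory, padding/packing, and custom kernels."""
    waiting = deque(requests)
    running: list[Request] = []
    step = 0
    while waiting or running:
        # Admit new requests into any free batch slots.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        # One batched model step advances every active request together.
        for req in running:
            req.steps_left -= 1
        done = [r.rid for r in running if r.steps_left == 0]
        running = [r for r in running if r.steps_left > 0]
        step += 1
        if done:
            print(f"step {step}: completed {done}, batch size now {len(running)}")

reqs = [Request(rid=i, steps_left=s) for i, s in enumerate([3, 5, 2, 4, 6])]
continuous_batching(reqs, max_batch=3)
```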
Interested engineers and researchers are invited to join the team, especially those excited by novel ML serving problems.
The Future of Diffusion LLMs and Industry Trends 23:15
Ermon predicts a possible shift where most LLMs could eventually use diffusion due to superior inference efficiency and scalability.
The main driver would be the growing demand for tokens and the need to maximize data center and energy efficiency.
Diffusion models can already handle long context lengths (up to 128k tokens) using standard training pipelines, and ongoing research focuses on implementing effective caching strategies.
Ongoing R&D is expected to yield significant improvements, as the field is still new and many design choices remain suboptimal.
Diffusion LLMs are ideal for customers needing higher quality at low latency, especially when current autoregressive solutions force them onto small models that sacrifice performance.
The biggest strengths show up in latency-sensitive applications where end-user feedback must be fast and accurate.
Inception Labs works with organizations interested in fine-tuning models on proprietary data to improve model quality and deployment efficiency.