Current AI applications suffer from long generation wait times, especially in agentic workflows that chain many model calls, which makes the user experience feel slow.
The usual workarounds, switching to smaller "dumber" models or investing in better hardware, are both suboptimal for most users.
Diffusion language models (diffusion LMs) are emerging as a promising solution to these speed limitations.
Inception Labs introduced the first commercial-ready diffusion LM, Mercury Coder Mini, capable of generating up to 1,119 tokens per second, about five times faster than comparable small coding models like Gemini 2.0 Flash-Lite and Qwen 2.5 Coder 7B.
Google DeepMind also released Gemini Diffusion, which shows even faster generation speeds.
Unlike auto-regressive models that generate tokens sequentially, diffusion LMs start with a fully masked sequence and iteratively denoise it, predicting and unmasking parts of the sequence in parallel.
This enables bidirectional or global context awareness, making diffusion LMs suitable for structured or logical tasks.
The architecture still relies on transformers; the key difference is in the denoising generation process.
The left-to-right generation effect seen in interfaces is often just a display setting; diffusion LMs can update tokens in parallel.
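As a concrete illustration of that loop, here is a minimal toy sketch of confidence-based parallel unmasking; the stub model and every name in it are illustrative assumptions, not Mercury's or Gemini Diffusion's actual decoder.

```python
import torch

VOCAB_SIZE = 100   # toy vocabulary; id 0 is reserved as the [MASK] token
MASK_ID = 0
SEQ_LEN = 16
NUM_STEPS = 4      # denoising iterations; fewer steps = faster generation

def model_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for a bidirectional transformer: returns logits for EVERY
    position in one forward pass, conditioned on the whole (partially
    masked) sequence -- this is where the parallelism comes from."""
    torch.manual_seed(int(tokens.sum()))  # deterministic stub, not a real model
    return torch.randn(SEQ_LEN, VOCAB_SIZE)

tokens = torch.full((SEQ_LEN,), MASK_ID)       # start fully masked
for step in range(NUM_STEPS):
    probs = model_logits(tokens).softmax(dim=-1)
    probs[:, MASK_ID] = 0.0                    # never predict the mask token
    conf, pred = probs.max(dim=-1)             # per-position best guess
    masked = tokens == MASK_ID
    if not masked.any():
        break
    # Commit the most confident masked positions in parallel,
    # roughly 1/remaining_steps of what is still masked.
    k = max(1, int(masked.sum()) // (NUM_STEPS - step))
    conf[~masked] = -1.0                       # only masked slots compete
    commit = conf.topk(k).indices
    tokens[commit] = pred[commit]

print(tokens)  # fully unmasked after NUM_STEPS parallel passes
```

Real systems differ in the remasking schedule and in how many tokens they commit per step, but the core pattern of one full-sequence forward pass per denoising iteration is where the parallel speedup comes from.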
Performance Comparisons: Speed and Benchmarks 05:32
Mercury Coder Small (32K context window) is served via API at a cost of $0.25 per million input tokens and $1 per million output tokens.
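At those rates, per-request cost is a quick calculation (the token counts below are made-up examples):

```python
# Back-of-the-envelope cost at the quoted Mercury Coder Small pricing:
# $0.25 per 1M input tokens, $1.00 per 1M output tokens.
input_tokens, output_tokens = 50_000, 10_000   # hypothetical request
cost = input_tokens / 1e6 * 0.25 + output_tokens / 1e6 * 1.00
print(f"${cost:.4f}")  # -> $0.0225
```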
Mercury achieves about 800 tokens per second on an H100 GPU; the fastest comparable auto-regressive models reach about 331 tokens per second on mainstream hardware, and up to 2,522 tokens per second on custom inference hardware.
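Put in terms of latency, the quoted throughputs imply roughly the following for a 1,000-token completion:

```python
# Time to stream a 1,000-token completion at the throughputs quoted above.
for name, tps in [("Mercury (H100)", 800), ("AR mainstream", 331), ("AR custom hw", 2522)]:
    print(f"{name}: {1000 / tps:.2f}s")
# Mercury (H100): 1.25s | AR mainstream: 3.02s | AR custom hw: 0.40s
```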
On coding benchmarks, Mercury matches or surpasses lightweight code-focused models and excels in infilling tasks (22-30% higher accuracy).
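The infilling edge follows from the architecture: the denoiser conditions on the suffix as well as the prefix at every step. A purely illustrative framing, where `<mask>` is a placeholder rather than any real API:

```python
# Illustrative only: framing an infilling task for a masked diffusion LM.
MASK = "<mask>"
prefix = "def mean(xs):\n    total = "
suffix = "\n    return total / len(xs)"
middle = " ".join([MASK] * 4)   # masked span, denoised in parallel
print(prefix + middle + suffix)
# The denoiser sees prefix AND suffix at every step, so the infill can agree
# with both sides at once; a left-to-right model only sees the prefix unless
# it was specially trained for fill-in-the-middle.
```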
In the Copilot Arena benchmark (which measures code autocompletion), Mercury places 6th, in the company of older non-reasoning models, reflecting real progress but also current limitations.
Progress and Limitations Relative to Auto-Regressive Models 07:30
Diffusion LMs in coding tasks are roughly 8-10 months behind top private models, and about 3 months behind top open-source models.
Google's Gemini Diffusion roughly doubles Mercury's speed, generating up to 1,500 tokens per second with a similar context window.
However, Gemini Diffusion lags behind auto-regressive models on language and reasoning benchmarks (up to 16% lower accuracy on GPQA, 10% lower on multilingual QA, and 6% lower on reasoning tasks).
The speed of diffusion LMs makes them attractive, but they struggle to match auto-regressive models on complex reasoning and language tasks due to scaling and data challenges.
Opportunities and Challenges for Diffusion LMs 08:55
To close performance gaps, diffusion LMs need more efficient training, especially with synthetic reasoning data, and could benefit from further exploration of model size and denoising techniques.
Open-source diffusion LMs currently face practical challenges, including missing inference optimizations and key-value (KV) caching issues that limit out-of-the-box speed.
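One way to see the KV-caching problem, as a conceptual sketch with assumed shapes rather than any particular library's API: auto-regressive decoding only appends tokens, so past keys and values can be cached, while masked diffusion rewrites tokens in place under bidirectional attention, invalidating a naive cache.

```python
import torch

d, n = 64, 16
W_k = torch.randn(d, d)   # toy key projection

# Auto-regressive decoding: the cache only grows; old entries are reused as-is.
kv_cache = []
for t in range(n):
    x_t = torch.randn(d)          # embedding of the single new token
    kv_cache.append(x_t @ W_k)    # O(1) new projections per step

# Masked diffusion decoding: tokens change in place each step, so the whole
# sequence must be re-projected -- a naive cache would go stale immediately.
tokens = torch.randn(n, d)
for step in range(4):
    K = tokens @ W_k              # O(n) projections every denoising step
    tokens = tokens + 0.01 * torch.randn(n, d)  # stand-in for token updates
```

Approximate caches that refresh only the positions that changed between steps are an active optimization direction, which is part of why open-source implementations do not yet reach their theoretical speeds out of the box.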
Research indicates diffusion LMs learn more efficiently when data is limited but compute is abundant, extracting up to 25 times more learning signal from the same data compared to auto-regressive models.
Diffusion LMs show potential to outperform traditional models in multimodal tasks, particularly where unified vision and language processing is required.
Recent research (the LaViDa and MMaDA papers) showcases how multimodal diffusion LMs can attend to both images and text globally, not just sequentially.
LaViDa lets every token, text or image, attend to every other token at each iteration, enabling complex tasks like generating structured outputs or coordinated text-image data.
MMaDA, inspired by Chameleon, introduces a unified model with discrete mask diffusion, achieving strong results across text understanding, vision reasoning, and text-to-image tasks.
MMaDA's open-source model matches or exceeds the performance of established language and vision models on several benchmarks, supporting consistent formatting and accelerated output.
Diffusion LMs offer global context at every step, enable faster multimodal outputs, and support more consistent structural control for tasks like infilling tables or generating formatted JSON.
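The structural-control point can be made concrete with a template-infilling framing (illustrative only, not any specific model's API): the JSON skeleton is held fixed and only the masked values are denoised, so braces, keys, and quoting cannot drift.

```python
import json

# Illustrative: infilling a fixed JSON template. Only the <mask> spans are
# generated; the surrounding structure is given, not sampled, so the output
# parses by construction once the masks are filled.
template = '{"name": "<mask>", "age": <mask>, "tags": ["<mask>", "<mask>"]}'
fills = ['Ada', '36', 'math', 'computing']      # stand-ins for denoised spans
out = template
for fill in fills:
    out = out.replace('<mask>', fill, 1)
print(json.loads(out))   # -> {'name': 'Ada', 'age': 36, 'tags': [...]}
```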
Future videos may delve deeper into technical mechanisms, efficiency challenges, and strategies for unifying loss signals in multimodal diffusion LMs.
Readers are encouraged to subscribe for future deep-dives and check out the presenter's newsletter for detailed research coverage.