⚡️Mercury: Ultra-Fast Diffusion LLMs — Stefano Ermon, CEO of Inception Labs
Origins and Development of Diffusion Language Models 00:03
Stefano Ermon has been researching generative models since 2014-2015, initially focusing on image generation using GANs.
Dissatisfaction with GANs led to early work on diffusion models, which generate images by iterative refinement rather than in one shot.
Ermon’s academic work contributed to foundational diffusion models for image generation, influencing industry adoption for images and video.
Attempts to use diffusion models for discrete data (text, code, DNA) began around 2020-2021; this was challenging but ultimately successful at small scales.
Inception Labs formed in summer 2023 to develop large-scale, commercial diffusion language models, resulting in Mercury and Mercury Coder offerings.
Standard language models are autoregressive, predicting one next token at a time; diffusion models refine the full sequence over multiple steps, modifying many tokens at once.
Diffusion models for text are trained by adding noise to sequences (e.g., masking, flipping tokens) and teaching the model to denoise and reconstruct the original.
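A minimal sketch of that masking-style objective in PyTorch. The toy denoiser, vocabulary size, and masking schedule below are illustrative assumptions; Mercury's actual architecture and loss are not public.

```python
import torch
import torch.nn.functional as F

# Toy denoiser: a small bidirectional transformer over token embeddings.
# Hypothetical sizes; Mercury's real architecture and loss are not public.
VOCAB_SIZE, MASK_ID, SEQ_LEN, DIM = 1000, 0, 64, 128

embed = torch.nn.Embedding(VOCAB_SIZE, DIM)
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=2,
)
to_logits = torch.nn.Linear(DIM, VOCAB_SIZE)

def masked_denoising_loss(tokens: torch.Tensor) -> torch.Tensor:
    """Corrupt a clean sequence by masking random positions, then train the
    model to reconstruct the original tokens at those positions."""
    mask_ratio = torch.empty(()).uniform_(0.15, 1.0)   # sampled noise level
    is_masked = torch.rand(tokens.shape) < mask_ratio
    corrupted = torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)

    hidden = encoder(embed(corrupted))                  # full, non-causal context
    logits = to_logits(hidden)

    # Cross-entropy only on the positions that were corrupted.
    return F.cross_entropy(logits[is_masked], tokens[is_masked])

batch = torch.randint(1, VOCAB_SIZE, (8, SEQ_LEN))      # stand-in "clean" data
print(masked_denoising_loss(batch))
```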
Parallel editing allows diffusion models to generate responses much faster than traditional autoregressive models.
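And a correspondingly minimal sketch of parallel generation: start from a fully masked sequence and, at each denoising step, commit the most confident predictions in parallel rather than emitting one token at a time. The confidence-based unmasking schedule and the stand-in denoiser are assumptions for illustration, not Mercury's actual sampler.

```python
import torch

VOCAB_SIZE, MASK_ID, SEQ_LEN, STEPS = 1000, 0, 32, 4

@torch.no_grad()
def parallel_generate(denoiser, steps=STEPS):
    """Fill a fully masked sequence over a few denoising steps, committing the
    most confident predictions at each step (confidence-based unmasking)."""
    tokens = torch.full((1, SEQ_LEN), MASK_ID)
    for step in range(steps):
        logits = denoiser(tokens)                 # one pass scores every position
        confidence, candidates = logits.softmax(-1).max(-1)

        still_masked = tokens == MASK_ID
        confidence = confidence.masked_fill(~still_masked, -1.0)

        # Unmask a fraction of the remaining masked positions this step.
        k = max(1, still_masked.sum().item() // (steps - step))
        chosen = confidence.topk(k, dim=-1).indices[0]
        tokens[0, chosen] = candidates[0, chosen]
    return tokens

# Untrained stand-in denoiser so the sketch runs end to end; in practice this
# would be the trained bidirectional transformer from the training sketch.
dummy = torch.nn.Sequential(torch.nn.Embedding(VOCAB_SIZE, 64),
                            torch.nn.Linear(64, VOCAB_SIZE))
print(parallel_generate(dummy))
```

The speedup comes from the step count being small and roughly fixed relative to the number of tokens, so many tokens are decided per forward pass.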
Mercury diffusion language models use transformer architectures.
It is difficult to adapt existing pre-trained autoregressive LLMs because the training objective is radically different: diffusion models are non-causal and attend to the full context.
Some elements, such as the architecture and standard training datasets, can be reused, but the training losses and objectives are novel and distinct from those of autoregressive models.
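One concrete way to see the non-causal, full-context difference is the attention mask: an autoregressive decoder attends only to earlier positions, while a diffusion denoiser attends over the whole sequence. A small PyTorch illustration (tensor shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

# Toy query/key/value tensors: (batch, heads, sequence, head_dim).
q = k = v = torch.randn(1, 4, 8, 16)

# Autoregressive LLM: causal mask, each position sees only its prefix.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Diffusion denoiser: no causal mask, every position attends to the whole sequence.
bidirectional_out = F.scaled_dot_product_attention(q, k, v, is_causal=False)

print(causal_out.shape, bidirectional_out.shape)  # both torch.Size([1, 4, 8, 16])
```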
Inference with diffusion language models is significantly more efficient than with autoregressive LLMs: better throughput at the same latency, or lower latency at the same throughput.
This results in lower compute costs and faster response times, making diffusion LLMs attractive for latency-sensitive applications.
Diffusion LLMs let larger, higher-quality models fit real-time use cases that previously required smaller, faster (but less capable) autoregressive models.
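A back-of-envelope illustration of why fewer, parallel refinement passes create this latency/throughput headroom. All numbers below are made-up placeholders for the shape of the argument, not measured Mercury figures, and real per-pass costs differ between the two approaches.

```python
# Hypothetical numbers purely to illustrate the trade-off, not benchmarks.
tokens_to_generate = 1024
forward_pass_ms = 20           # assumed cost of one full model forward pass

# Autoregressive: one forward pass per generated token (KV caching makes each
# pass cheaper in practice, but the pass count still scales with length).
ar_latency_ms = tokens_to_generate * forward_pass_ms

# Diffusion: a small, roughly fixed number of denoising passes, each of which
# updates many tokens in parallel.
denoising_steps = 32           # assumed
diff_latency_ms = denoising_steps * forward_pass_ms

for name, ms in [("autoregressive", ar_latency_ms), ("diffusion", diff_latency_ms)]:
    print(f"{name:>14}: {ms / 1000:.1f}s  ({tokens_to_generate / (ms / 1000):.0f} tok/s)")
```

The passes saved can instead be spent on a larger model at equal latency, which is the trade-off described above.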
Inception Labs does not plan to open-source its diffusion models, primarily because the inference engine and surrounding code are proprietary.
The inference engine is custom-built to optimize continuous batching, quantization, and kernel implementation, presenting unique engineering challenges and opportunities.
Production workloads for diffusion LLMs present new optimization possibilities distinct from those of autoregressive LLMs.
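For instance, a toy sketch of the continuous-batching idea mentioned above: finished requests leave the batch after every step and waiting ones are admitted, keeping the accelerator busy. The request structure, step counts, and scheduling policy are invented for illustration and say nothing about Inception's actual engine.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    steps_left: int          # denoising steps remaining for this request

def continuous_batching(requests, max_batch: int):
    """Toy continuous-batching loop: after every model step, finished requests
    leave the batch and waiting ones are admitted, so the batch stays full.
    A real engine also manages memory, padding/packing, and custom kernels."""
    waiting = deque(requests)
    running: list[Request] = []
    step = 0
    while waiting or running:
        # Admit new requests into any free batch slots.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        # One batched model step advances every active request together.
        for req in running:
            req.steps_left -= 1
        done = [r.rid for r in running if r.steps_left == 0]
        running = [r for r in running if r.steps_left > 0]
        step += 1
        if done:
            print(f"step {step}: completed {done}, batch size now {len(running)}")

reqs = [Request(rid=i, steps_left=s) for i, s in enumerate([3, 5, 2, 4, 6])]
continuous_batching(reqs, max_batch=3)
```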
Interested engineers and researchers are invited to join the team, especially those excited by novel ML serving problems.
The Future of Diffusion LLMs and Industry Trends 23:15
Ermon predicts a possible shift where most LLMs could eventually use diffusion due to superior inference efficiency and scalability.
The main driver would be the growing demand for tokens and the need to maximize data center and energy efficiency.
Diffusion models can already handle long context lengths (up to 128k tokens) using standard training pipelines, and ongoing research focuses on implementing effective caching strategies.
Ongoing R&D is expected to yield significant improvements, as the field is still new and many design choices remain suboptimal.
Diffusion LLMs are ideal for customers needing higher quality at low latency, especially when current autoregressive solutions force them onto small models that sacrifice performance.
The biggest strengths show up in latency-sensitive applications where end-user feedback must be fast and accurate.
Inception Labs works with organizations interested in fine-tuning models on proprietary data to improve model quality and deployment efficiency.