The idea for ElevenLabs originated from a poor dubbing experience in Poland, where foreign films are typically narrated by a single monotone voice-over.
ElevenLabs built a strongly defensible position in voice AI by focusing narrowly on audio, in contrast to larger AI labs that pursued multimodality.
Initial technical innovation included bringing diffusion and transformer model ideas to audio, leading to higher-quality, more expressive text-to-speech models.
Effective deployment and usability of their systems for use cases like audiobooks and dubbing were as essential as model quality itself.
The founders, Mati and Piotr, met in high school in Poland, maintained a long-term friendship, and collaborated on various technical projects before founding ElevenLabs.
Voice AI poses unique challenges distinct from text AI, including limited availability of high-quality paired audio-transcript data.
Quality audio datasets with nuanced attributes (emotions, non-verbal cues, speaker characteristics) are rare compared to text datasets.
Early work required building pipelines combining speech-to-text, manual labeling, and emotion annotation.
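Such a pipeline can be sketched as a minimal first-pass annotator. The cue table and `annotate` helper below are hypothetical stand-ins for the manual labeling and emotion-annotation stages described above, not the actual tooling:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    audio_id: str
    transcript: str       # produced by a speech-to-text pass (supplied directly here)
    emotion: str = "neutral"

# Hypothetical keyword-based first-pass labeler; a real pipeline would
# combine model predictions with manual review by human annotators.
EMOTION_CUES = {"haha": "amused", "!": "excited", "sorry": "apologetic"}

def annotate(clip: Clip) -> Clip:
    text = clip.transcript.lower()
    for cue, label in EMOTION_CUES.items():
        if cue in text:
            clip.emotion = label  # first matching cue wins
            break
    return clip

corpus = [
    Clip("a1", "Haha, that was great!"),
    Clip("a2", "The meeting starts at noon."),
]
labeled = [annotate(c) for c in corpus]
```

A first pass like this only bootstraps labels; the point is that paired audio-transcript-emotion data has to be manufactured, not downloaded.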
Audio models must predict sounds rather than text tokens, and often require deeper contextual understanding and expressivity (e.g., sarcasm, emotion).
The model architecture includes separate consideration of text and voice characteristics, enabling preservation of original voice qualities across languages.
The company took a flexible approach with model inputs, letting the model infer defining voice features rather than manually specifying them.
Early adoption was driven by releasing technology broadly to consumers and learning from unexpected user innovations.
Initial virality came from book authors using an early beta to produce AI-narrated audiobooks, and from being one of the first AI models able to convincingly replicate a laugh.
ElevenLabs played a significant role in powering the "no face" narration trend in content creation.
Expansion into multiple European languages and the launch of a dubbing product reinforced the original mission: making content accessible across languages while preserving the original voice.
High-profile partnerships included working with Epic Games on a Darth Vader voice for in-game agents, and translating interviews (e.g., Lex Fridman with Narendra Modi) for global audiences.
The rise of AI voice agents for both consumer and enterprise use has driven recurring waves of adoption.
Widespread efforts exist to build voice-based agents, both in startups and enterprises, aiming for humanlike interaction.
Voice is seen as a fundamentally richer interface than text, conveying emotion, intent, and nonverbal communication.
Key use cases include healthcare (e.g., nurse call automation), customer support, and education (e.g., chess.com using iconic voices as learning aids).
For enterprises, successful deployment of agents depends not just on voice but also on integrating business logic and complex workflows.
Technical bottlenecks often reside more in backend business integration than the voice interface itself.
The main engineering challenges for enterprise deployments are integrations: phone systems, CRM, and other business software.
ElevenLabs builds a flexible stack that lets companies plug in knowledge bases, retrieve information in real time, and call functions and integrations as needed.
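A minimal sketch of such a pluggable stack, assuming a hypothetical `register_tool` routing scheme and a keyword knowledge base (not ElevenLabs' actual API):

```python
from typing import Callable

class VoiceAgent:
    """Sketch: an agent that routes a user utterance to either a
    knowledge-base lookup or a registered tool function."""

    def __init__(self, knowledge: dict[str, str]):
        self.knowledge = knowledge          # pluggable knowledge base
        self.tools: dict[str, Callable[[str], str]] = {}

    def register_tool(self, name: str, fn: Callable[[str], str]) -> None:
        self.tools[name] = fn               # e.g. a CRM or order-system lookup

    def handle(self, utterance: str) -> str:
        # Hypothetical routing: a tool-name prefix wins, else KB retrieval.
        for name, fn in self.tools.items():
            if utterance.startswith(name + ":"):
                return fn(utterance.split(":", 1)[1].strip())
        for key, answer in self.knowledge.items():
            if key in utterance.lower():
                return answer
        return "Let me connect you with a human."

agent = VoiceAgent({"hours": "We are open 9am-5pm."})
agent.register_tool("order", lambda oid: f"Order {oid} has shipped.")
```

The design choice the sketch illustrates: the voice layer stays thin while business logic lives in the tools and knowledge base that each customer plugs in.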
Complexity increases with enterprise scale and the diversity of knowledge and integration needs.
They aim for broad interoperability, including with other AI foundation models (Anthropic, OpenAI, etc.), using cascading fallback mechanisms when models fail.
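A cascading fallback can be sketched as trying each provider in order until one succeeds; the `flaky_model` providers below are simulated stand-ins, not real Anthropic/OpenAI calls:

```python
import random

def flaky_model(name: str, fail_rate: float):
    """Simulated provider call; real code would hit a provider's API."""
    def call(prompt: str) -> str:
        if random.random() < fail_rate:
            raise RuntimeError(f"{name} unavailable")
        return f"{name}: reply to {prompt!r}"
    return call

def with_fallback(providers, prompt: str) -> str:
    last_err = None
    for call in providers:
        try:
            return call(prompt)
        except RuntimeError as err:
            last_err = err      # log and cascade to the next provider
    raise RuntimeError("all providers failed") from last_err

# Primary always fails here, so requests cascade to the backup.
chain = [flaky_model("primary", 1.0), flaky_model("backup", 0.0)]
```

Production systems would add timeouts and health tracking, but the cascade itself is this simple: ordered providers, first success wins.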
Mati believes "human or superhuman" quality, near-zero latency voice interaction could be achieved as soon as this year or early 2026.
Passing the Turing test for voice depends on whether a cascaded architecture (separate speech-to-text, LLM, and text-to-speech models) or a unified duplex model is used.
Duplex models (jointly trained) may offer higher expressivity, but current cascaded models are more reliable.
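The cascaded architecture can be illustrated with three chained stubs; real deployments would substitute actual ASR, LLM, and TTS models for these placeholders:

```python
# Cascaded voice-agent turn: three independent stages chained together.
# Each function is a stub standing in for a real model (ASR, LLM, TTS).

def speech_to_text(audio: bytes) -> str:
    return audio.decode("utf-8")            # stub: pretend audio is UTF-8 text

def llm_reply(text: str) -> str:
    return f"You said: {text}"              # stub dialogue model

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")             # stub synthesis

def cascaded_turn(audio_in: bytes) -> bytes:
    # Latency and information loss accumulate across the three hops;
    # a duplex model would fuse these stages into one jointly trained network.
    return text_to_speech(llm_reply(speech_to_text(audio_in)))
```

The structural point: each hop discards signal (prosody, emotion) the next stage never sees, which is why duplex models may be more expressive even though cascades are easier to make reliable today.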
ElevenLabs ensures all generated audio can be traced to its source account (provenance).
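One way to sketch account-level provenance (an assumption for illustration, not ElevenLabs' actual mechanism) is a keyed signature binding an audio hash to the generating account:

```python
import hashlib
import hmac

SERVER_KEY = b"secret-signing-key"   # hypothetical; held only by the provider

def provenance_tag(account_id: str, audio: bytes) -> dict:
    """Produce a signed record linking generated audio to the account
    that produced it, so the audio can later be traced to its source."""
    digest = hashlib.sha256(audio).hexdigest()
    sig = hmac.new(SERVER_KEY, f"{account_id}:{digest}".encode(), "sha256").hexdigest()
    return {"account": account_id, "audio_sha256": digest, "sig": sig}

def verify(tag: dict, audio: bytes) -> bool:
    """Recompute the tag for the claimed account and compare signatures."""
    expected = provenance_tag(tag["account"], audio)
    return hmac.compare_digest(expected["sig"], tag["sig"])
```

Real systems might instead embed inaudible watermarks in the audio itself; the sketch only shows the traceability property, not the embedding.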
Ongoing work includes voice and text moderation, fraud detection, and collaborative efforts to develop detection systems for AI-generated content (with universities, other companies).
As AI voice use spreads, verification mechanisms for both AI and human callers are expected to become more pervasive.
Mati’s favorite AI tools include Perplexity (for deep source understanding), Google Maps, and Lovable (for prototyping with ElevenLabs).
He admires Demis Hassabis for his research-to-leadership trajectory and impact on multiple domains.
Underhyped prediction: cross-lingual voice communication will profoundly reshape global society, but enabling devices and form factors are still evolving (e.g., headphones, glasses, or eventually neural interfaces).