ElevenLabs’ Mati Staniszewski: Why Voice Will Be the Fundamental Interface for Tech

Origin Story and Company Focus 00:00

  • The idea for ElevenLabs originated from a poor dubbing experience in Poland, where foreign movies are typically narrated by a single monotonous voice.
  • ElevenLabs built a strongly defensible position in voice AI by focusing narrowly on audio, in contrast to larger AI labs that pursued multimodality.
  • Initial technical innovation included bringing diffusion and transformer model ideas to audio, leading to higher-quality, more expressive text-to-speech models.
  • Effective deployment and usability of their systems for use cases like audiobooks and dubbing were as essential as model quality itself.
  • The founders, Mati and Piotr, met in high school in Poland, maintained a long-term friendship, and collaborated on various technical projects before founding ElevenLabs.

Technical Differences: Voice AI vs. Text AI 11:32

  • Voice AI poses unique challenges distinct from text AI, including limited availability of high-quality paired audio-transcript data.
  • Quality audio datasets with nuanced attributes (emotions, non-verbal cues, speaker characteristics) are rare compared to text datasets.
  • Early work required building pipelines combining speech-to-text, manual labeling, and emotion annotation (see the sketch after this list).
  • Audio models must predict sound rather than text tokens, and often require deeper contextual understanding and expressivity (e.g., sarcasm, emotion).
  • The architecture considers text content and voice characteristics separately, which lets the original voice's qualities be preserved across languages.
  • The company took a flexible approach with model inputs, letting the model infer defining voice features rather than manually specifying them.
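
A minimal sketch of the kind of labeling pipeline described above: automatic transcription plus emotion and speaker annotation, with low-confidence clips routed to human labelers. Function names, fields, and the confidence threshold are illustrative assumptions, not ElevenLabs' actual tooling.

```python
# Hypothetical audio-labeling pipeline: speech-to-text plus emotion and
# speaker annotation; uncertain clips go to human (voice-coach-trained)
# labelers. All names and thresholds here are illustrative placeholders.
from dataclasses import dataclass


@dataclass
class LabeledClip:
    audio_path: str
    transcript: str           # from a speech-to-text model
    emotion: str              # e.g. "neutral", "excited", "sarcastic"
    speaker_id: str
    needs_human_review: bool  # routed to manual labeling when True


def transcribe(audio_path: str) -> tuple[str, float]:
    """Stub for a speech-to-text call; returns (transcript, confidence)."""
    return "placeholder transcript", 0.92


def detect_emotion(audio_path: str) -> tuple[str, float]:
    """Stub for an emotion classifier; returns (label, confidence)."""
    return "neutral", 0.75


def label_clip(audio_path: str, speaker_id: str) -> LabeledClip:
    transcript, asr_conf = transcribe(audio_path)
    emotion, emo_conf = detect_emotion(audio_path)
    return LabeledClip(
        audio_path=audio_path,
        transcript=transcript,
        emotion=emotion,
        speaker_id=speaker_id,
        # Anything the automatic models are unsure about gets human review.
        needs_human_review=min(asr_conf, emo_conf) < 0.8,
    )
```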

Building the Team and Research Approach 15:52

  • ElevenLabs prioritizes hiring top global audio research talent and operates fully remotely to access a broader pool.
  • Researchers are kept close to deployment to ensure rapid turnaround from innovation to user-facing improvements.
  • The team structure combines pure researchers, research engineers, and data labelers trained by voice coaches to capture nuanced audio attributes.
  • A small, highly independent, and ownership-driven team culture is crucial due to the niche nature of audio research.

Product Development Milestones and Virality 19:04

  • Early adoption was driven by releasing technology broadly to consumers and learning from unexpected user innovations.
  • Initial virality came from book authors using an early beta to produce AI-narrated audiobooks, and from shipping one of the first AI models able to convincingly reproduce a laugh.
  • ElevenLabs played a significant role in powering the "no-face" narration trend in content creation.
  • Expansion into multiple European languages and the launch of a dubbing product reinforced the original mission: making content accessible across languages while preserving the original voice.
  • High-profile partnerships included working with Epic Games on a Darth Vader voice for in-game agents, and translating interviews (e.g., Lex Fridman with Narendra Modi) for global audiences.
  • The rise of AI voice agents for both consumer and enterprise use has driven recurring waves of adoption.

The Rise and Challenges of Voice Agents 26:22

  • Widespread efforts exist to build voice-based agents, both in startups and enterprises, aiming for humanlike interaction.
  • Voice is seen as a fundamentally richer interface than text, conveying emotion, intent, and nonverbal communication.
  • Key use cases include healthcare (e.g., nurse call automation), customer support, and education (e.g., chess.com using iconic voices as learning aids).
  • For enterprises, successful deployment of agents depends not just on voice but also on integrating business logic and complex workflows.
  • Technical bottlenecks often reside more in backend business integration than in the voice interface itself (a minimal tool-calling sketch follows this list).
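
To make the last point concrete, here is a minimal, hypothetical tool a voice agent might call mid-conversation. The voice layer only decides when to invoke it; the hard engineering is making such tools talk reliably to real business systems. `BookingSystem` and its methods are placeholders, not any real API.

```python
# Hedged sketch of an agent "tool": the voice model chooses when to call
# book_table, but the integration work lives behind BookingSystem, which
# here is a stand-in for a real backend (CRM, scheduler, EHR, ...).
from dataclasses import dataclass
from typing import Optional


@dataclass
class BookingSystem:
    def find_slot(self, date: str) -> Optional[str]:
        return "18:30"  # placeholder availability lookup

    def reserve(self, date: str, time: str, name: str) -> str:
        return f"REF-{date}-{time}-{name}"  # placeholder confirmation id


def book_table(backend: BookingSystem, date: str, name: str) -> str:
    """Tool exposed to the agent; returns text for the agent to speak."""
    slot = backend.find_slot(date)
    if slot is None:
        return f"No availability on {date}."
    ref = backend.reserve(date, slot, name)
    return f"Booked {slot} on {date} (confirmation {ref})."


if __name__ == "__main__":
    print(book_table(BookingSystem(), "2025-07-01", "Mati"))
```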

Engineering and Integrations 33:20

  • The main engineering challenges for enterprise deployments are integrations: phone systems, CRM, and other business software.
  • ElevenLabs builds a flexible stack that lets companies plug in knowledge bases, retrieve information in real time, and call functions/integrations as needed.
  • Complexity increases with enterprise scale and the diversity of knowledge and integration needs.
  • They aim for broad interoperability, including with other AI foundation models (Anthropic, OpenAI, etc.), using cascading fallbacks when a model fails (see the sketch after this list).
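
The cascading fallback mentioned above can be sketched in a few lines: try the primary model provider, retry briefly, then fall through to the next. The provider wrappers below are hypothetical stand-ins for real SDK calls; the pattern, not the calls, is the point.

```python
# Minimal sketch of cascading fallback across model providers.
# call_primary / call_secondary are hypothetical wrappers around real SDKs.
import time


def call_primary(prompt: str) -> str:
    raise TimeoutError("simulated outage")  # placeholder failure


def call_secondary(prompt: str) -> str:
    return f"response to: {prompt}"  # placeholder success


PROVIDERS = [call_primary, call_secondary]


def generate(prompt: str, retries_per_provider: int = 2) -> str:
    last_error = None
    for call in PROVIDERS:
        for attempt in range(retries_per_provider):
            try:
                return call(prompt)
            except Exception as err:  # timeout, rate limit, 5xx, ...
                last_error = err
                time.sleep(0.1 * (attempt + 1))  # brief backoff, then retry
    raise RuntimeError("all providers failed") from last_error


if __name__ == "__main__":
    print(generate("Summarize today's open support tickets."))
```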

What Customers Care About 37:42

  • Voice quality (expressiveness and accuracy), latency (speed of interaction), and reliability at scale are the top three priorities for customers.
  • Trade-offs exist between quality and speed, particularly as systems approach human-level performance (a rough latency budget follows).
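
To give the trade-off a rough shape, here is an illustrative latency budget for one turn of a cascaded voice agent. Every number is an assumption chosen for the arithmetic, not a measured figure.

```python
# Illustrative latency budget (milliseconds) for one conversational turn.
# All values are assumptions for illustration, not benchmarks.
BUDGET_MS = {
    "speech_to_text (end of user utterance)": 150,
    "llm_time_to_first_token": 300,
    "tts_time_to_first_audio": 150,
    "network_and_orchestration": 100,
}

print(f"time to first audible response: {sum(BUDGET_MS.values())} ms")
# Pushing quality up (a larger LLM, more expressive TTS) typically inflates
# the middle two entries, which is the quality/speed trade-off noted above.
```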

Approaching Human-Level Voice AI 39:14

  • Mati believes "human or superhuman" quality, near-zero-latency voice interaction could be achieved as soon as this year or early 2026.
  • Passing a Turing test for voice depends on whether a cascaded architecture (separate speech-to-text, LLM, and text-to-speech stages) or a unified duplex model is used (a sketch of the cascaded flow follows this list).
  • Duplex models (jointly trained) may offer higher expressivity, but current cascaded models are more reliable.
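
For reference, the cascaded architecture chains three separate models per turn, while a duplex model would map audio to audio in one jointly trained system. A minimal sketch of the cascaded flow, with placeholder stage functions rather than any vendor's API:

```python
# Sketch of the cascaded pipeline: speech-to-text -> LLM -> text-to-speech.
# Each stage is a placeholder stub standing in for a separate model.
def speech_to_text(audio: bytes) -> str:
    return "placeholder transcript"


def llm_reply(transcript: str) -> str:
    return f"reply to: {transcript}"


def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")  # stand-in for synthesized audio


def voice_turn(user_audio: bytes) -> bytes:
    """One conversational turn; latency is the sum of the three stages."""
    transcript = speech_to_text(user_audio)
    reply = llm_reply(transcript)
    return text_to_speech(reply)


if __name__ == "__main__":
    print(voice_turn(b"\x00\x01"))
```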

Vision for Technology’s Future 41:37

  • Voice will become the default interface for technology, with agents commonly used for learning, daily tasks, and global communication.
  • Education will be deeply personalized, with users having voice-based tutors and helpers.
  • Real-time translators will allow anyone to communicate naturally across languages with their own voice and emotional tone preserved.
  • Agents will handle tasks ranging from booking reservations to note-taking, often through voice-mediated agent-to-agent communication.

Safety and Authentication in Voice AI 44:33

  • ElevenLabs ensures all generated audio can be traced to its source account (provenance); a generic sketch of the idea follows this list.
  • Ongoing work includes voice and text moderation, fraud detection, and collaboration with universities and other companies on detection systems for AI-generated content.
  • As AI voice use spreads, verification mechanisms for both AI and human callers are expected to become more pervasive.
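
The provenance idea (every generated clip traceable to the account that created it) can be illustrated generically: sign a digest of the audio together with the account ID, and later verify the pair. This HMAC-based sketch only illustrates the concept and is not ElevenLabs' actual mechanism.

```python
# Generic illustration of audio provenance: bind generated audio to an
# account with an HMAC so clips can later be verified server-side.
# NOT ElevenLabs' actual scheme; placeholder key and account IDs.
import hashlib
import hmac

SERVER_SECRET = b"example-secret-kept-server-side"


def provenance_tag(audio: bytes, account_id: str) -> str:
    digest = hashlib.sha256(audio).hexdigest()
    message = f"{account_id}:{digest}".encode()
    return hmac.new(SERVER_SECRET, message, hashlib.sha256).hexdigest()


def verify(audio: bytes, account_id: str, tag: str) -> bool:
    return hmac.compare_digest(tag, provenance_tag(audio, account_id))


if __name__ == "__main__":
    clip = b"\x00\x01fake-audio-bytes"
    tag = provenance_tag(clip, "acct_123")
    print(verify(clip, "acct_123", tag))  # True
    print(verify(clip, "acct_999", tag))  # False: different account
```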

Building and Operating from Europe 46:52

  • Operating in Europe brought access to passionate, highly talented researchers, especially in Central and Eastern Europe.
  • European location supported a global, multilingual product vision from the start.
  • Challenges included less well-developed entrepreneurial networks and ecosystems compared to the US.
  • Regulatory uncertainty in the EU (e.g., AI Act) may slow innovation, though enthusiasm and adoption rates are improving.

Quickfire Round & Personal Insights 51:22

  • Mati’s favorite AI tools include Perplexity (for deep source understanding), Google Maps, and Lovable (for prototyping with ElevenLabs).
  • He admires Demis Hassabis for his research-to-leadership trajectory and impact on multiple domains.
  • Underhyped prediction: cross-lingual voice communication will profoundly reshape global society, but enabling devices and form factors are still evolving (e.g., headphones, glasses, or eventually neural interfaces).