[Full Workshop] Building Conversational AI Agents - Thor Schaeff, ElevenLabs

Introduction & Workshop Overview 00:00

  • The workshop focuses on building multilingual conversational AI agents using ElevenLabs tools
  • Attendees are invited to access the workshop slides via a QR code; the slides include additional resources and a form for free credits
  • The presenters encourage feedback on developer experience, documentation, and examples
  • ElevenLabs Devs Twitter account is recommended for developers interested in updates

Language Support and Fun Demonstrations 03:36

  • Participants are asked which languages they're interested in supporting; Portuguese, Spanish, Hungarian, Mandarin, and Hindi are mentioned
  • ElevenLabs is working on expanding to more languages with version 3 of its multilingual models
  • Joke demonstrations of a "text to bark" model (for dog applications) and a sound-effects model, with a playful mention of April Fools' Day
  • The platform includes various audio tools, such as a drum machine with AI-generated sound effects

Architecture of Conversational AI Agents 08:29

  • The conversational agent pipeline: speech-to-text (ASR) → large language model (LLM) → text-to-speech (TTS); a minimal code sketch of this flow follows the list
  • ElevenLabs does not provide its own LLM, but integrates with providers like OpenAI and Google Gemini
  • Text-based processing allows for better monitoring than speech-to-speech approaches, which is useful for debugging and scaling
  • Models are deployed close to each other to minimize latency
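
A minimal sketch of this pipeline in Python, assuming the `elevenlabs` and `openai` SDKs and run turn-by-turn rather than streamed as the production agent stack is; method names and parameters should be checked against the current docs:

```python
# Minimal ASR -> LLM -> TTS loop (turn-by-turn, not streaming).
# Assumes the `elevenlabs` and `openai` Python packages and API keys in env vars.
import os
from elevenlabs.client import ElevenLabs
from openai import OpenAI

eleven = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def agent_turn(audio_path: str, voice_id: str) -> bytes:
    # 1) Speech-to-text: transcribe the user's audio.
    with open(audio_path, "rb") as f:
        transcript = eleven.speech_to_text.convert(file=f, model_id="scribe_v1")

    # 2) LLM: generate the agent's reply from the transcript text.
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful voice agent."},
            {"role": "user", "content": transcript.text},
        ],
    ).choices[0].message.content

    # 3) Text-to-speech: synthesize the reply and return raw audio bytes.
    audio = eleven.text_to_speech.convert(
        voice_id=voice_id,
        text=reply,
        model_id="eleven_multilingual_v2",
    )
    return b"".join(audio)  # the SDK yields audio chunks
```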

Speech-to-Text Model Features & Demos 10:56

  • ElevenLabs' ASR model supports 99 languages and provides word-level timestamps, speaker diarization, and audio event tagging (see the sketch after this list)
  • Example demo: Telegram bot transcribing multilingual voice messages, including various accents (e.g., Singaporean English, Scottish, heavily accented English)
  • Model works well even with poor pronunciation and background noise
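
A hedged example of calling the ASR model with diarization and audio-event tagging enabled; parameter and field names follow my reading of the Python SDK and should be verified against the current docs:

```python
# Transcribe a voice message with speaker diarization and audio-event tags.
# Assumes the `elevenlabs` Python SDK; parameter names are to be verified.
import os
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

with open("voice_message.ogg", "rb") as audio:
    result = client.speech_to_text.convert(
        file=audio,
        model_id="scribe_v1",     # ElevenLabs' speech-to-text model
        diarize=True,             # label which speaker said each word
        tag_audio_events=True,    # mark laughter, applause, etc.
    )

print(result.text)                # full transcript
for word in result.words:         # word-level timestamps and speaker labels
    print(word.start, word.end, getattr(word, "speaker_id", None), word.text)
```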

Integration and Customization of Intelligence Layer 17:08

  • ElevenLabs partners with external LLM providers and supports plugging in custom fine-tuned models via OpenAI-compatible APIs (sketched below)
  • When the LLM streams a response, ElevenLabs streams speech output in real-time for low latency
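
Plugging in a custom fine-tuned model means exposing it behind an endpoint that speaks the OpenAI chat-completions protocol, including streamed responses so speech synthesis can start before the full reply is ready. A minimal sketch of such a server with FastAPI follows; the route and field names mirror the OpenAI API, while `fake_model_stream` is a placeholder for your own inference code:

```python
# Sketch of an OpenAI-compatible /v1/chat/completions endpoint that streams
# chunks as server-sent events, so downstream TTS can start speaking early.
# fake_model_stream() is a stand-in for your fine-tuned model's generation.
import json, time, uuid
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

def fake_model_stream(messages):
    # Placeholder: yield tokens from your own fine-tuned model here.
    for token in ["Hello", ",", " how", " can", " I", " help", "?"]:
        yield token

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()

    def sse():
        completion_id = f"chatcmpl-{uuid.uuid4().hex}"
        for token in fake_model_stream(body["messages"]):
            chunk = {
                "id": completion_id,
                "object": "chat.completion.chunk",
                "created": int(time.time()),
                "model": body.get("model", "my-fine-tune"),
                "choices": [{"index": 0, "delta": {"content": token},
                             "finish_reason": None}],
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")
```

The agent's custom-LLM setting would then point at this server's base URL.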

Text-to-Speech Voice Library & Marketplace 18:05

  • Platform offers over 5,000 available voices, covering many languages and accents, with filters for language, accent, gender, and age
  • Users can publish their own cloned voices to the library and receive royalties when others use them (over $5 million paid out)
  • Example given of filtering and selecting a Brazilian Portuguese accent
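
The same filtering can be done programmatically; a hedged sketch against the shared-voices endpoint, with query-parameter names that should be checked against the API reference:

```python
# Search the voice library for Portuguese voices with a Brazilian accent.
# Endpoint path and query parameter names are assumptions; verify in the docs.
import os, requests

resp = requests.get(
    "https://api.elevenlabs.io/v1/shared-voices",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    params={"language": "pt", "accent": "brazilian", "page_size": 10},
)
resp.raise_for_status()

for voice in resp.json()["voices"]:
    print(voice["voice_id"], voice["name"], voice.get("accent"))
```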

Building and Configuring Multilingual Agents 20:34

  • Agents can be built and configured directly in the ElevenLabs dashboard, with language selection, LLM choice, and knowledge base integration (an API-based sketch follows this list)
  • Agents currently support 31 languages, with expansion to 99 planned
  • Includes system tools for language detection, enabling agents to switch languages automatically or on request
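
A hedged sketch of the same configuration done via the agent-creation API; the payload nesting and field names are approximations of the Conversational AI API and should be checked against the docs (knowledge-base and system-tool fields are omitted here):

```python
# Create an agent with a system prompt, LLM choice, and default language.
# The payload structure is an approximation; verify field names in the docs.
import os, requests

payload = {
    "name": "Multilingual support agent",
    "conversation_config": {
        "agent": {
            "language": "en",                      # default language
            "prompt": {
                "prompt": "You are a friendly support agent for Acme.",
                "llm": "gemini-2.0-flash",         # lightweight LLM choice
            },
            "first_message": "Hi! How can I help you today?",
        },
    },
}

resp = requests.post(
    "https://api.elevenlabs.io/v1/convai/agents/create",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["agent_id"])
```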

Live Demo: Language Switching Agent 23:35

  • Demo of an agent configured for Singapore's four official languages, able to switch between them and answer in each
  • Two main modes for language detection: automatic switching based on detected speech or explicit language change on user request

Developer Resources and Deployment Options 26:12

  • Examples and documentation are available for integrating ElevenLabs agents into Next.js, Python, and other environments, including hardware (a Python sketch follows this list)
  • Full agent configuration is possible via API, useful for marketplaces or custom integrations
  • An MCP server allows natural language setup of agents
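
On the client side, a configured agent can be started from Python in a few lines; this follows the shape of the SDK quickstart, with class names and arguments to be verified against the current `elevenlabs` package:

```python
# Start a voice session with an existing agent using the local microphone
# and speakers. Class names follow the SDK quickstart; verify against docs.
import os
from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

conversation = Conversation(
    client,
    os.environ["ELEVENLABS_AGENT_ID"],
    requires_auth=True,                       # private agents need the API key
    audio_interface=DefaultAudioInterface(),  # mic in, speakers out
    callback_agent_response=lambda text: print(f"Agent: {text}"),
    callback_user_transcript=lambda text: print(f"User:  {text}"),
)

conversation.start_session()
conversation.wait_for_session_end()
```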

Q&A: Technical Implementation 28:47

  • Language switching relies on ASR confidence scores and user-configured voices per language; accents can be matched to locale (e.g., Chennai Tamil)
  • System tools automatically route the conversation to the correct voice and language based on detected speech

Q&A: Actions, Tools, and Function Calling 31:00

  • Agents support server-side tools through standard LLM function-calling (e.g., checking appointments via webhooks); a minimal webhook handler is sketched after this list
  • Integration with third-party APIs (e.g., cal.com) enables scheduling and other complex tasks from within conversations
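
The server side of such a tool is a webhook the platform calls when the LLM decides to use it. A minimal sketch of an appointment-lookup handler follows; the request body (a single `date` field) is illustrative and would match whatever parameters you declare for the tool:

```python
# Webhook backing a server-side tool, e.g. "check_appointments".
# The agent calls this URL with the arguments the LLM filled in; the JSON
# body shown here (a single "date" field) is illustrative, not prescribed.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

FAKE_CALENDAR = {"2025-06-12": ["09:00", "11:30"], "2025-06-13": []}

class AppointmentQuery(BaseModel):
    date: str  # ISO date the caller asked about

@app.post("/tools/check-appointments")
def check_appointments(query: AppointmentQuery):
    slots = FAKE_CALENDAR.get(query.date, [])
    # Return plain JSON; the agent turns the result into a spoken answer.
    return {"date": query.date, "available_slots": slots}
```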

Q&A: Low Latency and Model Selection 33:01

  • Model choice affects latency and price; more powerful models (like GPT-4) are slower
  • For lowest latency, use lightweight LLMs (e.g., Gemini Flash) and ElevenLabs’ speech “flash” models
  • Best model depends on use case and should be tested for performance vs. latency

Q&A: Pricing and Long Interactions 35:24

  • Pricing is per call minute, with included minutes per subscription tier and overage charges
  • Long-session or companion-type applications may be costly; custom pricing may be available upon request

Q&A: Multi-Tasking & Multi-Agent Orchestration 37:16

  • Agents can be organized for specific tasks with agent-to-agent transfers handled as system tools
  • Different LLMs can be used for different agents/tasks; user experience can remain seamless if the voice stays the same

Q&A: Enterprise Latency & Conversation Management 39:03

  • For high-latency enterprise use cases (e.g., slow database lookups), “filler” phrases are built-in so agents can inform users of delays
  • Tools can be configured with timeouts; the agent can keep the conversation natural during slow operations
  • Asynchronous responses may be injectable via sockets, but further investigation may be needed for branching conversations

Q&A: Handling Mixed-Language Inputs 43:58

  • Mixing two languages generally works, but recognizing three or more in the same input reduces accuracy
  • Language-learning or mixed-language apps may need specific prompt engineering, or may be better served by speech-to-speech models
  • There is some support, but performance degrades as complexity increases

Q&A: Fraud, Safety & Moderation 49:00

  • ElevenLabs implements safety features such as live moderation, custom “do not say” lists for voice models, and watermarking of all generated speech
  • Can trace generated speech back to accounts, support reporting and moderation
  • Voice cloning requires user verification by reading out a randomly generated sentence

Q&A: Avatars and Downstream Integration 55:07

  • ElevenLabs currently focuses on voice but partners with avatar solution providers
  • Specific integration experiences with Nvidia’s avatar stack were discussed; there are no immediate plans to expand further down the stack, but ElevenLabs is open to working with partners

Q&A: Custom Vocabulary and Pronunciation 57:47

  • Pronunciation dictionaries can be uploaded to control how generated speech pronounces acronyms or custom terms
  • The speech-to-text model lacks fine-tuning for acronyms, but LLM prompt engineering may help normalize the transcript
  • Example issue: "SAP" being pronounced as a word instead of an acronym can be fixed for text-to-speech, but not fully for ASR yet
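
On the text-to-speech side, the fix is a pronunciation dictionary entry. A sketch of a minimal PLS lexicon that forces "SAP" to be spelled out, uploaded via what I believe is the add-from-file endpoint (path and form fields should be verified against the API reference):

```python
# Upload a pronunciation dictionary that aliases "SAP" to its spelled-out
# form so TTS reads it as an acronym. The endpoint path is an assumption.
import os, requests

pls = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>SAP</grapheme>
    <alias>S. A. P.</alias>
  </lexeme>
</lexicon>
"""

resp = requests.post(
    "https://api.elevenlabs.io/v1/pronunciation-dictionaries/add-from-file",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    data={"name": "acronyms"},
    files={"file": ("acronyms.pls", pls, "text/xml")},
)
resp.raise_for_status()
print(resp.json())
```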

Workshop Wrap-Up 61:10

  • Attendees invited to connect, access credits, and visit the expo booth for further questions or demonstrations
  • The workshop concludes with thanks to the audience