[Full Workshop] Building Conversational AI Agents - Thor Schaeff, ElevenLabs
Introduction & Workshop Overview 00:00
The workshop focuses on building multilingual conversational AI agents using ElevenLabs tools
Attendees are invited to access the workshop slides via a QR code; the slides include additional resources and a form for free credits
The presenters encourage feedback on developer experience, documentation, and examples
The ElevenLabs Devs Twitter account is recommended for developers interested in updates
Language Support and Fun Demonstrations 03:36
Participants are asked which languages they're interested in supporting; Portuguese, Spanish, Hungarian, Mandarin, and Hindi are mentioned
ElevenLabs is working on expanding to more languages with version 3 of its multilingual models
Joke demonstration of "text to bark" (for dog applications) and of the sound effects model, with a playful mention of April Fools' Day
The platform includes various audio tools, such as a drum machine with AI-generated sound effects
Architecture of Conversational AI Agents 08:29
The conversational agent pipeline: speech-to-text (ASR) → large language model (LLM) → text-to-speech
ElevenLabs does not provide its own LLM, but integrates with providers like OpenAI and Google Gemini
Passing text between stages allows better monitoring than speech-to-speech approaches, which is useful for debugging and scaling
Models are deployed close to each other to minimize latency
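A minimal sketch of that three-stage pipeline as separate REST/SDK calls; endpoint paths, field names, and model IDs here are assumptions based on the public ElevenLabs and OpenAI APIs and should be checked against current docs (the agents product wires these stages together with streaming rather than discrete calls):
```python
"""Sketch: ASR -> LLM -> TTS, one request per stage (non-streaming)."""
import os
import requests
from openai import OpenAI

XI_KEY = os.environ["ELEVENLABS_API_KEY"]
VOICE_ID = "your-voice-id"  # hypothetical placeholder


def transcribe(audio_path: str) -> str:
    # Speech-to-text: send the recording to the ASR model.
    with open(audio_path, "rb") as f:
        resp = requests.post(
            "https://api.elevenlabs.io/v1/speech-to-text",
            headers={"xi-api-key": XI_KEY},
            data={"model_id": "scribe_v1"},
            files={"file": f},
        )
    resp.raise_for_status()
    return resp.json()["text"]


def think(user_text: str) -> str:
    # Intelligence layer: any OpenAI-compatible chat model works here.
    llm = OpenAI()
    chat = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_text}],
    )
    return chat.choices[0].message.content


def speak(text: str) -> bytes:
    # Text-to-speech: returns raw audio bytes for playback.
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": XI_KEY},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
    )
    resp.raise_for_status()
    return resp.content


audio_out = speak(think(transcribe("question.wav")))
```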
Speech-to-Text Model Features & Demos 10:56
ElevenLabs' ASR model supports 99 languages, provides word-level timestamps, speaker diarization, and audio event tagging
Example demo: Telegram bot transcribing multilingual voice messages, including various accents (e.g., Singaporean English, Scottish, heavily accented English)
Model works well even with poor pronunciation and background noise
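A short sketch of the ASR features mentioned above (word-level timestamps, diarization, audio-event tagging); the "diarize" / "tag_audio_events" parameters and the response fields are assumptions drawn from the public speech-to-text API and may differ from current docs:
```python
import os
import requests

with open("voice_message.ogg", "rb") as f:
    resp = requests.post(
        "https://api.elevenlabs.io/v1/speech-to-text",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        data={
            "model_id": "scribe_v1",
            "diarize": "true",           # label each word with a speaker
            "tag_audio_events": "true",  # mark laughter, music, noise, ...
        },
        files={"file": f},
    )
resp.raise_for_status()

for word in resp.json().get("words", []):
    # Each entry carries its own start time and speaker label.
    print(f"[{word['start']:6.2f}s] {word.get('speaker_id', '?')}: {word['text']}")
```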
Integration and Customization of Intelligence Layer 17:08
ElevenLabs partners with external LLM providers and supports plugging in custom fine-tuned models via OpenAI-compatible APIs
When the LLM streams a response, ElevenLabs streams speech output in real-time for low latency
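A sketch of what "plugging in a custom model" can look like: a minimal OpenAI-compatible /v1/chat/completions endpoint that streams tokens, which the agent platform can then point at instead of a hosted LLM. The chunk fields follow the OpenAI streaming format; the stand-in model and wire-level details are illustrative assumptions:
```python
import json
import time

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()


def my_model_stream(messages):
    # Stand-in for your own fine-tuned model; yields text chunks.
    for token in ["Hello", " from", " a", " custom", " model."]:
        yield token


@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()

    def sse():
        for token in my_model_stream(body["messages"]):
            chunk = {
                "id": "chatcmpl-demo",
                "object": "chat.completion.chunk",
                "created": int(time.time()),
                "model": body.get("model", "custom"),
                "choices": [{"index": 0,
                             "delta": {"content": token},
                             "finish_reason": None}],
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")
```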
Text-to-Speech Voice Library & Marketplace 18:05
The platform offers over 5,000 voices covering many languages and accents, with filters for language, accent, gender, and age
Users can publish their own cloned voices to the library and receive royalties when others use them (over $5 million paid out)
Example given of filtering and selecting a Brazilian Portuguese accent
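A sketch of doing the same filtering programmatically rather than in the dashboard; the /v1/shared-voices endpoint and its query-parameter names are assumptions to confirm against the current API reference:
```python
import os
import requests

resp = requests.get(
    "https://api.elevenlabs.io/v1/shared-voices",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    params={"language": "pt", "accent": "brazilian", "gender": "female"},
)
resp.raise_for_status()
for voice in resp.json().get("voices", []):
    print(voice.get("name"), voice.get("voice_id"))
```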
Building and Configuring Multilingual Agents 20:34
Agents can be built and configured directly in the ElevenLabs dashboard, with language selection, LLM choice, and knowledge base integration
Agents currently support 31 languages, with expansion to 99 planned
Includes system tools for language detection, enabling agents to switch languages automatically or on request
Live Demo: Language Switching Agent 23:35
Demo of an agent configured for Singapore's four official languages, able to switch between them and answer in each
Two main modes for language detection: automatic switching based on detected speech or explicit language change on user request
Developer Resources and Deployment Options 26:12
Examples and documentation available for integrating ElevenLabs agents into Next.js, Python, and other environments (including hardware)
Full agent configuration is possible via API, useful for marketplaces or custom integrations
An MCP (Model Context Protocol) server allows agents to be set up using natural language
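A sketch of creating and configuring an agent entirely over the API, as mentioned above for marketplaces and custom integrations; the endpoint path and the configuration field names are assumptions and should be checked against the Conversational AI API reference:
```python
import os
import requests

resp = requests.post(
    "https://api.elevenlabs.io/v1/convai/agents/create",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={
        "name": "Multilingual support agent",
        "conversation_config": {
            "agent": {
                "language": "en",                   # default language
                "first_message": "Hi! How can I help?",
                "prompt": {
                    "prompt": "You are a helpful, concise support agent.",
                    "llm": "gemini-2.0-flash",      # lightweight LLM for low latency
                },
            },
        },
    },
)
resp.raise_for_status()
print("agent_id:", resp.json().get("agent_id"))
```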
Q&A: Technical Implementation 28:47
Language switching relies on ASR confidence scores and user-configured voices per language; accents can be matched to locale (e.g., Chennai Tamil)
System tools automatically route the conversation to the correct voice and language based on detected speech
Q&A: Actions, Tools, and Function Calling 31:00
Agents support server-side tools through standard LLM function-calling (e.g., checking appointments via webhooks)
Integration with third-party APIs (e.g., cal.com) enables scheduling and other complex tasks from within conversations
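A sketch of the receiving end of such a server-side tool: the LLM decides to call a "check_availability" tool and the platform POSTs the tool's arguments to a webhook like this. The payload shape depends entirely on how the tool's parameters are defined in the agent config, so the fields below are illustrative:
```python
from datetime import date

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class AvailabilityQuery(BaseModel):
    day: date                  # e.g. "2025-06-12", as defined in the tool schema
    duration_minutes: int = 30


@app.post("/tools/check_availability")
def check_availability(query: AvailabilityQuery):
    # In a real integration this is where a scheduling API such as cal.com
    # would be queried; static slots are returned here for illustration.
    slots = ["09:00", "11:30", "15:00"]
    return {"day": query.day.isoformat(), "available_slots": slots}
```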
Q&A: Low Latency and Model Selection 33:01
Model choice affects latency and price; more powerful models (like GPT-4) are slower
For lowest latency, use lightweight LLMs (e.g., Gemini Flash) and ElevenLabs’ speech “flash” models
Best model depends on use case and should be tested for performance vs. latency
Q&A: Pricing and Long Interactions 35:24
Pricing is per call minute, with included minutes per subscription tier and overage charges
Long-session or companion-type applications may be costly; custom pricing may be available upon request
Q&A: Multi-Tasking & Multi-Agent Orchestration 37:16
Agents can be organized for specific tasks with agent-to-agent transfers handled as system tools
Different LLMs can be used for different agents/tasks; user experience can remain seamless if the voice stays the same
Q&A: Enterprise Latency & Conversation Management 39:03
For high-latency enterprise use cases (e.g., slow database lookups), “filler” phrases are built-in so agents can inform users of delays
Tools can be configured with timeouts; the agent can keep the conversation natural during slow operations
Asynchronous responses may be injectable via sockets, but further investigation may be needed for branching conversations
Q&A: Handling Mixed-Language Inputs 43:58
Mixing two languages generally works, but recognizing three or more in the same input reduces accuracy
Language-learning or mixed-language apps may need specific prompt engineering or may want to consider speech-to-speech models
There is some support, but performance degrades as complexity increases
Q&A: Fraud, Safety & Moderation 49:00
ElevenLabs implements safety features such as live moderation, custom “do not say” lists for voice models, and watermarking of all generated speech
Generated speech can be traced back to accounts, which supports reporting and moderation
Voice cloning requires user verification with random sentence reading
Q&A: Avatars and Downstream Integration 55:07
ElevenLabs currently focuses on voice but partners with avatar solution providers
Specific integration experiences with NVIDIA's avatar stack were discussed; there are no immediate plans to expand further down the stack, but ElevenLabs is open to working with partners
Q&A: Custom Vocabulary and Pronunciation 57:47
Pronunciation dictionaries can be uploaded to control how generated speech pronounces acronyms or custom terms
The speech-to-text model lacks fine-tuning for acronyms, but LLM prompt engineering may help normalize the transcript
Example issue: "SAP" pronounced as a word instead of an acronym can be fixed for text-to-speech, but not fully for ASR yet
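A sketch of fixing the "SAP" case on the text-to-speech side with a pronunciation dictionary. The dictionary uses the W3C PLS format with an alias entry; the upload endpoint path and form fields are assumptions to verify against the API docs:
```python
import os
import requests

PLS = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>SAP</grapheme>
    <!-- spell out the acronym instead of reading it as the word "sap" -->
    <alias>S A P</alias>
  </lexeme>
</lexicon>
"""

resp = requests.post(
    "https://api.elevenlabs.io/v1/pronunciation-dictionaries/add-from-file",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    data={"name": "acronyms"},
    files={"file": ("acronyms.pls", PLS, "application/xml")},
)
resp.raise_for_status()
print(resp.json())
```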
Workshop Wrap-Up 61:10
Attendees invited to connect, access credits, and visit the expo booth for further questions or demonstrations
The workshop concludes with thanks to the audience