Full Workshop: Realtime Voice AI — Mark Backman, Daily

Introduction and Workshop Overview 00:03

  • Mark from Daily introduces himself and the team, outlining a hands-on workshop focused on building real-time voice bots.
  • Participants are urged to check their Wi-Fi connectivity, as real-time audio streaming requires a strong connection.
  • The goal is to build a voice bot within the session using Pipecat, an open-source Python framework for voice and multimodal AI agents built by Daily.

Understanding Real-Time Voice AI and Pipecat 01:42

  • Voice AI is hard because human expectations for conversation are shaped by evolution and lifelong communication habits.
  • Users expect strong listening, conversational ability, natural-sounding voice output, and low latency (ideally under 800 ms end-to-end).
  • Pipecat enables a modular, orchestration-based approach—components like speech-to-text, LLM, and text-to-speech can be easily swapped or run in parallel.

The Pipecat Pipeline Architecture 03:35

  • Pipecat's architecture uses a multimedia pipeline: input (audio/video), processing (transcription, LLM inference, TTS), and output.
  • Users can choose different providers (e.g., Google, OpenAI, Deepgram) for each stage.
  • Pipecat supports parallel pipelines for redundancy or specialized processing paths (e.g., failover between vendors within a session).
  • Newer speech-to-speech models, like Gemini Live, combine transcription, LLM inference, and TTS into a single step, simplifying pipelines (a sketch of the cascaded setup follows this list).
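
To make the cascaded architecture concrete, here is a minimal sketch of such a pipeline. Deepgram, OpenAI, and Cartesia are stand-in vendor choices, and exact module paths and constructor arguments vary across Pipecat releases:

```python
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

# Transport: moves audio in and out of a Daily room over WebRTC.
transport = DailyTransport(
    os.getenv("DAILY_ROOM_URL"),
    None,                       # meeting token (not needed for open rooms)
    "workshop-bot",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # local turn detection
    ),
)

# Swappable stages: any STT/LLM/TTS service with the same interface works.
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o")
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id=os.getenv("CARTESIA_VOICE_ID"),
)

# Context aggregation keeps the running conversation for the LLM.
context = OpenAILLMContext(
    [{"role": "system", "content": "You are a friendly voice assistant."}]
)
context_aggregator = llm.create_context_aggregator(context)

# Frames flow top to bottom: audio in -> text -> completion -> audio out.
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant(),
])
```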

Building with Pipecat: Code Walkthrough 10:30

  • The workshop works from a public GitHub repo (daily-co/gemini-pipecat-workshop).
  • The main bot code encapsulates the workflow in an async HTTP session.
  • Daily provides the audio transport layer; context aggregation collects conversational history for LLM input.
  • Gemini Live is used as the primary LLM; tools can be attached for function calls (e.g., weather, restaurants).
  • Pipecat uses a universal function schema enabling switching between LLM providers without code changes.
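
A hedged sketch of what this looks like with Gemini Live and a hypothetical get_weather tool; the FunctionSchema/ToolsSchema names follow recent Pipecat releases, and the function-handler signature has changed over time:

```python
import os

from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
)

# One provider-neutral schema; Pipecat adapts it to whatever format the
# active LLM vendor expects, so switching providers needs no rewrites.
weather_fn = FunctionSchema(
    name="get_weather",
    description="Look up the current weather for a city.",
    properties={
        "city": {"type": "string", "description": "City name, e.g. 'Austin'"},
    },
    required=["city"],
)
tools = ToolsSchema(standard_tools=[weather_fn])

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    system_instruction="You are a helpful voice assistant.",
    tools=tools,
)

# Handler signatures have changed across Pipecat releases; this follows
# the FunctionCallParams style used in recent versions.
async def get_weather(params):
    city = params.arguments["city"]
    # A real bot would call a weather API here; a canned result keeps
    # the sketch short.
    await params.result_callback({"city": city, "conditions": "sunny"})

llm.register_function("get_weather", get_weather)
```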

Modularity, Transport Options, and Production Considerations 16:25

  • WebRTC is recommended for client-server audio transport because of its quality and error correction; WebSockets suit server-to-server connections and phone bots.
  • Pipecat connects to phone carriers (e.g., Twilio) via WebSockets, and also supports PSTN and SIP, giving flexible telephony options.
  • The framework is modular: developers can "plug and play" speech-to-text, LLM, and TTS services.
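
Because services share a common interface, switching vendors amounts to constructing a different service for the same pipeline slot. A minimal sketch, assuming a hypothetical TTS_VENDOR environment switch (both services exist in Pipecat; module paths vary by release):

```python
import os

from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.elevenlabs.tts import ElevenLabsTTSService

# Only the constructor changes; the pipeline slot stays the same.
if os.getenv("TTS_VENDOR") == "elevenlabs":
    tts = ElevenLabsTTSService(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
        voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
    )
else:
    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id=os.getenv("CARTESIA_VOICE_ID"),
    )

# Pipeline([..., llm, tts, transport.output(), ...]) is otherwise identical.
```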

Handling Voice Activity Detection (VAD) and Latency 17:41

  • VAD is critical for detecting when a user starts/stops speaking, controlling turn-taking in conversation.
  • The recommended open-source VAD is Silero, which runs locally with low CPU usage (see the sketch after this list).
  • TTS tokens are the highest-cost component of bot processing.
  • Running key components locally (e.g., STT, VAD, TTS) can reduce latency and reliance on external APIs.
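
A minimal sketch of wiring Silero VAD into the Daily transport, assuming recent Pipecat module paths; the 0.8 s stop time is an illustrative value, not a recommendation:

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams

# Silero runs locally, so turn detection adds no network round trip.
# stop_secs is the silence window before the bot treats the user's turn
# as finished: lower feels snappier, higher tolerates mid-sentence pauses.
vad = SileroVADAnalyzer(params=VADParams(stop_secs=0.8))

params = DailyParams(
    audio_in_enabled=True,
    audio_out_enabled=True,
    vad_analyzer=vad,
)
```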

Guardrails, Prompting, and Context Management 24:24

  • Pipecat does not enforce LLM behavioral guardrails; developers must manage prompts and context themselves.
  • Flexible pipelines allow intermediary checks or real-time evaluation of LLM outputs (see the sketch after this list).
  • For task-oriented bots, chunking tasks and tightly managing the context window enhances reliability.
  • Summarization via out-of-band LLM calls can help manage long conversations and token limits.
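
One way to add an intermediary check is a custom frame processor between the LLM and TTS stages. The sketch below redacts a hypothetical blocked phrase; since streamed LLM text arrives in small chunks, a production version would buffer across frames before matching:

```python
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

BLOCKED = {"competitor pricing"}  # hypothetical phrases to redact

class OutputGuard(FrameProcessor):
    """Inspects LLM text on its way to TTS and redacts blocked phrases."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TextFrame):
            for phrase in BLOCKED:
                if phrase in frame.text:
                    frame.text = frame.text.replace(phrase, "[redacted]")
        await self.push_frame(frame, direction)

# Dropped between the LLM and TTS stages:
#   Pipeline([..., llm, OutputGuard(), tts, ...])
```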

Handling Large Contexts, State, and Accuracy 31:56

  • Large context windows slow down LLMs; chunking and task segmentation improve speed and accuracy.
  • For real-time bots, streaming audio/text enables faster responses (e.g., Gemini Live returns its first token in under 500 ms).
  • Tool (function) calls require complete JSON responses, causing higher latency compared to streamed responses.
  • Extensive context can increase confusion for the LLM, requiring careful structuring of inputs.

Practical Setup: Live Coding Demo 40:19

  • Walkthrough demonstrates setting up a Python virtual environment, installing project requirements, and configuring environment variables for API keys.
  • Step-by-step coding shows creating transports, VAD, LLM service, system instructions, and the Pipecat pipeline.
  • Simple configuration lets the bot join a Daily room and interact in real time (a minimal run sketch follows this list).
  • Alongside the server-side Python framework, client SDKs are available for Android, iOS, JavaScript, and React.
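
The run step itself is brief. A minimal sketch, assuming `pipeline` is the pipeline assembled earlier and API keys live in a `.env` file (python-dotenv is an assumption, not a stated repo requirement):

```python
import asyncio

from dotenv import load_dotenv
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

load_dotenv()  # read API keys from a local .env file

async def main():
    # `pipeline` is the transport/STT/LLM/TTS pipeline assembled earlier.
    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    runner = PipelineRunner()
    await runner.run(task)  # the bot joins the Daily room and converses

if __name__ == "__main__":
    asyncio.run(main())
```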

Noise Cancellation, Synchronization, and Client Integration 38:07

  • Noise cancellation for challenging environments can be handled by third parties (e.g., Krisp) at the transport input layer.
  • Pipecat provides synchronization between TTS audio output and corresponding text (word-level alignment), supporting rich client experiences.
  • Client SDKs are available for common platforms, supporting WebRTC and other transports.

Evaluation, Testing, and Use Cases 57:31

  • Automated end-to-end testing can be set up by having two bots (e.g., eval bot and conversational bot) interact and validate results.
  • Pipecat is used in production by large companies and handles hundreds of thousands of calls per day.
  • Sample projects include interactive games (e.g., Word Wrangler) and phone-based bots, showcasing complex use cases with multiple AI agents in parallel pipelines.

Open-Source, Community, and Future Directions 65:06

  • Fully offline bots are possible with local models, but state-of-the-art quality still requires cloud or on-prem LLM/TTS/STT services.
  • Input transcription (STT) remains a fundamental challenge for accuracy.
  • Ongoing work addresses conversation flow, semantic turn-ending detection, and natural interruption handling.
  • Developers can join the project's Discord for support and community involvement.
  • Example demo project: Word Wrangler game using parallel Gemini agents for interactive gameplay.
  • Workshop concludes with invitation for further questions and encourages exploration of sample projects.