Full Workshop: Realtime Voice AI — Mark Backman, Daily

Introduction and Workshop Overview 00:03

  • Mark from Daily introduces himself and the team, outlining a hands-on workshop focused on building real-time voice bots.
  • Participants are urged to check their Wi-Fi connectivity, as real-time audio streaming requires a strong connection.
  • The goal is to build a voice bot within the session using Pipecat, an open-source Python framework for voice and multimodal AI agents built by Daily.

Understanding Real-Time Voice AI and Pipecat 01:42

  • Voice AI is hard because human expectations for conversation are shaped by evolution and lifelong communication habits.
  • Users expect strong listening, conversational ability, natural-sounding voice output, and low latency (ideally under 800 ms end-to-end).
  • Pipecat enables a modular, orchestration-based approach—components like speech-to-text, LLM, and text-to-speech can be easily swapped or run in parallel.

The Pipecat Pipeline Architecture 03:35

  • Pipecat's architecture uses a multimedia pipeline: input (audio/video), processing (transcription, LLM inference, TTS), and output.
  • Users can choose different providers (e.g., Google, OpenAI, Deepgram) for each stage.
  • Pipecat supports parallel pipelines for redundancy or specialized processing paths (e.g., failover between vendors within a session).
  • Newer speech-to-speech models, like Gemini Live, combine transcription, LLM inference, and TTS into a single step, simplifying pipelines (a sketch of the cascaded setup follows this list).
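
To make the cascaded architecture concrete, here is a minimal sketch of such a pipeline. Deepgram, OpenAI, and Cartesia are stand-in vendor choices, and exact module paths and constructor arguments vary across Pipecat releases:

```python
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

# Transport: moves audio in and out of a Daily room over WebRTC.
transport = DailyTransport(
    os.getenv("DAILY_ROOM_URL"),
    None,                       # meeting token (not needed for open rooms)
    "workshop-bot",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # local turn detection
    ),
)

# Swappable stages: any STT/LLM/TTS service with the same interface works.
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o")
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id=os.getenv("CARTESIA_VOICE_ID"),
)

# Context aggregation keeps the running conversation for the LLM.
context = OpenAILLMContext(
    [{"role": "system", "content": "You are a friendly voice assistant."}]
)
context_aggregator = llm.create_context_aggregator(context)

# Frames flow top to bottom: audio in -> text -> completion -> audio out.
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant(),
])
```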

Building with Pipecat: Code Walkthrough 10:30

  • The workshop works from a public GitHub repo (daily-co/gemini-pipecat-workshop).
  • The main bot code encapsulates the workflow in an async HTTP session.
  • Daily provides the audio transport layer; context aggregation collects conversational history for LLM input.
  • Gemini Live is used as the primary LLM; tools can be attached for function calls (e.g., weather, restaurants).
  • Pipecat uses a universal function schema enabling switching between LLM providers without code changes.
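
A hedged sketch of what this looks like with Gemini Live and a hypothetical get_weather tool; the FunctionSchema/ToolsSchema names follow recent Pipecat releases, and the function-handler signature has changed over time:

```python
import os

from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
)

# One provider-neutral schema; Pipecat adapts it to whatever format the
# active LLM vendor expects, so switching providers needs no rewrites.
weather_fn = FunctionSchema(
    name="get_weather",
    description="Look up the current weather for a city.",
    properties={
        "city": {"type": "string", "description": "City name, e.g. 'Austin'"},
    },
    required=["city"],
)
tools = ToolsSchema(standard_tools=[weather_fn])

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    system_instruction="You are a helpful voice assistant.",
    tools=tools,
)

# Handler signatures have changed across Pipecat releases; this follows
# the FunctionCallParams style used in recent versions.
async def get_weather(params):
    city = params.arguments["city"]
    # A real bot would call a weather API here; a canned result keeps
    # the sketch short.
    await params.result_callback({"city": city, "conditions": "sunny"})

llm.register_function("get_weather", get_weather)
```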

Modularity, Transport Options, and Production Considerations 16:25

  • WebRTC is recommended for client-server audio transport because of its quality and error correction; WebSockets suit server-to-server connections and phone bots.
  • Pipecat connects to phone carriers (e.g., Twilio) via WebSockets, and also supports PSTN and SIP, giving flexible telephony options.
  • The framework is modular: developers can "plug and play" speech-to-text, LLM, and TTS services.
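
Because services share a common interface, switching vendors amounts to constructing a different service for the same pipeline slot. A minimal sketch, assuming a hypothetical TTS_VENDOR environment switch (both services exist in Pipecat; module paths vary by release):

```python
import os

from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.elevenlabs.tts import ElevenLabsTTSService

# Only the constructor changes; the pipeline slot stays the same.
if os.getenv("TTS_VENDOR") == "elevenlabs":
    tts = ElevenLabsTTSService(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
        voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
    )
else:
    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id=os.getenv("CARTESIA_VOICE_ID"),
    )

# Pipeline([..., llm, tts, transport.output(), ...]) is otherwise identical.
```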

Handling Voice Activity Detection (VAD) and Latency 17:41

  • VAD is critical for detecting when a user starts/stops speaking, controlling turn-taking in conversation.
  • The recommended open-source VAD is Silero, which runs locally with low CPU usage (see the sketch after this list).
  • TTS tokens are the highest-cost component of bot processing.
  • Running key components locally (e.g., STT, VAD, TTS) can reduce latency and reliance on external APIs.
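
A minimal sketch of wiring Silero VAD into the Daily transport, assuming recent Pipecat module paths; the 0.8 s stop time is an illustrative value, not a recommendation:

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams

# Silero runs locally, so turn detection adds no network round trip.
# stop_secs is the silence window before the bot treats the user's turn
# as finished: lower feels snappier, higher tolerates mid-sentence pauses.
vad = SileroVADAnalyzer(params=VADParams(stop_secs=0.8))

params = DailyParams(
    audio_in_enabled=True,
    audio_out_enabled=True,
    vad_analyzer=vad,
)
```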

Guardrails, Prompting, and Context Management 24:24

  • Pipecat does not enforce LLM behavioral guardrails; developers must manage prompts and context themselves.
  • Flexible pipelines allow intermediary checks or real-time evaluation of LLM outputs (see the sketch after this list).
  • For task-oriented bots, chunking tasks and tightly managing the context window enhances reliability.
  • Summarization via out-of-band LLM calls can help manage long conversations and token limits.
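
One way to add an intermediary check is a custom frame processor between the LLM and TTS stages. The sketch below redacts a hypothetical blocked phrase; since streamed LLM text arrives in small chunks, a production version would buffer across frames before matching:

```python
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

BLOCKED = {"competitor pricing"}  # hypothetical phrases to redact

class OutputGuard(FrameProcessor):
    """Inspects LLM text on its way to TTS and redacts blocked phrases."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TextFrame):
            for phrase in BLOCKED:
                if phrase in frame.text:
                    frame.text = frame.text.replace(phrase, "[redacted]")
        await self.push_frame(frame, direction)

# Dropped between the LLM and TTS stages:
#   Pipeline([..., llm, OutputGuard(), tts, ...])
```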

Handling Large Contexts, State, and Accuracy 31:56

  • Large context windows slow down LLMs; chunking and task segmentation improve speed and accuracy.
  • For real-time bots, streaming audio/text enables faster responses (e.g., Gemini Live returns its first token in under 500 ms).
  • Tool (function) calls require complete JSON responses, causing higher latency compared to streamed responses.
  • Extensive context can increase confusion for the LLM, requiring careful structuring of inputs.

Practical Setup: Live Coding Demo 40:19

  • Walkthrough demonstrates setting up a Python virtual environment, installing project requirements, and configuring environment variables for API keys.
  • Step-by-step coding shows creating transports, VAD, LLM service, system instructions, and the Pipecat pipeline.
  • Simple configuration lets the bot join a Daily room and interact in real time (a minimal run sketch follows this list).
  • Alongside the server-side Python framework, client SDKs are available for Android, iOS, JavaScript, and React.
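
The run step itself is brief. A minimal sketch, assuming `pipeline` is the pipeline assembled earlier and API keys live in a `.env` file (python-dotenv is an assumption, not a stated repo requirement):

```python
import asyncio

from dotenv import load_dotenv
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

load_dotenv()  # read API keys from a local .env file

async def main():
    # `pipeline` is the transport/STT/LLM/TTS pipeline assembled earlier.
    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    runner = PipelineRunner()
    await runner.run(task)  # the bot joins the Daily room and converses

if __name__ == "__main__":
    asyncio.run(main())
```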

Noise Cancellation, Synchronization, and Client Integration 38:07

  • Noise cancellation for challenging environments can be handled by third parties (e.g., Krisp) at the transport input layer.
  • Pipecat provides synchronization between TTS audio output and corresponding text (word-level alignment), supporting rich client experiences.
  • Client SDKs are available for common platforms, supporting WebRTC and other transports.

Evaluation, Testing, and Use Cases 57:31

  • Automated end-to-end testing can be set up by having two bots (e.g., eval bot and conversational bot) interact and validate results.
  • Pipecat is used in production by large companies and handles hundreds of thousands of calls per day.
  • Sample projects include interactive games (e.g., Word Wrangler) and phone-based bots, showcasing complex use cases with multiple AI agents in parallel pipelines.

Open-Source, Community, and Future Directions 65:06

  • Fully offline bots are possible with local models, but state-of-the-art quality still requires cloud or on-prem LLM/TTS/STT services.
  • Input transcription (STT) remains a fundamental challenge for accuracy.
  • Ongoing work addresses conversation flow, semantic turn-ending detection, and natural interruption handling.
  • Developers can join the project's Discord for support and community involvement.
  • Example demo project: Word Wrangler game using parallel Gemini agents for interactive gameplay.
  • Workshop concludes with invitation for further questions and encourages exploration of sample projects.