Mark from Daily introduces himself and the team, outlining a hands-on workshop focused on building real-time voice bots.
Participants are urged to check their Wi-Fi connectivity, as real-time audio streaming requires a strong connection.
The goal is to build a voice bot within the session using Pipecat, an open-source Python framework for voice and multimodal AI agents built by Daily.
Understanding Real-Time Voice AI and Pipecat 01:42
Voice AI is challenging because human expectations of conversation are shaped by evolution and everyday communication habits.
Users expect high performance in four areas: listening, conversational ability, natural-sounding voice output, and speed (ideally under 800 ms end-to-end).
Pipecat enables a modular, orchestration-based approach: components such as speech-to-text, the LLM, and text-to-speech can be easily swapped or run in parallel.
Automated end-to-end testing can be set up by having two bots (e.g., an eval bot and a conversational bot) interact and then validating the results.
Pipecat is used in production by large companies and handles hundreds of thousands of calls daily.
Sample projects include interactive games (e.g., Word Wrangler) and phone-based bots, showcasing complex use cases with multiple AI agents in parallel pipelines.
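The modular idea and the two-bot testing idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Pipecat's actual API: `VoicePipeline`, `eval_bot`, and the `fake_*` stages are hypothetical names invented here, strings stand in for audio, and the 800 ms budget is the end-to-end target mentioned in the session.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical stage type (not a Pipecat class): each stage maps a string to a string.
Stage = Callable[[str], str]

LATENCY_BUDGET_MS = 800  # end-to-end target mentioned in the workshop


@dataclass
class VoicePipeline:
    """Chains three swappable stages: speech-to-text, LLM, text-to-speech."""
    stt: Stage
    llm: Stage
    tts: Stage

    def respond(self, audio: str) -> Tuple[str, float]:
        """Run one conversational turn and return (speech, elapsed milliseconds)."""
        start = time.perf_counter()
        text = self.stt(audio)
        reply = self.llm(text)
        speech = self.tts(reply)
        elapsed_ms = (time.perf_counter() - start) * 1000
        return speech, elapsed_ms


# Stand-in stages; a real bot would call STT, LLM, and TTS services here.
def fake_stt(audio: str) -> str:
    return audio[len("audio:"):] if audio.startswith("audio:") else audio


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> str:
    return f"speech:{text}"


def shouting_tts(text: str) -> str:
    return f"speech:{text.upper()}"


def eval_bot(pipeline: VoicePipeline, prompt: str, expected_fragment: str) -> bool:
    """Plays the user's side of one turn, then validates content and latency."""
    speech, elapsed_ms = pipeline.respond(f"audio:{prompt}")
    return expected_fragment in speech and elapsed_ms < LATENCY_BUDGET_MS


if __name__ == "__main__":
    bot = VoicePipeline(fake_stt, fake_llm, fake_tts)
    print(eval_bot(bot, "hello", "hello"))

    # Swapping one component leaves the rest of the pipeline untouched.
    swapped = VoicePipeline(fake_stt, fake_llm, shouting_tts)
    print(swapped.respond("audio:hello")[0])
```

Because each stage is just a callable, replacing the text-to-speech step does not touch the other two, and the same `eval_bot` check can be rerun against the swapped pipeline to catch regressions.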
Open-Source, Community, and Future Directions 65:06
Fully offline bots are possible with local models, but state-of-the-art results require access to cloud or on-prem LLM, TTS, and STT services.
Accurate input transcription (STT) remains a fundamental challenge.
Ongoing work addresses conversation flow, semantic turn-ending detection, and natural interruption handling.
Developers can join the project's Discord for support and community involvement.
The example demo project is the Word Wrangler game, which uses parallel Gemini agents for interactive gameplay.
The workshop concludes with an invitation for further questions and encouragement to explore the sample projects.