Voice is considered the most natural human interface, enabling storytelling, conversation, and emotional expression.
Voice AI is viewed as a critical and universal building block for the next generation of generative AI, particularly at the UI level.
Voice agents are already deployed at scale in various applications, including language translation, directed learning, speech therapy, and enterprise co-pilots.
Users often do not register that they are talking to a voice agent even after being told so, which underscores how natural the interface feels.
The voice AI stack consists of large language models (LLMs), real-time APIs (like Google's Gemini Live API), orchestration libraries and frameworks (like Pipecat), and application code.
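As a concrete reference point for the real-time API layer, here is a minimal sketch of opening a Gemini Live session with the google-genai Python SDK and exchanging a single turn. It is a sketch only: the model ID, the config fields, and the send_client_content / receive method names reflect one version of the SDK and are assumptions that may differ from the current API surface.

```python
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # assumes an API key is available

MODEL = "gemini-2.0-flash-live-001"            # assumed Live-capable model ID
CONFIG = {"response_modalities": ["TEXT"]}     # assumed config shape

async def main():
    # The Live API keeps a bidirectional streaming session open; real-time
    # voice agents are built on top of exactly this kind of session.
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Add milk to my grocery list."}]}
        )
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```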
Voice AI is still considered early-stage: no layer of the stack is viewed as more than roughly 50% solved, and significant work remains across all of it.
As the technology matures, capabilities tend to move down the stack, from individual application code to orchestration libraries and eventually into core APIs and models.
Turn detection is given as an example of this evolution, having moved from application-level implementation to frameworks, then APIs, and is expected to be handled by models over time.
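To make the turn-detection example concrete, the sketch below shows the kind of silence-timeout heuristic an application might have hand-rolled before frameworks and APIs absorbed this responsibility; the is_speech voice-activity check and the frame timing are illustrative assumptions, not code from the talk.

```python
from typing import Callable, Iterable

def detect_end_of_turn(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],   # assumed VAD interface, e.g. wrapping webrtcvad
    frame_ms: int = 20,                   # duration of each audio frame in milliseconds
    silence_ms: int = 700,                # silence required to declare the turn over
) -> bool:
    """Naive application-level turn detection: the user's turn ends after
    `silence_ms` of continuous non-speech following some detected speech."""
    heard_speech = False
    silent_for = 0
    for frame in frames:
        if is_speech(frame):
            heard_speech = True
            silent_for = 0
        elif heard_speech:
            silent_for += frame_ms
            if silent_for >= silence_ms:
                return True   # end of turn detected
    return False
```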
Live Demo: Voice Assistant for Task Management (06:57)
A demonstration showcased a personal voice AI assistant used for managing priorities and various lists (grocery, reading, work tasks).
The assistant successfully created a grocery list for "asparagus pizza," adding ingredients like pizza crust, mozzarella cheese, tomato sauce, asparagus, garlic, and olive oil.
During the demo, the assistant struggled to add "Dream Count" to a reading list and identify its author, repeatedly garbling the title as "Quick" or even "Segmentation fault."
The assistant successfully created a "work tasks" list, adding items such as "create H2 roadmap by end of day Friday" and "finish writing podcast script by end of day Thursday, June 5th, 2025."
It also performed complex list manipulations, combining grocery, reading, and work lists and splitting them into "personal tasks" and "work tasks."
The demo concluded with the assistant generating a dynamic app display featuring "hello world" text in Google colors and animated neon green ASCII cats, demonstrating multimodal capabilities.
The demo highlighted the "jagged frontier" of AI's ability to intuit user intent, especially with minimal explicit instructions in the code.
Model behavior can be variable, sometimes leading to unexpected, yet potentially beneficial, outcomes.
The system heavily relies on the LLM's intelligence for contextual understanding of concepts like "lists" and their content.
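To illustrate how little explicit instruction such an assistant needs, the following is a hypothetical set of JSON-schema-style function declarations for list management; the tool names and fields are invented for illustration, and everything else, such as inferring pizza ingredients or deciding which items are "personal" versus "work," is left to the model.

```python
# Hypothetical tool declarations for a list-managing voice assistant.
# The model, not application code, decides which list an utterance refers to,
# what items it implies, and how to combine or split lists.
LIST_TOOLS = [
    {
        "name": "create_list",
        "description": "Create a new named list, optionally with initial items.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "items": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["name"],
        },
    },
    {
        "name": "add_items",
        "description": "Add items to an existing list.",
        "parameters": {
            "type": "object",
            "properties": {
                "list_name": {"type": "string"},
                "items": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["list_name", "items"],
        },
    },
    {
        "name": "move_items",
        "description": "Move items between lists (used to combine or split lists).",
        "parameters": {
            "type": "object",
            "properties": {
                "from_list": {"type": "string"},
                "to_list": {"type": "string"},
                "items": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["from_list", "to_list", "items"],
        },
    },
]
```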
An analogy was drawn between old memory aids, such as grandmothers tying knots in string to remember tasks, and modern voice AI, emphasizing how technology has evolved human capability well beyond such simple reminders.
The speakers expressed belief that voice is the most natural interface and that most future interactions with language models will occur via voice.
Gemini models are trained to be multimodal from the ground up, capable of ingesting text, voice, images, and video.
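As a small illustration of that multimodality, the sketch below sends an image together with a text prompt through the google-genai SDK; the model ID and file name are assumptions, and the exact request shape may vary by SDK version.

```python
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")  # assumes an API key is configured

# A single request can mix modalities: here an image plus a text question.
whiteboard = Image.open("whiteboard.png")      # hypothetical local image
response = client.models.generate_content(
    model="gemini-2.0-flash",                  # assumed model ID
    contents=[whiteboard, "List the tasks written on this whiteboard."],
)
print(response.text)
```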