Early iterations of DeepMind's Veo model required significant prompt guidance and produced mostly visual-only videos lacking high-quality sound.
The newer Veo 3 model offers visually richer, more photorealistic, and cinematically styled outputs, now including sound that closely matches the video content.
Prompt rewriting is introduced, making it easier for users to generate detailed and effective prompts with minimal expertise.
Publicly available Veo clips are limited to around 8 seconds for easy experimentation and creative control; internal versions allow longer clips.
Integration of sound, more realistic physics (lighting, gravity), and improved character consistency distinguish Veo 3 from earlier models.
Demonstrating Multimodality and Creative Capacity 10:18
Users can ground video generation in a starting image for more control, using Imagen 4 to generate initial frames (see the sketch after this list).
Gemini models enable precise editing, letting users modify specifics in an image, such as the number of people, style, or lighting, and even produce sound and music alongside video.
Multimodality allows handling text, code, images, audio, and video within the same model, reflecting how humans experience the world through multiple senses.
Gemini can produce steerable audio (e.g., adjusting volume, language, or emotion), giving greater creative flexibility.
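As a concrete illustration of image-grounded video generation, here is a minimal sketch assuming the google-genai Python SDK: an Imagen call produces the starting frame, and a Veo call animates it. The model ids (imagen-4.0-generate-001, veo-3.0-generate-preview) are assumptions and may differ from the current public releases.

```python
import time
from google import genai

client = genai.Client()  # picks up the API key from the environment

# Generate a starting frame with Imagen (model ids here are assumptions).
image_response = client.models.generate_images(
    model="imagen-4.0-generate-001",
    prompt="A lighthouse on a sea cliff at golden hour, cinematic wide shot",
)
first_frame = image_response.generated_images[0].image

# Ground the video generation in that frame; Veo runs as a long-running operation.
operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",  # assumed preview model id
    prompt="Waves crash below as the camera slowly pushes in toward the lighthouse",
    image=first_frame,
)
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Download the finished clip (public clips run around 8 seconds).
video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("lighthouse.mp4")
```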
Flow, developed by Google Labs, is a platform tailored for filmmakers, supporting advanced video creation: stitching clips together, camera controls, and styling options.
Flow offers greater creative and cinematic control than the standard Gemini app, supporting specialized roles and workflows for professionals.
Character consistency is maintained across scenes and lighting conditions, facilitating more complex storytelling.
There are safeguards against misuse, including watermarks on AI-generated content, limits on generating certain subjects (e.g., children, public figures), and safety filters to prevent malicious or misleading use.
Launch of Gemini's text-to-speech API enables expressive, steerable audio generation in multiple languages.
Evolution from WaveNet, which produced single-task, less flexible voices, to Gemini, which can dynamically generate audio with emotion, style, and language variation.
Users can prompt Gemini to produce speech in various tones (friendly, romantic, angry, grieving) and languages, demonstrating nuanced control in real time.
AI Studio provides an accessible user interface to create and experiment with these audio capabilities; users can view the model’s thought process as it generates outputs.
Developers can instantly access SDK code for generated audio, streamlining integration into their own projects; a minimal sketch follows.
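A minimal sketch of the steerable text-to-speech flow, assuming the google-genai Python SDK and a preview TTS model id; the tone instruction is carried in the prompt itself, and the voice name is one of the prebuilt options.

```python
import wave
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # assumed preview TTS model id
    contents="Say in a warm, reassuring tone: Everything is going to be fine.",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)

# The model returns raw PCM audio; wrap it in a WAV container for playback.
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open("speech.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(24000)
    f.writeframes(pcm)
```

Swapping the tone instruction (friendly, romantic, angry, grieving) or the language of the prompt is enough to steer the output, which is the nuanced control described above.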
Gemini Live enables multimodal, real-time AI assistance—integrated into AI Studio and available on Android devices.
It can “see” and interpret what's on screen or through a webcam, provide empathetic explanations, and integrate with tools like Google Calendar, Docs, and Gmail.
Example use cases include describing onscreen content, analyzing code within a Google Colab notebook, and answering questions with up-to-date information from Google Search (see the sketch below).
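A minimal text-only sketch of a Gemini Live session, assuming the google-genai Python SDK's Live interface; real applications would stream microphone audio, webcam, or screen frames instead, and the model id is an assumption.

```python
import asyncio
from google import genai

client = genai.Client()
MODEL = "gemini-2.0-flash-live-001"  # assumed Live API model id


async def main():
    # A text-only Live session for simplicity; the same session object
    # also accepts streamed audio and video input in real applications.
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Summarize what you can help with."}]},
            turn_complete=True,
        )
        async for message in session.receive():
            if message.text:
                print(message.text, end="")


asyncio.run(main())
```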
AI Studio’s “build apps with Gemini” feature lets users, even those without coding experience, generate working application code simply by describing project requirements in natural language (approximated in the sketch after this list).
The produced code targets the latest SDKs and is designed to be self-healing: errors are detected and resolved automatically.
Features include AI-generated images (via Imagen) and the ability to embed videos, summaries, and external data in apps or websites.
Apps can be deployed using Google Cloud's Cloud Run, providing a secure, scalable, and shareable web address that protects API keys.
These tools free developers from tedious maintenance, allowing greater focus on product experience and on building ambitious systems.
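AI Studio's build feature is a UI flow rather than an API, but the underlying idea of describing requirements in natural language and receiving application code can be approximated directly through the SDK; this sketch assumes the google-genai Python SDK, and the model id is a placeholder.

```python
from google import genai

client = genai.Client()

# Describe the app in plain language and ask the model for runnable code.
spec = (
    "Write a single-file Python Flask app that serves a form where users "
    "paste text and get back a one-paragraph summary. Return only the code."
)
response = client.models.generate_content(
    model="gemini-2.5-pro",  # assumed model id
    contents=spec,
)
print(response.text)
```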
The accessibility of AI-driven app and content creation opens doors for non-engineers and experts alike, encouraging wider participation and innovation.
Anticipation of a transformative impact on how people collaborate, create, and share across disciplines—enabling scientists, historians, musicians, and more to digitize and distribute their ideas.
Integration of audio, video, and language tools into seamless platforms marks a new era for creative AI.
The episode highlights the excitement around new capabilities, the promise of democratized creativity, and the potential for anyone to bring ideas to life rapidly using these AI tools.