Chelsea Finn: Building Robots That Can Do Anything

Introduction & Challenges in Robotics 00:00

  • General-purpose robotics has been hindered by the need to build a specialized company for each application (e.g., logistics, kitchen robots, wet-lab automation).
  • Each application usually requires new hardware, custom software, and task-specific solutions, which hampers widespread success.
  • The speaker co-founded Physical Intelligence to address this by enabling any robot to perform any task in any environment, focusing on foundation models for robotics akin to those used in language and coding.
  • Scale is an important factor in model development, but simply using large datasets (from industry, YouTube, or simulation) is not sufficient, as each source lacks comprehensive task diversity or realism.

Developing Foundation Models for Robotics 03:34

  • To train generalist models, real-world robot data was collected, starting with simple tasks like lighting a candle.
  • A key challenge is teaching robots complex, dexterous, long-horizon tasks, such as unloading dryers and folding laundry.
  • Initial efforts started simple (folding a single brand and size of shirt), gradually increasing task complexity (folding crumpled shirts, varying clothing sizes and types).
  • Data was collected using teleoperation and policies were trained via imitation learning, mapping robot camera images to joint positions.
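The training setup in the last bullet is standard behavior cloning: supervised regression from observations to the teleoperator's joint commands. A minimal NumPy sketch, assuming pre-extracted image features and a linear policy (the actual models are large neural networks; all sizes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical teleoperation dataset: image features -> demonstrated joint positions.
n_demos, feat_dim, n_joints = 512, 32, 7
features = rng.normal(size=(n_demos, feat_dim))
true_map = rng.normal(size=(feat_dim, n_joints))
joint_targets = features @ true_map + 0.01 * rng.normal(size=(n_demos, n_joints))

# Linear "policy" trained by gradient descent on mean-squared error --
# behavior cloning: imitate the teleoperator's joint commands.
W = np.zeros((feat_dim, n_joints))
lr = 0.05
for _ in range(200):
    pred = features @ W
    grad = features.T @ (pred - joint_targets) / n_demos
    W -= lr * grad

final_loss = float(np.mean((features @ W - joint_targets) ** 2))
```

In the real system the "features" are raw camera images and the policy is a deep network, but the supervised mapping from observation to joint position is the same idea.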

Iterative Progress on Laundry Folding 05:04

  • Early models struggled with generalization, especially with increased task complexity, often showing 0% success rates.
  • The team experimented with variables such as model memory, control strategies, image resolution, and more.
  • Progress improved significantly when adopting a pre-training and fine-tuning recipe: models were first pre-trained on broad data, then fine-tuned on highly curated, high-quality demonstrations.
  • This technique enabled robots to reliably fold five clothing items in sequence, although still imperfect (e.g., taking 20 minutes, struggling with stacks).
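The pre-training/fine-tuning recipe above can be sketched with the same supervised setup: first fit plentiful but inconsistent "broad" demonstrations, then continue training on a small, clean, curated set. A toy NumPy illustration (dataset sizes and noise levels are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
feat_dim, n_joints = 16, 4
target_map = rng.normal(size=(feat_dim, n_joints))

def make_data(n, noise):
    X = rng.normal(size=(n, feat_dim))
    Y = X @ target_map + noise * rng.normal(size=(n, n_joints))
    return X, Y

# Broad pre-training data: plentiful but noisy, inconsistent demonstrations.
X_broad, Y_broad = make_data(2000, 0.5)
# Curated fine-tuning data: scarce but clean, consistent demonstrations.
X_cur, Y_cur = make_data(100, 0.02)

def train(W, X, Y, lr, steps):
    for _ in range(steps):
        W = W - lr * X.T @ (X @ W - Y) / len(X)
    return W

W = np.zeros((feat_dim, n_joints))
W = train(W, X_broad, Y_broad, 0.05, 200)   # pre-train on broad data
loss_pre = float(np.mean((X_cur @ W - Y_cur) ** 2))
W = train(W, X_cur, Y_cur, 0.05, 200)       # fine-tune on curated data
loss_post = float(np.mean((X_cur @ W - Y_cur) ** 2))
```

Pre-training gets the policy close; fine-tuning on the curated set sharpens it toward the reliable, high-quality folding strategies.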

Leveraging Larger Models and Generalization 11:06

  • Transitioned to larger, open-source vision-language models (up to 3 billion parameters) for increased performance and generalization.
  • These models, pre-trained on all available robot data and fine-tuned with curated datasets, improved success rates, speed (down to 9 minutes for five items), and demonstrated the ability to handle unseen clothing types.
  • The general approach was transferable: the same training recipe enabled robots to clean tables, construct boxes, light candles, and was successfully adapted to entirely new robot platforms with different control schemes.

Overcoming Environmental Limitations 17:39

  • Noted limitations in that robots were primarily tested in the environments where they were trained.
  • Addressed this by collecting diverse data from over 100 unique rooms (bedrooms, kitchens, homes across San Francisco), training on both mobile and static robot data sources.
  • Fine-tuned models achieved significantly better performance in unseen environments, including Airbnbs, by leveraging the diversity and scale of training data.

Language Following & Instruction Generalization 20:13

  • Early models sometimes ignored language instructions, opting instead for their own strategies.
  • Improved language following by discretizing actions into tokens and stopping gradients from the randomly initialized action head, preventing it from degrading the pre-trained vision-language model's representations.
  • Achieved 80% instruction-following accuracy, up from 20% previously.
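One way to read the fix above: continuous joint actions are discretized into a small vocabulary of bins so the VLM can emit them as ordinary tokens, while gradients from the freshly initialized action head are stopped (e.g., via a stop-gradient op) before they reach the pre-trained backbone. A toy sketch of the tokenization half only, with an assumed bin count and normalized action range:

```python
import numpy as np

N_BINS = 256           # illustrative vocabulary size for action tokens
LOW, HIGH = -1.0, 1.0  # assumed normalized joint-angle range

def tokenize(actions):
    """Map continuous actions in [LOW, HIGH] to integer token ids."""
    clipped = np.clip(actions, LOW, HIGH)
    ids = ((clipped - LOW) / (HIGH - LOW) * N_BINS).astype(int)
    return np.minimum(ids, N_BINS - 1)

def detokenize(tokens):
    """Map token ids back to bin-center actions."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

actions = np.array([-0.73, 0.0, 0.42, 0.99])
round_trip = detokenize(tokenize(actions))
# Quantization error is bounded by half a bin width: (HIGH - LOW) / (2 * N_BINS).
```

With enough bins the quantization error is negligible, and the action head becomes a standard token classifier that can share the VLM's output machinery.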

Open-ended Prompts & Hierarchical Models 26:02

  • Hierarchical models allow high-level prompts (e.g., "make me a sandwich") to be decomposed into low-level actions.
  • Addressed the challenge of limited real-world human-robot interaction data by synthetically generating prompts using large language models.
  • This enables robots to follow a wide range of prompts, including nuanced requests (e.g., dietary preferences) and real-time interjections or corrections.
  • Compared to using frontier language models as planners, the trained system outperformed these models on instruction-following and relevant task progress.
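The hierarchy described above amounts to a high-level policy that maps an open-ended prompt to a sequence of low-level language commands, each of which a low-level visuomotor policy can execute. A toy, hard-coded version (the real system uses learned models; all task and step names here are invented):

```python
# High-level policy: decompose an open-ended prompt into low-level steps.
# In the real system this is a learned model; here it's a lookup table.
HIGH_LEVEL_PLANS = {
    "make me a sandwich": ["pick up bread", "place bread on plate",
                           "pick up cheese", "place cheese on bread"],
    "clean the table": ["pick up cup", "place cup in bin", "wipe table"],
}

def high_level_policy(prompt):
    return HIGH_LEVEL_PLANS.get(prompt.lower().strip(), [])

def low_level_policy(command):
    # Stand-in for the learned visuomotor policy: returns an action label.
    return f"executing: {command}"

def run(prompt):
    return [low_level_policy(step) for step in high_level_policy(prompt)]

trace = run("Make me a sandwich")
```

Because both levels communicate in language, synthetic prompts generated by an LLM can train the high level, and user interjections can be handled by re-planning mid-sequence.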

Takeaways & Current Limitations 30:05

  • The combination of pre-training on broad data and fine-tuning on curated examples enables robots to perform complex, multi-stage tasks and generalize to new environments and instructions.
  • Real-world data is essential for robustness; greater data scale and diversity increase generalization, but reliability and performance are current bottlenecks.
  • Failure modes include incomplete task execution, misidentification of task targets, speed limitations, and long-term planning challenges.
  • Further research and open-source collaboration are needed before robots are truly general-purpose and ready for the open world.

Q&A: Data, Model Design, and Synthetic Data 31:29

  • High-quality post-training data should be consistent and show reliable, efficient task completion strategies.
  • Reinforcement learning (RL), especially online RL from real robot data, is expected to improve post-training success rates and task efficiency beyond imitation learning alone.
  • Scaling model sizes generally increases performance, but there's ongoing research into combining smaller models with external knowledge bases, though the division of labor is challenging.
  • Synthetic data can be useful, particularly in evaluation and possibly in generating instructive or corrective experiences, but cannot fully replace real robot data for robust, generalizable models.
  • Robotics research in academia versus industry presents different resource constraints; academia often has less data/computing throughput, but both offer unique opportunities for impactful work.

Infrastructure, Open Source, and Community Opportunities 39:00

  • Significant opportunities exist in developing better robotic infrastructure (real-time systems, training platforms) and open-source data/model contributions.
  • More research is needed on integrating world modeling (predicting future states) with action policies and ensuring infrastructure supports the multimodal and timing requirements unique to robotics.

Closing Thoughts 44:07

  • Both academic and industry settings have roles in advancing robotic physical intelligence; resource availability varies but both face limitations.
  • The gap between academia and industry resources is not as large as often perceived; over-resourcing can sometimes lead to inefficiency.
  • Improvements in VLM-based (vision-language model) architectures are needed to better handle physical actions; tokenization techniques are one approach to bridging this gap.
  • The speaker encourages reviewing their group's recent work and papers for more technical details.