What Is a Humanoid Foundation Model? An Introduction to GR00T N1 - Annika & Aastha

Introduction & Why Humanoid Robots? 00:01

  • The speakers introduce themselves as part of the Nvidia team that developed the GR00T N1 robotics foundation model.
  • They push back on the fear that AI will eliminate jobs, citing a McKinsey report showing that advanced economies have more open jobs than workers to fill them, especially in industries like leisure, healthcare, construction, and manufacturing.
  • These industries require physical AI, as current language-based AIs like ChatGPT can't address tasks that involve operating in the physical world.
  • The need for humanoid robots is justified by the fact that human-centric environments are optimized for human forms, making it practical to use robots designed like humans for varied tasks.
  • Specialist robots can excel at specific functions but lack general utility in diverse environments.

Building a Robotics Foundation Model & Project Overview 03:04

  • The physical AI lifecycle includes three main stages: data generation/collection, data consumption through training, and deployment on edge devices.
  • Nvidia describes this as the "three computer problem," with each stage requiring different compute: powerful simulation hardware for data generation, dedicated training infrastructure, and efficient edge devices for deployment.
  • Project GR00T is Nvidia’s end-to-end initiative spanning hardware, software, and research; this talk focuses on the foundation model.
  • The GR00T N1 foundation model is open source, highly customizable, and designed to be "cross embodiment," able to adapt to different robot forms with a base model of two billion parameters.

The Data Challenge: The Robotics Data Pyramid 04:47

  • Acquiring data for robotics models is challenging due to the lack of internet-scale datasets for real-world robot actions.
  • The "data pyramid" consists of:
    • Top: Real-world robot data, collected mostly via human teleoperation; the highest quality, but expensive and scarce.
    • Middle: Synthetic data generated through simulation; potentially unlimited, but creating high-quality simulations is resource-intensive.
    • Bottom: Abundant internet video data, mostly of humans performing tasks; plentiful but unstructured and only indirectly relevant to robot actions.
  • Video generation models can multiply the limited real-world data, and active research aims to blend simulated and real-world data efficiently.
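One way to picture blending the pyramid's layers during training is a weighted sampler over data sources. This is a minimal sketch: the layer names and mixture weights below are invented for illustration and are not GR00T N1's actual training recipe, which the talk does not specify.

```python
import random

# Illustrative sampling weights for the pyramid's layers -- NOT GR00T N1's
# actual training mixture, which is not disclosed in the talk.
DATA_PYRAMID = {
    "real_teleop":   0.2,   # top: scarce, expensive, highest fidelity
    "synthetic_sim": 0.3,   # middle: simulated data, unlimited in principle
    "web_video":     0.5,   # bottom: abundant human videos, least structured
}

def sample_batch_sources(batch_size, seed=None):
    """Pick a data source for each example in a training batch,
    weighted by that layer's share of the mixture."""
    rng = random.Random(seed)
    layers = list(DATA_PYRAMID)
    weights = [DATA_PYRAMID[layer] for layer in layers]
    return rng.choices(layers, weights=weights, k=batch_size)

batch = sample_batch_sources(8, seed=0)
print(batch)
```

In practice the open research question the speakers mention is not the sampling mechanics but choosing the weights: how much simulated and web data can substitute for scarce teleoperation data without degrading real-world performance.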

GR00T N1 Model Input/Output and Architecture 07:44

  • Example inputs to the model include image observations, robot state, and a language prompt, with outputs being robot action trajectories (continuous motions).
  • Internally, the model works with floating-point vectors that command joint movements; the fluid hand motions humans perceive are the rendered effect of those values.
  • The model architecture introduces a "dual system" inspired by Daniel Kahneman's "Thinking, Fast and Slow":
    • System Two (the "brain," a vision-language model): reasons about the scene and breaks complex tasks into sub-tasks.
    • System One (a diffusion transformer): rapidly generates the motor actions for each sub-task, operating at high frequency (about 120 Hz).
  • Robot state and prior actions pass through dedicated state and action encoders, producing tokens for further processing.
  • Images and text are encoded by a vision encoder and text tokenizer, fused by the vision-language model (VLM), and then combined with the state/action tokens inside a diffusion transformer block via attention mechanisms.
  • An embodiment-specific action decoder translates output tokens into executable motion for specific robot forms, enabling cross-embodiment generalization.
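The flow above can be sketched end-to-end with numpy stand-ins. Everything here is an illustrative assumption, not GR00T N1's real architecture: the dimensions, the toy attention and denoising math, and all function names are invented to show how tokens move from System Two, through the diffusion transformer, to an embodiment-specific decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions, not GR00T N1's real sizes.
D = 64          # shared token width
HORIZON = 16    # length of the predicted action chunk
DOF = 7         # joints for one hypothetical arm embodiment

def system_two_encode(image, prompt):
    """System Two stand-in: a VLM would map pixels + text to tokens.
    Here we just fabricate a token sequence of the right shape."""
    return rng.normal(size=(24, D))

def system_one_denoise(vlm_tokens, state, steps=4):
    """System One stand-in: a diffusion transformer iteratively
    refines a noisy action-chunk latent, attending to the VLM tokens
    and the encoded robot state."""
    latent = rng.normal(size=(HORIZON, D))
    context = np.concatenate([vlm_tokens, state[None, :]], axis=0)
    for _ in range(steps):
        attn = latent @ context.T                       # toy attention scores
        attn = np.exp(attn - attn.max(-1, keepdims=True))
        attn /= attn.sum(-1, keepdims=True)
        latent = 0.5 * latent + 0.5 * (attn @ context)  # toy denoising step
    return latent

def action_decoder(latent, w_embodiment):
    """Embodiment-specific head: project shared latents to joint targets."""
    return latent @ w_embodiment                        # (HORIZON, DOF)

state = rng.normal(size=(D,))
w_arm = rng.normal(size=(D, DOF)) * 0.1                 # one embodiment's head
tokens = system_two_encode(image=None, prompt="pick up the apple")
actions = action_decoder(system_one_denoise(tokens, state), w_arm)
print(actions.shape)  # (16, 7): a chunk of continuous joint commands
```

The design point the sketch captures is that only the final projection is embodiment-specific: swapping `w_arm` for another robot's decoder reuses the same shared backbone, which is what makes cross-embodiment generalization possible.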

Robot Learning Approaches: Imitation vs Reinforcement 13:06

  • Two main robot training methods:
    • Imitation Learning: the robot copies human experts, minimizing the difference between its actions and expert demonstrations. Its main constraint is that expert data is scarce and expensive to collect.
    • Reinforcement Learning: the robot learns via trial and error to maximize a reward signal. It needs no expert data, but policies trained in simulation often degrade on real hardware (the "sim-to-real gap").
  • GR00T N1 leverages both approaches in its training.
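The contrast between the two methods can be shown on a toy linear policy in numpy. Everything here is invented for illustration: the "environment" is a known linear mapping, behavior cloning regresses onto expert actions, and the RL variant uses a REINFORCE-style update that sees only rewards, never expert labels. None of this is GR00T N1's actual training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([[1., 0.], [0., 1.], [0.5, 0.], [0., 0.5]])  # "true" mapping

obs = rng.normal(size=(64, 4))
expert_actions = obs @ target        # demonstrations from a perfect expert

# --- Imitation learning: behavior cloning, i.e. regression onto expert data.
W_bc = np.zeros((4, 2))
for _ in range(300):
    err = obs @ W_bc - expert_actions
    W_bc -= 0.05 * (obs.T @ err) / len(obs)     # gradient of the MSE loss

# --- Reinforcement learning: no labels, only a reward. REINFORCE with a
# Gaussian policy: nudge W toward sampled actions with above-average reward.
W_rl, sigma = np.zeros((4, 2)), 0.3
for _ in range(2000):
    noise = rng.normal(scale=sigma, size=(64, 2))
    actions = obs @ W_rl + noise                       # sampled (noisy) actions
    rew = -np.sum((actions - obs @ target) ** 2, axis=1)  # toy reward signal
    adv = rew - rew.mean()                             # baseline-subtracted
    W_rl += 0.02 * (obs.T @ (noise * adv[:, None])) / (sigma**2 * len(obs))

print(float(np.abs(W_bc - target).max()), float(np.abs(W_rl - target).max()))
```

Both recover roughly the same policy, but the trade-off the talk describes is visible in the setup: imitation needed the `expert_actions` array (expensive to collect in the real world), while RL needed only a reward function plus many more noisy trials (cheap in simulation, but simulators diverge from reality).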

Application Examples & Model Capabilities 14:48

  • The trained model can perform varied tasks such as kitchen pick-and-place, simulated "romantic" gestures, and collaborative industrial tasks.
  • Its generalist design allows adaptation to a wide variety of tasks and environments, not limited to a fixed set of operations.

Conclusion: Core Principles and Takeaways 15:39

  • Three fundamental principles:
    • The data pyramid: working around the scarcity of real-world action data, which has no internet-scale equivalent of the text corpora behind language models.
    • Dual system architecture: co-training the perception/planning and execution modules yields better integration and performance than training them separately.
    • Generalist model: The foundation model supports cross-embodiment and broad downstream task adaptation, similar to how LLMs can be fine-tuned for varied applications.
  • The GR00T N1 foundation model is positioned as a general-purpose robotics brain adaptable to diverse embodiments and functions.