Grok Model Progression and Reinforcement Learning 00:00
Grok 4 is described as the smartest current AI model, representing a significant leap from previous frontier models.
Earlier versions like Grok 2 relied on next-token prediction; Grok 3 scaled up pretraining compute, and Grok 3 Reasoning added reinforcement learning on top.
Grok 4’s main advancement is reinforcement learning with verifiable rewards, where problems with known solutions train the model by rewarding correct answers, allowing for sophisticated “thinking” behaviors.
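The idea of verifiable rewards can be sketched in a few lines: when a problem's solution is known, a model's answer can be scored automatically, with no human grader in the loop. The snippet below is an illustrative toy, not xAI's actual training code; the problem set and stand-in model answers are invented for the example.

```python
# Toy illustration of "verifiable rewards": each problem carries a known
# solution, so any candidate answer can be scored mechanically.
problems = [("2+2", 4), ("3*5", 15), ("10-7", 3)]

def verifiable_reward(model_answer: int, known_solution: int) -> int:
    # Binary reward: 1 for a correct answer, 0 otherwise.
    return 1 if model_answer == known_solution else 0

# Pretend model outputs (the last one is deliberately wrong).
model_answers = [4, 15, 2]
rewards = [verifiable_reward(a, sol) for a, (_, sol) in zip(model_answers, problems)]
print(rewards)  # [1, 1, 0]
```

In a real RL setup these reward signals would feed a policy-gradient update of the model; the hard part, as noted above, is assembling enough problems whose solutions can be checked this way.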
The approach faced a challenge in finding enough real-world problems with verifiable solutions, highlighting a limitation in synthetic benchmarking.
Elon Musk suggested that real-world testing, such as through robotics, could provide virtually unlimited verifiable rewards.
Benchmark Performance and Multi-Agent Approach 02:37
On the difficult “Humanity’s Last Exam” benchmark (covering multiple domains: math, physics, biology, etc.), Grok 4 scored 26.9% without tools, surpassing other models such as Gemini 2.5 Pro (21.6%).
Scaling up test-time compute and using “Grok 4 Heavy” (a multi-agent version where several agents collaborate), the score further increased to 50.7%, roughly double the next best model’s score.
The multi-agent system involves parallel agents collaborating and sharing solutions to improve accuracy.
Demonstrations included Grok 4 Heavy spawning multiple agents to solve extremely complex math problems, illustrating the collaborative setup.
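The internals of Grok 4 Heavy's collaboration were not disclosed, but a minimal sketch of the general pattern, several agents attempting a problem in parallel and pooling their candidate answers via a vote, looks like this (the `agent_solve` function is a stand-in for an independent model call, invented for the example):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def agent_solve(agent_id: int, problem: str) -> int:
    # Stand-in for an independent model call; agent 0 simulates a mistake
    # so the vote has something to correct.
    correct = eval(problem)
    return correct + 1 if agent_id == 0 else correct

def heavy_solve(problem: str, n_agents: int = 5) -> int:
    # Run the agents in parallel, then "share" their solutions
    # through a simple majority vote over the candidates.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        candidates = list(pool.map(lambda i: agent_solve(i, problem), range(n_agents)))
    return Counter(candidates).most_common(1)[0][0]

print(heavy_solve("17 * 23"))  # 391
```

Majority voting is the simplest possible aggregation; the presentation suggested Grok 4 Heavy's agents actually exchange intermediate work rather than just final answers, which this sketch does not capture.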
Live Demonstrations and Real-Time Capabilities 08:27
Grok 4 was shown predicting World Series outcomes by accessing betting odds, calculating probabilities, and presenting a traceable thought process; the Dodgers were given a 21.6% chance.
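Converting betting odds into a win probability is straightforward arithmetic; a quick sketch of the decimal-odds case is below. The 4.63 figure is illustrative, chosen only because it reproduces the 21.6% quoted in the demo, and is not the actual odds Grok fetched.

```python
def implied_probability(decimal_odds: float) -> float:
    # A decimal odds quote of d pays d per unit staked, so the
    # bookmaker's implied win probability is 1/d (ignoring the vig).
    return 1 / decimal_odds

# Decimal odds around 4.63 imply roughly the 21.6% chance quoted for the Dodgers.
print(round(implied_probability(4.63) * 100, 1))  # 21.6
```

A full calculation would also normalize across all outcomes to strip out the bookmaker's margin.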
The model generated visualizations, such as two black holes colliding, explicitly stating simplifications and approximations made in the output.
Grok 4 demonstrated real-time information gathering by creating a timeline of model score announcements, extracting event data from web sources.
Grok 4 outperformed competitors on “Vending Bench,” managing a vending machine in a simulated real-world scenario and ending with a net worth of $4,700 (versus $2,000 for Claude Opus 4, $1,800 for o3, $789 for Gemini 2.5 Pro, and $844 for the human baseline).
Grok 4 was given to a game developer, who used it to create a first-person shooter in four hours.
The model excelled in sourcing game assets and textures, streamlining a traditionally time-consuming part of game development.
While capable of producing impressive demos, the speaker notes that fully AI-generated AAA games are not imminent, with human creativity and taste still essential.
Grok 4 is based on foundation model version 6; training of version 7 should be complete by the end of the month, promising even better multimodal abilities.
Planned releases include a coding-specific model in August, a multimodal agent in September, and a video generation model in October.
The speaker will continue to test and report on Grok 4’s capabilities in future videos.