SUMMARY

Grok 4 is described as the smartest current AI model, representing a significant leap from previous frontier models.
Grok 2 focused on next-token prediction, Grok 3 scaled up the compute behind that approach, and Grok 3 Reasoning added reinforcement learning on top.
Grok 4’s main advancement is reinforcement learning with verifiable rewards, where problems with known solutions train the model by rewarding correct answers, allowing for sophisticated “thinking” behaviors.
The main challenge with this approach is finding enough real-world problems with verifiable solutions, which highlights a limitation of relying on synthetic benchmarks.
Elon Musk suggested that real-world testing, such as through robotics, could provide virtually unlimited verifiable rewards.
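
To make the idea of a verifiable reward concrete, here is a minimal sketch in Python. It is not xAI's training code: the function names, the exact-match rule, and the numeric comparison are all assumptions chosen for illustration. It only shows the core idea from the summary, which is that a problem with a known solution lets you score a model's answer automatically.

```python
from fractions import Fraction


def normalize_answer(text: str) -> str:
    """Strip whitespace and lowercase so trivial formatting differences do not matter."""
    return text.strip().lower()


def verifiable_reward(model_answer: str, known_solution: str) -> float:
    """Hypothetical reward: 1.0 if the answer matches the known solution, else 0.0."""
    a = normalize_answer(model_answer)
    b = normalize_answer(known_solution)
    try:
        # When both sides parse as numbers, compare by value so "0.5" matches "1/2".
        return 1.0 if Fraction(a) == Fraction(b) else 0.0
    except (ValueError, ZeroDivisionError):
        # Otherwise fall back to a plain string comparison.
        return 1.0 if a == b else 0.0


def score_batch(samples: list) -> list:
    """Score a list of (model_answer, known_solution) pairs."""
    return [verifiable_reward(answer, solution) for answer, solution in samples]
```

In a real reinforcement learning loop these rewards would feed a policy update; that part is omitted here because the summary does not describe it.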

On the difficult “Humanity’s Last Exam” benchmark (covering multiple domains: math, physics, biology, etc.), Grok 4 scored 26.9% without tools, surpassing other models like Gemini 2.5 Pro (21.6%).
With added tool usage (web browsing, code execution), Grok 4 reached 41%.
With scaled-up test-time compute and “Grok 4 Heavy” (a multi-agent version where several agents collaborate), the score increased further to 50.7%, doubling the next best model’s score.
The multi-agent system involves parallel agents collaborating and sharing solutions to improve accuracy.
Demonstrations included Grok 4 Heavy spawning multiple agents to solve extremely complex math problems, illustrating the collaborative setup.
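
The summary does not say how the agents in Grok 4 Heavy actually combine their work, so the following Python sketch substitutes a simple stand-in: run several agents on the same problem in parallel and keep the most common answer (majority voting). The `Agent` type and `solve_with_agents` function are hypothetical names, and majority voting is an assumption, not the documented mechanism.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

# Hypothetical agent interface: takes a problem statement, returns a final answer.
Agent = Callable[[str], str]


def solve_with_agents(problem: str, agents: Sequence[Agent]) -> str:
    """Run every agent on the same problem in parallel and return the most common answer."""
    if not agents:
        raise ValueError("at least one agent is required")
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        # pool.map preserves agent order, which keeps tie-breaking deterministic.
        answers = list(pool.map(lambda agent: agent(problem), agents))
    # On a tie, Counter returns the answer that was seen first.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

With stub agents such as `[lambda p: "42", lambda p: "42", lambda p: "41"]`, the function returns `"42"`. A system where agents read and critique each other's solutions would need an extra exchange step that this sketch leaves out.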

Grok 4 was shown predicting World Series outcomes by accessing betting odds, calculating probabilities, and presenting a traceable thought process; the Dodgers were given a 21.6% chance.
The model generated visualizations, such as two black holes colliding, explicitly stating simplifications and approximations made in the output.
Grok 4 demonstrated real-time information gathering by creating a timeline of model score announcements, extracting event data from web sources.
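
The World Series demo above rests on turning betting odds into probabilities. The summary does not show the model's calculation, so this Python sketch shows one standard way to do it, assuming decimal odds: take the reciprocal of each price, then rescale so the probabilities sum to 1, which removes the bookmaker's margin. The function name and the example teams and prices are hypothetical.

```python
from typing import Dict


def implied_probabilities(decimal_odds: Dict[str, float]) -> Dict[str, float]:
    """Convert decimal betting odds into probabilities that sum to 1."""
    # The raw implied probability of an outcome is the reciprocal of its decimal odds.
    raw = {team: 1.0 / odds for team, odds in decimal_odds.items()}
    # Raw values usually sum to more than 1 because of the bookmaker's margin,
    # so divide by the total to normalize.
    total = sum(raw.values())
    return {team: p / total for team, p in raw.items()}


# Hypothetical three-team market, not real odds.
example = implied_probabilities({"Team A": 2.0, "Team B": 2.0, "Team C": 4.0})
```

Here the raw reciprocals are 0.5, 0.5, and 0.25, which sum to 1.25; after normalizing, Team A and Team B each get 0.4 and Team C gets 0.2.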

On the GPQA benchmark, Grok 4 Heavy scored 88.9%, slightly outperforming the previous leader at 86%.
Grok 4 Heavy achieved a perfect 100% on AIME 2025, tackling some of the world’s hardest math questions.
Grok 4 also competed strongly on coding, with a LiveCodeBench score of 79.4%; its MathArena score reached 96.7%.
On the ARC-AGI test, designed to measure generalization and pattern recognition, Grok 4 led with 66.6%, significantly outperforming other major models.

Grok 4 outperformed competitors on “Vending-Bench,” which involves managing a vending machine in a simulated real-world scenario, ending with a net worth of $4,700 (versus $2,000 for Claude Opus 4, $1,800 for o3, $789 for Gemini 2.5 Pro, and $844 for a human).

Grok 4 was given to a game developer, who used it to create a first-person shooter in four hours.
The model excelled in sourcing game assets and textures, streamlining a traditionally time-consuming part of game development.
While capable of producing impressive demos, the speaker notes that fully AI-generated AAA games are not imminent, with human creativity and taste still essential.

Grok 4 is available via API, supporting a 256k context window, multimodal reasoning, real-time data search, and enterprise-grade security.
Pricing: SuperGrok at $30/month; SuperGrok Heavy at $300/month or $3,000/year, including higher rate limits and early access to new features.

Grok 4 is based on foundation model version 6; version 7 training should be completed by the end of the month, promising even better multimodal abilities.
Planned releases include a coding-specific model in August, a multimodal agent in September, and a video generation model in October.
The speaker will continue to test and report on Grok 4’s capabilities in future videos.

Grok 4 is really smart... Like REALLY SMART