Grok 4 is really smart... Like REALLY SMART

Grok Model Progression and Reinforcement Learning 00:00

  • Grok 4 is described as the smartest current AI model, representing a significant leap from previous frontier models.
  • Earlier versions progressed in stages: Grok 2 focused on next-token prediction, Grok 3 increased the training compute, and the Grok 3 reasoning model added reinforcement learning on top.
  • Grok 4’s main advancement is reinforcement learning with verifiable rewards: the model is trained on problems with known solutions and rewarded for correct answers, which encourages sophisticated “thinking” behaviors.
  • The approach faced a challenge in finding enough real-world problems with verifiable solutions, highlighting a limitation in synthetic benchmarking.
  • Elon Musk suggested that real-world testing, such as through robotics, could provide virtually unlimited verifiable rewards.
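
The idea of a verifiable reward can be illustrated with a minimal sketch. This is not xAI's training code; the function names (`normalize`, `verifiable_reward`, `batch_rewards`) and the exact-match scoring rule are assumptions chosen for illustration. It only shows the core property: a sampled answer is checked against a known solution and scored 1.0 or 0.0.

```python
from fractions import Fraction


def normalize(answer: str) -> str:
    """Canonicalize an answer so trivially different forms compare equal.

    Hypothetical helper: numeric answers are reduced to an exact fraction,
    anything else is compared as lowercase text.
    """
    text = answer.strip().lower().replace(",", "")
    try:
        return str(Fraction(text))
    except (ValueError, ZeroDivisionError):
        return text


def verifiable_reward(model_answer: str, known_solution: str) -> float:
    """Return 1.0 if the answer matches the known solution, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(known_solution) else 0.0


def batch_rewards(samples, known_solution):
    """Score several sampled answers to the same problem."""
    return [verifiable_reward(sample, known_solution) for sample in samples]


if __name__ == "__main__":
    # Three sampled answers to a problem whose known solution is 1/2.
    print(batch_rewards(["0.5", "1/2", "2"], "1/2"))
```

In a real training loop these rewards would feed a policy-gradient style update; that part is omitted here because the summary does not describe it.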

Benchmark Performance and Multi-Agent Approach 02:37

  • On the difficult “Humanity’s Last Exam” benchmark (covering multiple domains: math, physics, biology, etc.), Grok 4 scored 26.9% without tools, surpassing other models like Gemini 2.5 Pro (21.6%).
  • With added tool usage (web browsing, code execution), Grok 4 reached 41%.
  • With scaled-up test-time compute and “Grok 4 Heavy” (a multi-agent version where several agents collaborate), the score further increased to 50.7%, doubling the next-best model’s score.
  • The multi-agent system involves parallel agents collaborating and sharing solutions to improve accuracy.
  • Demonstrations included Grok 4 Heavy spawning multiple agents to solve extremely complex math problems, illustrating the collaborative setup.
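
A rough sketch of the parallel-agent idea follows. The summary does not say how Grok 4 Heavy's agents actually share or combine solutions, so the majority vote used here is an assumption swapped in for illustration, and `Agent` and `solve_with_agents` are hypothetical names. Each agent is modeled as a plain function from a problem string to an answer string.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple

# Hypothetical agent interface: takes a problem, returns a final answer.
Agent = Callable[[str], str]


def solve_with_agents(problem: str, agents: Sequence[Agent]) -> Tuple[str, float]:
    """Run all agents in parallel and return the most common answer.

    The second return value is the share of agents that agreed with it.
    Majority voting is an assumed aggregation rule, not the documented one.
    """
    if not agents:
        raise ValueError("at least one agent is required")
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        answers = list(pool.map(lambda agent: agent(problem), agents))
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


if __name__ == "__main__":
    # Stand-in agents; real ones would call a model.
    agents = [lambda p: "4", lambda p: "4", lambda p: "5"]
    print(solve_with_agents("What is 2 + 2?", agents))
```

Agreement among independent attempts is one cheap way to trade extra test-time compute for accuracy; richer schemes would let agents read and critique each other's work before answering.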

Live Demonstrations and Real-Time Capabilities 08:27

  • Grok 4 was shown predicting World Series outcomes by accessing betting odds, calculating probabilities, and presenting a traceable thought process; the Dodgers were given a 21.6% chance.
  • The model generated visualizations, such as two black holes colliding, explicitly stating simplifications and approximations made in the output.
  • Grok 4 demonstrated real-time information gathering by creating a timeline of model score announcements, extracting event data from web sources.
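
The odds-to-probability step in the World Series demo can be sketched as below. The summary does not state which odds format or method the model used, so this assumes American moneyline odds and a simple normalization that removes the bookmaker's margin; the team names and odds in the usage example are made up.

```python
def implied_probability(american_odds: int) -> float:
    """Convert American moneyline odds to an implied win probability."""
    if american_odds == 0:
        raise ValueError("American odds cannot be zero")
    if american_odds > 0:
        # Underdog: +400 means a 100 stake wins 400.
        return 100 / (american_odds + 100)
    # Favorite: -200 means a 200 stake wins 100.
    return -american_odds / (-american_odds + 100)


def normalized_probabilities(odds_by_team: dict) -> dict:
    """Rescale implied probabilities so they sum to 1.

    Raw implied probabilities add up to more than 1 because of the
    bookmaker's margin; dividing by the total removes it.
    """
    raw = {team: implied_probability(odds) for team, odds in odds_by_team.items()}
    total = sum(raw.values())
    return {team: p / total for team, p in raw.items()}


if __name__ == "__main__":
    # Hypothetical odds, for illustration only.
    odds = {"Team A": 350, "Team B": 600, "Field": -150}
    for team, p in normalized_probabilities(odds).items():
        print(f"{team}: {p:.1%}")
```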

Additional Benchmark Comparisons 12:18

  • On the GPQA benchmark, Grok 4 Heavy scored 88.9%, slightly outperforming the previous leader at 86%.
  • Grok 4 Heavy achieved a perfect 100% on AIME 2025, tackling some of the world’s hardest math questions.
  • The model also competed strongly on coding, with a LiveCodeBench score of 79.4%, and reached 96.7% on MathArena.
  • On the ARC-AGI test, designed for generalization and pattern recognition, Grok 4 led with 66.6%, outperforming other major models significantly.

Real-World Application (Vending Bench) 15:02

  • Grok 4 outperformed competitors on “Vending-Bench,” managing a vending machine in a simulated real-world scenario, ending with a net worth of $4,700 (versus $2,000 for Claude Opus 4, $1,800 for o3, $789 for Gemini 2.5 Pro, and $844 for a human).

AI in Game Development 15:46

  • Grok 4 was given to a game developer, who used it to create a first-person shooter in four hours.
  • The model excelled in sourcing game assets and textures, streamlining a traditionally time-consuming part of game development.
  • While capable of producing impressive demos, the speaker notes that fully AI-generated AAA games are not imminent, with human creativity and taste still essential.

Availability, Features, and Pricing 18:00

  • Grok 4 is available via API, supporting a 256k context window, multimodal reasoning, real-time data search, and enterprise-grade security.
  • Pricing: SuperGrok at $30/month; SuperGrok Heavy at $300/month or $3,000/year, including higher rate limits and early access to new features.

Roadmap and Future Developments 19:02

  • Grok 4 is based on foundation model version 6; version 7 training should be completed by the end of the month, promising even better multimodal abilities.
  • Planned releases include a coding-specific model in August, a multimodal agent in September, and a video generation model in October.
  • The speaker will continue to test and report on Grok 4’s capabilities in future videos.