Grok 4 is HERE! and it's the best? (Livestream Reaction)

Introduction & Context 00:00

  • The livestream for Grok 4 was delayed by over an hour and the host emphasizes the significance and anticipation for this release
  • The host plans to make multiple future videos testing Grok 4
  • Grok 4 is claimed to achieve perfect SAT scores and near-perfect results on GRE and graduate-level exams across diverse disciplines, even with previously unseen questions

Grok 4's Intelligence & Training Approach 00:27

  • Grok 4 is described as having generalization abilities across all academic fields, outperforming almost all graduate students simultaneously
  • Moving from Grok 2 to Grok 3 to Grok 4, training compute has increased by about 100 times at each step
  • Grok 3 focused on scaling pre-training; Grok 4 scales up reinforcement learning and post-training even more
  • Large computational resources are used: 200,000 GPUs for reinforcement learning, with a focus on verifiable reward outcomes
  • The creation of Colossus, a supercomputer with 100,000 H100 GPUs, facilitated vast increases in training compute

Benchmark Performance & Academic Capabilities 04:33

  • Grok 4 is tested on the "Humanity's Last Exam" (HLE), a set of 2,500 challenging, PhD-level problems across mathematics, natural sciences, engineering, and humanities
  • Prior models scored only single-digit accuracies; Grok 4 achieves much higher
  • Grok 4 is said to perform at PhD level in every academic subject, although it hasn't yet discovered new physics or invented new technology—a milestone predicted to be reached within one or two years

Tool Use & Reasoning Expansion 09:45

  • Addition of native tool-use (like web search, memory) during Grok 4's training significantly improved its capability to use tools
  • Tool use is considered primitive compared to advanced physical simulation tools used at companies like Tesla or SpaceX, but will be integrated later in the year
  • Future plans involve providing Grok with accurate physics simulators and ability to interact with the real world via humanoid robots (like Optimus)
  • Emphasis on entering an "intelligence explosion" era

AI Safety & Values 12:12

  • Host echoes developers' focus on AI safety, truth-seeking, and instilling positive values, likened to raising a super-genius child
  • Ongoing concern about whether these values will persist as AI surpasses human intelligence

Data, Compute, and Reality as the Hardest Test 17:07

  • Scaling reinforcement learning is reaching a data bottleneck, as there are limited verifiable, challenging problems
  • Solutions involve generating novel problems and using reality as the ultimate benchmark—testing technologies or ideas directly in the physical world

Multi-Agent Systems & Test-Time Compute 19:39

  • Grok 4 Heavy version uses multiple agents in parallel to tackle problems, share insights, and select the best solution
  • This boosts performance, solving over 50% of the HLE problems, especially using parallel agents at inference/test time

Demonstrations: Practical Problem Solving 22:11

  • Grok 4 and Grok 4 Heavy are shown solving academic and real-world tasks (e.g., math problems, World Series prediction, searching through X posts, identifying weird photos)
  • Unique advantage in real-time data from the X (formerly Twitter) dataset, providing up-to-date and rich information not accessible to competitors
  • Model demonstrates ability to generate visualizations and simulations (e.g., black hole collisions) using accessible resources, though limited by browser-based computation

Evaluation on Benchmarks & Limitations 32:43

  • Grok 4 achieves high scores on major benchmarks like GPQA and the American Invitational Mathematics Exam (AIME 2025), obtaining a perfect AIME score
  • Outperforms previous top models by significant margins, particularly on reasoning benchmarks
  • Weakness noted in current version's image understanding and generation, but improvements planned with upcoming version 7 of the foundation model

Voice Features & User Experience 35:40

  • Introduction of new, high-quality voice modes ("Sal," "Eve") with improved latency and naturalness, sometimes demonstrated through creative tasks (e.g., singing operas about Diet Coke)
  • Voice interactions aim for calm and smooth conversational styles, competing with OpenAI's advanced voice mode
  • Voice model latency has been halved and user base has increased tenfold since launch

API Release & Real-World Automation 41:01

  • Grok 4 is available via API at launch, facilitating integration for developers into applications
  • Demonstrates leading performance on ARC AGI test (15.8% accuracy, double that of the next-best model)
  • Intelligence-per-dollar of Grok 4 is described as uniquely high

Real-World Use Cases & Business Automation 42:44

  • Grok 4 tested by Endon Labs on "Vending Bench," an AI simulation where it manages virtual vending machine businesses
  • Model achieved consistent, high performance: double the net worth of previous best models, maintaining strategies over long sessions
  • Early adopters in biomedical and financial sectors automate research flows (e.g., scanning experiment logs or analyzing financial data) using Grok 4

Gaming and Multimodal Capabilities 47:28

  • Game developers can use Grok 4 for sourcing assets and automating repetitive tasks in game creation
  • Plans to enhance video understanding, allowing the model to assess and play video games, slated for the upcoming version 7
  • AI-generated video games predicted for next year, with skepticism regarding AI's ability to judge subjective qualities like fun or humor

Roadmap & Closing Notes 50:32

  • Grok 4 and Grok 4 Heavy are available at release; coding model expected in August, multimodal agent in September, and video generation in October
  • Host plans deep-dive videos and testing in the near future, encouraging subscriptions for further content