SUMMARY

The Grok 4 livestream was delayed by over an hour; the host emphasizes the significance of, and anticipation for, this release
The host plans to make multiple future videos testing Grok 4
Grok 4 is claimed to achieve perfect SAT scores and near-perfect results on GRE and graduate-level exams across diverse disciplines, even with previously unseen questions

Grok 4 is described as having generalization abilities across all academic fields, outperforming almost all graduate students simultaneously
Moving from Grok 2 to Grok 3 to Grok 4, training compute has increased by about 100 times at each step
Grok 3 focused on scaling pre-training; Grok 4 scales up reinforcement learning and post-training even more
Large computational resources are used: 200,000 GPUs for reinforcement learning, with a focus on verifiable reward outcomes
The creation of Colossus, a supercomputer with 100,000 H100 GPUs, facilitated vast increases in training compute
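
The "verifiable reward outcomes" mentioned above can be illustrated with a minimal sketch. It assumes a reward that simply checks a model's final answer against a known reference; the function name and the exact-match rule are hypothetical and are not taken from the livestream or from xAI's training setup.

```python
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 when the normalized answers match exactly, else 0.0.

    Hypothetical illustration: real verifiers may run unit tests,
    check proofs, or compare numeric results within a tolerance.
    """
    def normalize(text: str) -> str:
        # Ignore surrounding whitespace and letter case.
        return text.strip().lower()

    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0


if __name__ == "__main__":
    print(verifiable_reward(" 42 ", "42"))  # matching answer
    print(verifiable_reward("41", "42"))    # non-matching answer
```

The point of such a reward is that it can be computed automatically, without human judgment, which is what makes it usable at large scale.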

Grok 4 is tested on the "Humanity's Last Exam" (HLE), a set of 2,500 challenging, PhD-level problems across mathematics, natural sciences, engineering, and humanities
Prior models reached only single-digit accuracy; Grok 4 scores much higher
Grok 4 is said to perform at PhD level in every academic subject, although it hasn't yet discovered new physics or invented new technology—a milestone predicted to be reached within one or two years

Training Grok 4 with native tool use (such as web search and memory) significantly improved its ability to use tools
The current tools are considered primitive compared with the advanced physical simulation tools used at companies like Tesla or SpaceX, which are to be integrated later in the year
Future plans involve providing Grok with accurate physics simulators and ability to interact with the real world via humanoid robots (like Optimus)
Emphasis on entering an "intelligence explosion" era

Host echoes developers' focus on AI safety, truth-seeking, and instilling positive values, likened to raising a super-genius child
Ongoing concern about whether these values will persist as AI surpasses human intelligence

Scaling reinforcement learning is reaching a data bottleneck, as there are limited verifiable, challenging problems
Solutions involve generating novel problems and using reality as the ultimate benchmark—testing technologies or ideas directly in the physical world

Grok 4 Heavy version uses multiple agents in parallel to tackle problems, share insights, and select the best solution
This boosts performance: running parallel agents at inference (test) time, Grok 4 Heavy solves over 50% of the HLE problems
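
The parallel-agent idea can be sketched as follows. This is a toy illustration under stated assumptions, not a description of how Grok 4 Heavy works: `toy_agent` is a hypothetical stand-in for a model call, and majority voting is an assumed selection rule (the summary only says the agents share insights and select the best solution).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def toy_agent(agent_id: int, problem: tuple[int, int]) -> int:
    """Hypothetical agent: adds two numbers; agent 2 is deliberately wrong."""
    a, b = problem
    answer = a + b
    return answer + 1 if agent_id == 2 else answer


def solve_in_parallel(problem: tuple[int, int], num_agents: int = 4) -> int:
    """Run several agents concurrently and keep the most common answer."""
    with ThreadPoolExecutor(max_workers=num_agents) as pool:
        answers = list(pool.map(lambda i: toy_agent(i, problem), range(num_agents)))
    # Majority vote is an assumed selection rule for this sketch.
    best_answer, _votes = Counter(answers).most_common(1)[0]
    return best_answer


if __name__ == "__main__":
    print(solve_in_parallel((2, 3)))
```

With four agents, three agree on the correct sum and outvote the faulty one, which shows why spending more compute at test time can raise accuracy even when individual agents make mistakes.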

Grok 4 and Grok 4 Heavy are shown solving academic and real-world tasks (e.g., math problems, World Series prediction, searching through X posts, identifying weird photos)
Unique advantage in real-time data from the X (formerly Twitter) dataset, providing up-to-date and rich information not accessible to competitors
Model demonstrates ability to generate visualizations and simulations (e.g., black hole collisions) using accessible resources, though limited by browser-based computation

Grok 4 achieves high scores on major benchmarks like GPQA and the American Invitational Mathematics Examination (AIME 2025), obtaining a perfect AIME score
Outperforms previous top models by significant margins, particularly on reasoning benchmarks
Weakness noted in current version's image understanding and generation, but improvements planned with upcoming version 7 of the foundation model

Introduction of new, high-quality voice modes ("Sal," "Eve") with improved latency and naturalness, sometimes demonstrated through creative tasks (e.g., singing operas about Diet Coke)
Voice interactions aim for calm and smooth conversational styles, competing with OpenAI's advanced voice mode
Voice model latency has been halved and user base has increased tenfold since launch

Grok 4 is available via API at launch, facilitating integration for developers into applications
Grok 4 demonstrates leading performance on the ARC-AGI test (15.8% accuracy, double that of the next-best model)
Intelligence-per-dollar of Grok 4 is described as uniquely high

Grok 4 was tested by Andon Labs on "Vending-Bench," an AI simulation in which it manages virtual vending machine businesses
Model achieved consistent, high performance: double the net worth of previous best models, maintaining strategies over long sessions
Early adopters in biomedical and financial sectors automate research flows (e.g., scanning experiment logs or analyzing financial data) using Grok 4

Game developers can use Grok 4 for sourcing assets and automating repetitive tasks in game creation
Plans to enhance video understanding, allowing the model to assess and play video games, slated for the upcoming version 7
AI-generated video games predicted for next year, with skepticism regarding AI's ability to judge subjective qualities like fun or humor

Grok 4 and Grok 4 Heavy are available at release; coding model expected in August, multimodal agent in September, and video generation in October
Host plans deep-dive videos and testing in the near future, encouraging subscriptions for further content

Grok 4 is HERE! and it's the best? (Livestream Reaction)