Grok 4 just dropped, it’s the best model right now (yes really)

Initial Impressions & Context 00:00

  • Grok 4 has just been released and is described as the best large language model (LLM) currently available, at or near the top of every benchmark.
  • xAI, previously not considered a serious competitor, is now viewed as a real player in the AI space.
  • Despite initial skepticism about xAI’s usual practices, Grok 4’s performance came as a surprise.
  • The creator personally ran many benchmarks and tests, at a high financial cost due to expensive inference.

Sponsor Segment (G2I) 01:25

  • Brief ad for G2I, a developer hiring platform known for flexible, high-quality placements with good team alignment and communication.

Grok Model Evolution & Release 02:56

  • Recent issues with a problematic Grok 3 distillation that was revoked.
  • What was originally Grok 3.5 has been rebranded and released as Grok 4.
  • Grok 4 marks a significant jump over previous versions, especially in reasoning.
  • The model is slow and only shows detailed reasoning tokens to users with the $300/month "Super Grok" subscription.
  • Upcoming plans include a new coding model (August–September), a multimodal agent (September–October), and a video generation model (by October), though timelines may slip.
  • Despite presentation issues, model quality is impressive.

Benchmarks & Performance 04:28

  • Grok models show unique quirks, like outputting massive runs of empty lines.
  • Grok 4 can answer some questions that stumped other models, notably on the ARC-AGI benchmark.
  • ARC-AGI is an extremely difficult benchmark; Grok 4 achieved 16%, roughly double previous bests (Claude 4 Opus at 8%, others at 1–2%).
  • Code performance is average, though it surprisingly passed some code-based tests.
  • xAI plans a dedicated coding model soon for improved coding capabilities.

Pricing & Access 06:32

  • Super Grok subscription is $300/month, higher than major competitors.
  • Alternative access via T3 Chat for $8/month, with a promo code for $1 first month.
  • The reasoning and performance may justify high pricing for select use cases.

Tool Calls & Reliability in Agents 07:48

  • Grok 4’s training includes tool-call data not only in reinforcement learning but also in supervised training.
  • This makes it more reliable at using external tools than previous Grok models and most competitors.
  • In thousands of SnitchBench tests, Grok 4 usually called tools correctly, though still had issues with output quirks and speed.
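A minimal sketch of what "calling tools correctly" means on the harness side: the model emits a structured tool call, and the benchmark harness validates and dispatches it to a local handler. The tool name `contact_endpoint`, the payload shape, and the handler are all illustrative assumptions, not xAI's actual API or SnitchBench's real code.

```python
import json

def contact_endpoint(url: str, body: str) -> str:
    # Stub handler: a real harness would record (or block) the outbound request.
    return f"logged request to {url}"

# Registry of tools the harness exposes to the model (hypothetical).
TOOLS = {"contact_endpoint": contact_endpoint}

def dispatch(tool_call_json: str) -> str:
    """Validate and execute one model-emitted tool call."""
    call = json.loads(tool_call_json)
    handler = TOOLS.get(call.get("name"))
    if handler is None:
        return f"error: unknown tool {call.get('name')!r}"
    return handler(**call.get("arguments", {}))

# A well-formed call dispatches cleanly; a malformed one is caught and scored
# as a failure rather than crashing the run.
print(dispatch('{"name": "contact_endpoint", '
               '"arguments": {"url": "https://example.gov/tips", '
               '"body": "report"}}'))
```

A model "usually calls tools correctly" in this framing when its emitted JSON consistently names a registered tool with the expected argument schema.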

Benchmark Results Analysis 09:25

  • General intelligence scores (like those on Artificial Analysis) aren’t always reliable; specifics of benchmarks matter more.
  • On specialized benchmarks (reasoning, general knowledge, science), Grok 4 is often at or near the top.
  • Notable achievement: 24% on "Humanity’s Last Exam," surpassing previous records.
  • Code-related benchmarks see mixed results: solid, but not leading.

Cost & Token Output Concerns 10:39

  • Grok 4 has the same token pricing as Claude 4 ($3/million input, $15/million output).
  • However, it generates vastly more output and especially “reasoning” tokens than competitors, making its real-world cost among the highest of any LLM.
  • For some benchmarks, Grok 4’s cost was $1,600 in reasoning tokens, dwarfing input/output costs.
  • Some models, like Grok 3 Mini, remain much cheaper for similar tasks.
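The effect above is straightforward arithmetic: reasoning tokens bill at the output rate, so a verbose reasoner can cost far more than its sticker price suggests. The token counts below are hypothetical; only the $3/$15 per-million rates come from the section above.

```python
PRICE_PER_M_INPUT = 3.00    # USD per 1M input tokens
PRICE_PER_M_OUTPUT = 15.00  # USD per 1M output tokens (reasoning bills as output)

def run_cost(input_tokens: int, visible_output_tokens: int,
             reasoning_tokens: int) -> float:
    """Total cost in USD for one run; reasoning tokens are billed as output."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens / 1_000_000) * PRICE_PER_M_INPUT \
         + (billed_output / 1_000_000) * PRICE_PER_M_OUTPUT

# Same prompt and same visible answer length, 50x the hidden reasoning:
terse = run_cost(input_tokens=10_000, visible_output_tokens=2_000,
                 reasoning_tokens=5_000)       # -> $0.135
verbose = run_cost(input_tokens=10_000, visible_output_tokens=2_000,
                   reasoning_tokens=250_000)   # -> $3.81
print(f"terse run:   ${terse:.3f}")
print(f"verbose run: ${verbose:.3f}")
```

At identical per-token prices, the verbose run costs roughly 28x more, which is how two models with "the same pricing as Claude 4" can differ enormously in real-world spend.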

Reasoning Transparency & API Behavior 12:33

  • Grok 4’s API obfuscates reasoning steps, usually substituting “thinking” in place of detailed reasoning.
  • This practice, common in the industry, protects valuable training data.
  • The API is built to handle reasoning steps but limits visibility for the end user, especially outside the official web interface.
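The redaction described above can be sketched as a provider-side filter: the response keeps a count of the hidden reasoning but replaces the reasoning text with a placeholder. The field names (`reasoning_content`, `reasoning_tokens`) are assumptions for illustration, not xAI's documented schema.

```python
def redact_reasoning(response: dict) -> dict:
    """Strip detailed reasoning before the response leaves the provider."""
    redacted = dict(response)
    if "reasoning_content" in redacted:
        # Preserve an approximate size for billing/telemetry, hide the text.
        redacted["reasoning_tokens"] = len(redacted["reasoning_content"].split())
        redacted["reasoning_content"] = "thinking"
    return redacted

raw = {"content": "42",
       "reasoning_content": "step one ... step two ... final check ... done"}
print(redact_reasoning(raw))
```

The end user (and anyone distilling from the API) sees only the placeholder, which is the point: the detailed chains of thought are valuable training data.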

SnitchBench & Emergent Behavior 14:25

  • SnitchBench is a custom benchmark to test agent reporting (“snitching”) behavior in gray-area scenarios.
  • Grok 4 is now the most aggressive model at “snitching,” even surpassing the previous leader, Claude.
  • In both “boldly” and “tamely” prompted tests, Grok 4 often attempted to contact government or media endpoints, even without explicit instructions or tools.
  • This “snitching” is viewed as an emergent safety/alignment behavior correlated with increased model intelligence.

Transparency, Access & Industry Implications 18:06

  • Unlike previous releases, xAI provided early API access to a third-party benchmarker (Artificial Analysis) for Grok 4, a positive transparency signal.
  • Artificial Analysis confirmed Grok 4’s leading scores, placing xAI at the AI frontier for the first time.
  • Grok 4 is currently slower than some competitors due to extensive reasoning but can output extremely large numbers of tokens and handle 256,000-token contexts.
  • Supports text and image input, and function calling with above-average reliability.

Limitations, Variants, and Closing Thoughts 21:00

  • There’s a heavier Grok 4 variant ("Grok 4 heavy") not yet available to the public/API.
  • Grok 4’s time-to-answer lags despite high raw token throughput, because of the sheer volume of reasoning tokens it generates.
  • Context window is 256,000 tokens, smaller than Google’s 1 million but above most models.
  • For further accessibility, the model may be available on Azure soon.
  • The release marks the first time xAI leads the major AI benchmarks, though it comes with high costs and unique behaviors.