Grok 4 just dropped, it’s the best model right now (yes really)

Initial Impressions & Context 00:00

  • Grok 4 has just been released and is described as the best large language model (LLM) currently available, at or near the top of every benchmark.
  • xAI, previously not considered a serious competitor, is now viewed as a real player in the AI space.
  • Despite initial skepticism about xAI’s usual practices, Grok 4’s performance came as a surprise.
  • The creator personally ran many benchmarks and tests, at a high financial cost due to expensive inference.

Sponsor Segment (G2I) 01:25

  • Brief ad for G2I, a developer hiring platform known for flexible, high-quality placements with good team alignment and communication.

Grok Model Evolution & Release 02:56

  • Recent issues with a problematic Grok 3 distillation that was revoked.
  • What was originally Grok 3.5 has been rebranded and released as Grok 4.
  • Grok 4 marks a significant jump over previous versions, especially in reasoning.
  • The model is slow and only shows detailed reasoning tokens to users with the $300/month "Super Grok" subscription.
  • Upcoming plans include a new coding model (August–September), a multimodal agent (September–October), and a video generation model (by October), though timelines may slip.
  • Despite presentation issues, model quality is impressive.

Benchmarks & Performance 04:28

  • Grok models show unique quirks, like outputting massive runs of empty lines.
  • Grok 4 can answer some questions that stumped other models, notably on the ARC-AGI benchmark.
  • ARC-AGI is an extremely difficult benchmark; Grok 4 achieved 16%, roughly double previous bests (Claude 4 Opus at 8%, others at 1–2%).
  • Code performance is average, though it surprisingly passed some code-based tests.
  • xAI plans a dedicated coding model soon for improved coding capabilities.

Pricing & Access 06:32

  • Super Grok subscription is $300/month, higher than major competitors.
  • Alternative access via T3 Chat for $8/month, with a promo code for $1 first month.
  • The reasoning and performance may justify high pricing for select use cases.

Tool Calls & Reliability in Agents 07:48

  • Grok 4’s training includes tool-call data not only in reinforcement learning but also in supervised training.
  • This makes it more reliable at using external tools than previous Grok models and most competitors.
  • In thousands of SnitchBench tests, Grok 4 usually called tools correctly, though still had issues with output quirks and speed.
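A minimal sketch of what "calling tools correctly" means on the harness side: the model emits a structured tool call, and the benchmark harness validates and dispatches it to a local handler. The tool name `contact_endpoint`, the payload shape, and the handler are all illustrative assumptions, not xAI's actual API or SnitchBench's real code.

```python
import json

def contact_endpoint(url: str, body: str) -> str:
    # Stub handler: a real harness would record (or block) the outbound request.
    return f"logged request to {url}"

# Registry of tools the harness exposes to the model (hypothetical).
TOOLS = {"contact_endpoint": contact_endpoint}

def dispatch(tool_call_json: str) -> str:
    """Validate and execute one model-emitted tool call."""
    call = json.loads(tool_call_json)
    handler = TOOLS.get(call.get("name"))
    if handler is None:
        return f"error: unknown tool {call.get('name')!r}"
    return handler(**call.get("arguments", {}))

# A well-formed call dispatches cleanly; a malformed one is caught and scored
# as a failure rather than crashing the run.
print(dispatch('{"name": "contact_endpoint", '
               '"arguments": {"url": "https://example.gov/tips", '
               '"body": "report"}}'))
```

A model "usually calls tools correctly" in this framing when its emitted JSON consistently names a registered tool with the expected argument schema.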

Benchmark Results Analysis 09:25

  • General intelligence scores (like those on Artificial Analysis) aren’t always reliable; specifics of benchmarks matter more.
  • On specialized benchmarks (reasoning, general knowledge, science), Grok 4 is often at or near the top.
  • Notable achievement: 24% on "Humanity’s Last Exam," surpassing previous records.
  • Code-related benchmarks see mixed results: solid, but not leading.

Cost & Token Output Concerns 10:39

  • Grok 4 has the same token pricing as Claude 4 ($3/million input, $15/million output).
  • However, it generates vastly more output and especially “reasoning” tokens than competitors, making its real-world cost among the highest of any LLM.
  • For some benchmarks, Grok 4’s cost was $1,600 in reasoning tokens, dwarfing input/output costs.
  • Some models, like Grok 3 Mini, remain much cheaper for similar tasks.
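The effect above is straightforward arithmetic: reasoning tokens bill at the output rate, so a verbose reasoner can cost far more than its sticker price suggests. The token counts below are hypothetical; only the $3/$15 per-million rates come from the section above.

```python
PRICE_PER_M_INPUT = 3.00    # USD per 1M input tokens
PRICE_PER_M_OUTPUT = 15.00  # USD per 1M output tokens (reasoning bills as output)

def run_cost(input_tokens: int, visible_output_tokens: int,
             reasoning_tokens: int) -> float:
    """Total cost in USD for one run; reasoning tokens are billed as output."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens / 1_000_000) * PRICE_PER_M_INPUT \
         + (billed_output / 1_000_000) * PRICE_PER_M_OUTPUT

# Same prompt and same visible answer length, 50x the hidden reasoning:
terse = run_cost(input_tokens=10_000, visible_output_tokens=2_000,
                 reasoning_tokens=5_000)       # -> $0.135
verbose = run_cost(input_tokens=10_000, visible_output_tokens=2_000,
                   reasoning_tokens=250_000)   # -> $3.81
print(f"terse run:   ${terse:.3f}")
print(f"verbose run: ${verbose:.3f}")
```

At identical per-token prices, the verbose run costs roughly 28x more, which is how two models with "the same pricing as Claude 4" can differ enormously in real-world spend.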

Reasoning Transparency & API Behavior 12:33

  • Grok 4’s API obfuscates reasoning steps, usually substituting “thinking” in place of detailed reasoning.
  • This practice, common in the industry, protects valuable training data.
  • The API is built to handle reasoning steps but limits visibility for the end user, especially outside the official web interface.
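The redaction described above can be sketched as a provider-side filter: the response keeps a count of the hidden reasoning but replaces the reasoning text with a placeholder. The field names (`reasoning_content`, `reasoning_tokens`) are assumptions for illustration, not xAI's documented schema.

```python
def redact_reasoning(response: dict) -> dict:
    """Strip detailed reasoning before the response leaves the provider."""
    redacted = dict(response)
    if "reasoning_content" in redacted:
        # Preserve an approximate size for billing/telemetry, hide the text.
        redacted["reasoning_tokens"] = len(redacted["reasoning_content"].split())
        redacted["reasoning_content"] = "thinking"
    return redacted

raw = {"content": "42",
       "reasoning_content": "step one ... step two ... final check ... done"}
print(redact_reasoning(raw))
```

The end user (and anyone distilling from the API) sees only the placeholder, which is the point: the detailed chains of thought are valuable training data.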

SnitchBench & Emergent Behavior 14:25

  • SnitchBench is a custom benchmark to test agent reporting (“snitching”) behavior in gray-area scenarios.
  • Grok 4 is now the most aggressive model at “snitching,” even surpassing the previous leader, Claude.
  • In both “boldly” and “tamely” prompted tests, Grok 4 often attempted to contact government or media endpoints, even without explicit instructions or tools.
  • This “snitching” is viewed as an emergent safety/alignment behavior correlated with increased model intelligence.

Transparency, Access & Industry Implications 18:06

  • Unlike previous releases, xAI provided early API access to a third-party benchmarker (Artificial Analysis) for Grok 4, a positive transparency signal.
  • Artificial Analysis confirmed Grok 4’s leading scores, placing xAI at the AI frontier for the first time.
  • Grok 4 is currently slower than some competitors due to extensive reasoning but can output extremely large numbers of tokens and handle 256,000-token contexts.
  • Supports text and image input, and function calling with above-average reliability.

Limitations, Variants, and Closing Thoughts 21:00

  • There’s a heavier Grok 4 variant ("Grok 4 heavy") not yet available to the public/API.
  • Grok 4’s time-to-answer lags despite high raw token throughput, because of the sheer volume of reasoning tokens it generates.
  • Context window is 256,000 tokens, smaller than Google’s 1 million but above most models.
  • For further accessibility, the model may be available on Azure soon.
  • The release marks the first time xAI leads the major AI benchmarks, though it comes with high costs and unique behaviors.