Claude Just Got a Big Update (Opus 4.1)

Claude Opus 4.1 Release Overview 00:00

  • Anthropic released Claude Opus 4.1, an updated version of Claude Opus 4.
  • The update focuses on agentic tasks, real-world coding, and reasoning improvements.
  • Anthropic plans to release even larger improvements in the coming weeks.
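For developers already calling Claude through the Anthropic Messages API, adopting the update amounts to swapping the model identifier. The sketch below illustrates this under the assumption that the new alias is `claude-opus-4-1`; check Anthropic's model list for the exact (possibly dated) ID before relying on it.

```python
# Minimal sketch: upgrading to Opus 4.1 is a one-line change to the
# model identifier in an Anthropic Messages API request body.
# The model ID string "claude-opus-4-1" is an assumption here.

def build_request(prompt: str, model: str = "claude-opus-4-1") -> dict:
    """Assemble a Messages API request body; only `model` changes for the upgrade."""
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_request("Refactor this function to be iterative.")
```

Keeping the model ID as a parameter (or a config value) makes it easy to A/B the old and new versions on your own workloads, which the video's closing section recommends over trusting benchmarks alone.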

Benchmark Performance 00:37

  • Opus 4.1 achieved 74.5% on SWE-bench Verified, up from 72.5% for Opus 4 and 62.3% for Sonnet 3.7.
  • The Terminal-Bench score improved from 39.2 to 43.3.
  • Graduate-level reasoning (GPQA Diamond) saw a minor increase from 79.6 to 80.9.
  • Agentic tool use (TAU-bench) improved for retail to 82.4 from 81.4, but dropped for airline scenarios to 56.0% from 59.6%.
  • Multilingual Q&A performance rose to 89.5 from 88.8, visual reasoning had a small increase, and AIME 2025 saw a 2.5-point boost to 78%.

Comparison with Competing Models 02:13

  • Opus 4.1 outperforms OpenAI's o3 and Google's Gemini 2.5 Pro on SWE-bench Verified and Terminal-Bench.
  • It trails both competitors on GPQA Diamond and agentic tool use, and is notably behind on the high school math competition AIME 2025 (Opus 4.1 at 78%, o3 at 88.9%, Gemini 2.5 Pro at 88%).

Practical Use and Closing Thoughts 02:43

  • The model's actual value becomes clear only through real-world usage, not benchmarks alone.
  • Claude is recognized as the top coding model currently available, especially for agentic coding.
  • The update details were brief, and further testing is planned.
  • Viewers are encouraged to test the model and share feedback.