A Taxonomy for Next-gen Reasoning — Nathan Lambert, Allen Institute (AI2) & Interconnects.ai
Reflections on Recent Progress in Reasoning Models 00:00
The speaker reflects on the six months of advances since reinforcement learning with verifiable rewards (RLVR) took hold, noting that reasoning models are now commonplace and that major players like OpenAI lead the field while sharing fewer details.
Reasoning models are enabling new language model applications, exemplified by the speaker’s use of o3 for efficient information retrieval compared to traditional search.
Emerging language model tools, such as Deep Research and Claude Code, can creatively process specific website content and assist in personal projects.
Fully autonomous agents and coding assistants are appearing, though their deployment is limited by current hardware or access constraints.
Recent breakthroughs in reasoning models underlie significant improvements in language model capabilities and application potential.
Advancing the Frontiers—Taxonomy for Reasoning and Planning 04:04
Continued AI gains require deliberate research and innovation, not just scaling; new model abilities must be explicitly trained for rather than expected to emerge on their own.
A taxonomy for next-generation models is proposed: skills (math, code), calibration (matching effort to task complexity), strategy (high-level directional planning), and abstraction (decomposing complex problems into manageable parts).
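A minimal sketch, not from the talk, of how these four axes could be encoded as an evaluation rubric; the numeric scoring convention is an assumption, only the axis names come from the taxonomy:

```python
from dataclasses import dataclass

@dataclass
class ReasoningTaxonomy:
    """Illustrative rubric over the four proposed axes (scores in [0, 1] are hypothetical)."""
    skills: float       # raw ability on verifiable domains, e.g. math and code
    calibration: float  # how well reasoning effort matches task difficulty
    strategy: float     # quality of the high-level direction chosen before solving
    abstraction: float  # ability to decompose a problem into manageable parts

def weakest_axis(score: ReasoningTaxonomy) -> str:
    """Return the axis a training effort should target next."""
    axes = vars(score)
    return min(axes, key=axes.get)

# Example: a model strong on skills but poorly calibrated.
print(weakest_axis(ReasoningTaxonomy(skills=0.9, calibration=0.3, strategy=0.5, abstraction=0.4)))
# -> "calibration"
```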
Calibration is increasingly important for product reliability, as models risk overthinking and wasting computation, leading to higher costs and user dissatisfaction.
Current models tend to quickly dive into problem-solving without effective high-level strategizing or abstraction, often resulting in inefficiency.
Progress in tool use and agentic abilities is noted, with ongoing challenges in measurement and consistency.
Calibration, Overthinking, and User Experience 08:27
Lack of calibration leads to unnecessarily verbose outputs and wasted resources, especially for simple queries.
Comparison data shows reasoning-trained models can use 10–100x more tokens than standard models for the same task, affecting latency and infrastructure cost.
There is a need for models to natively adjust their reasoning depth to the complexity of the user’s request, improving both efficiency and user experience.
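One hypothetical way to approximate this calibration outside the model is to route each request to a thinking-token budget based on estimated difficulty; the mapping and thresholds below are illustrative assumptions, not an actual API, and the 10–100x spread mirrors the token inflation noted above:

```python
def thinking_budget(estimated_difficulty: float, base_tokens: int = 256) -> int:
    """Map an estimated task difficulty in [0, 1] to a max reasoning-token budget."""
    if estimated_difficulty < 0.2:   # trivial lookups: answer almost directly
        return base_tokens
    if estimated_difficulty < 0.6:   # moderate tasks: allow ~10x expansion
        return base_tokens * 10
    return base_tokens * 100         # hard, multi-step tasks: allow ~100x expansion

# A well-calibrated model would internalize this trade-off instead of relying on
# an external router or explicit prompting from the user.
for d in (0.1, 0.4, 0.9):
    print(d, thinking_budget(d))
```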
High-level strategy in planning remains absent; current models rarely outline a plan before tackling subproblems, leading to wasted computation and longer response times.
The ultimate goal is for models to internalize planning abilities, rather than requiring explicit prompting from users.
Building Practical Systems and the Need for Planning 11:06
Effective application of reasoning models requires innovations in memory management, parallel computation, and dynamic task decomposition.
Language models should be able to coordinate multiple subprocesses or agents (e.g., parallel Claude Code executions), not just single, linear completions.
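A minimal orchestration sketch of this fan-out pattern, with the agent call stubbed out; the function names and structure are assumptions for illustration, not a specific product API:

```python
import asyncio

async def run_subagent(subtask: str) -> str:
    """Placeholder for one agent run (e.g. a Claude Code or search session).
    A real implementation would call a model or tool API here; this stub echoes."""
    await asyncio.sleep(0.1)  # simulate I/O-bound model or tool latency
    return f"result for: {subtask}"

async def plan_and_fan_out(task: str, subtasks: list[str]) -> str:
    """Dispatch subtasks concurrently, then merge their results into one answer."""
    results = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    return f"{task}\n" + "\n".join(results)

# Usage: three subproblems explored in parallel rather than one linear completion.
print(asyncio.run(plan_and_fan_out(
    "survey recent RLVR papers",
    ["collect candidate papers", "summarize training setups", "compare reported results"],
)))
```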
Developing robust planning behaviors will likely follow the same path as reasoning traces: large-scale human annotation or supervision followed by extensive RL tuning.
Planning data is easier to collect and verify than complex reasoning traces, making iterative improvement more feasible.
Structured outputs, such as explicit stepwise plans before generating answers, could scaffold improved planning capabilities.
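A sketch of what such a scaffold could look like, with the plan emitted as a separate, checkable artifact before the answer; the schema fields and prompt wording are assumptions, not a method from the talk:

```python
from typing import TypedDict

class Plan(TypedDict):
    """Illustrative schema for an explicit plan emitted before the answer."""
    goal: str
    steps: list[str]       # ordered subproblems the model commits to before solving
    success_criteria: str  # how the final answer will be checked

def plan_then_answer_prompt(question: str) -> str:
    """Two-stage prompt: request a structured Plan first, then the solution."""
    return (
        f"Question: {question}\n"
        "First, output a JSON object with keys 'goal', 'steps', and 'success_criteria'.\n"
        "Then, and only then, solve the steps in order and give the final answer."
    )

print(plan_then_answer_prompt("Estimate the serving cost of a 70B model at 1k QPS."))
```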
Current models like o3 excel at specialized skills (e.g., search) but lack consistency and thoroughness in broader, multi-step tasks due to weak planning.
Improving planning would make outputs more comprehensive and trustworthy, enhancing utility in tasks like research or recommendations.
By breaking down tasks into skill, calibration, strategy, and abstraction, researchers can better target data collection and algorithmic improvements.
Reinforcement Learning, Parallel Compute, and Research Roadmap 15:05
Parallel compute adds robustness and can refine model outputs but does not address the fundamental need for better exploration in long-horizon RL tasks.
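As an illustration of how parallel compute refines outputs without improving exploration, here is a simple self-consistency vote over independent samples; the sampler is stubbed, and this is a generic technique rather than a method described in the talk:

```python
from collections import Counter

def sample_answer(question: str, seed: int) -> str:
    """Stub for one independent model sample; a real call would hit an LLM API."""
    return ["42", "42", "41"][seed % 3]  # pretend two of every three samples agree

def self_consistency(question: str, n: int = 9) -> str:
    """Majority vote over n parallel samples. This adds robustness to noise, but every
    sample explores the same way, which is the limitation noted above."""
    votes = Counter(sample_answer(question, seed) for seed in range(n))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # -> "42"
```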
Scaling RL for reasoning and planning is highly tractable and is expected to continue expanding as a research focus.
A summarized research plan: gather diverse, verifiable questions; filter them so they are neither too easy nor too hard for the current model; ensure stable infrastructure for RL training; and fold in incremental improvements as research findings arrive.
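A sketch of the filtering step, a common practice in RL data curation; the thresholds and helper names are assumptions, and the pass-rate check is stubbed where a real pipeline would sample the current policy and score it with the verifier:

```python
import random

def pass_rate(prompt: str, k: int = 8) -> float:
    """Fraction of k sampled attempts that the verifier marks correct.
    Stubbed with a random number here for illustration."""
    return random.random()

def filter_for_rl(prompts: list[str], low: float = 0.1, high: float = 0.9) -> list[str]:
    """Keep prompts that are neither almost always solved (too easy, little gradient
    signal) nor almost never solved (too hard, mostly zero reward)."""
    return [p for p in prompts if low <= pass_rate(p) <= high]

print(filter_for_rl(["q1", "q2", "q3", "q4"]))
```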
Attention is shifting toward post-training, especially RL-based fine-tuning, with compute investment rising; OpenAI reportedly increased post-training compute by 10x in recent model iterations.
Compute Investment and the Future of Model Training 17:32
Post-training (RL-based) is rapidly approaching parity with pre-training in compute investment, in contrast to earlier models where post-training was a small fraction of total compute.
Example: DeepSeek’s post-training compute share reportedly rose from 0.18% of GPU time to potentially 10–20%, signaling an industry trend.
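A back-of-the-envelope illustration of what such a shift in share implies; the pre-training total below is a hypothetical round number, not a DeepSeek figure:

```python
pretraining_gpu_hours = 3_000_000  # hypothetical pre-training budget

# Post-training at a 0.18% share of pre-training vs. a future 10-20% share.
old_share, new_low, new_high = 0.0018, 0.10, 0.20

print(f"old post-training: {pretraining_gpu_hours * old_share:,.0f} GPU-hours")
print(f"future post-training: {pretraining_gpu_hours * new_low:,.0f} "
      f"to {pretraining_gpu_hours * new_high:,.0f} GPU-hours")
# Roughly a 55-110x increase in post-training compute under these assumptions.
```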
The push toward solving long-horizon, planner-style tasks will require further scaling of both RL and infrastructure.
Embracing task breakdown and hierarchical planning capabilities is seen as essential for the next generation of language models.