Training Agentic Reasoners — Will Brown, Prime Intellect

Introduction and High-Level Thesis 00:00

  • Will Brown introduces the discussion on training agentic reasoners, highlighting the close relationship between reasoning models and agents.
  • Emphasizes that reinforcement learning (RL) now works effectively at scale, referencing DeepSeek as a notable example.

Reinforcement Learning’s Role in Agentic Models 01:05

  • RL, when applied with a good setup and signal, results in model improvement; this underpins the approach of major AI labs.
  • OpenAI’s o3 model is highlighted as agentic, excelling at tool use for complex tasks.
  • RL is credited as the key to maintaining performance as systems grow in complexity, preventing them from becoming brittle.
  • Conducting RL at scale, particularly outside big labs, remains a research challenge but is becoming more accessible.

Architecture and Practical Considerations 03:03

  • RL setups like verl (a research training framework) and the approaches described in DeepSeek’s papers involve complex architectures that can be intimidating.
  • Realistically, some complexity is necessary for performance, but best practices are emerging that lower this barrier over time.
  • There’s an advantage for those who can apply RL to open models for specific tasks, which could foster defensible product moats for smaller organizations.

The Connection Between Agents and Reinforcement Learning 05:45

  • Many successful agentic products rely on RL-trained models, explaining their problem-solving abilities with tools.
  • Building and optimizing an agent maps conceptually onto RL: the harness is the environment, the LM is the policy, its outputs and tool calls are actions, and evals supply the reward (see the sketch below).
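
To make that mapping concrete, here is a minimal Python sketch, not code from the talk, that frames a generic agent harness as an RL episode; the `env` and `policy` objects and their method names are illustrative assumptions.

```python
# Illustrative sketch: an agent "harness" written as one RL episode.
# The env/policy objects and their methods are hypothetical placeholders.

def rollout(env, policy, max_turns=10):
    """The LM is the policy, its outputs/tool calls are actions,
    the harness is the environment, and the eval is the reward."""
    state = env.reset()                 # initial prompt / task description
    trajectory = []
    for _ in range(max_turns):
        action = policy(state)          # LM output: text and/or a tool call
        state, done = env.step(action)  # harness runs tools, updates context
        trajectory.append((state, action))
        if done:
            break
    reward = env.score(trajectory)      # the eval: was the task completed?
    return trajectory, reward
```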

Manual Tuning vs. Automated RL 06:39

  • Manual prompt tuning and harness iteration resemble "doing RL by hand."
  • Automated RL algorithms streamline this process by learning from successful (or unsuccessful) attempts.
  • Key algorithms in the space include PPO, GRPO, and DPO, each with trade-offs between granularity, computational efficiency, and implementation complexity (a GRPO sketch follows this list).
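
One concrete instance of these trade-offs: GRPO drops PPO's learned value function (critic) and instead normalizes rewards within a group of completions sampled for the same prompt. A minimal sketch of that advantage computation, with assumed names and toy rewards:

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the style of GRPO: each completion's
    reward is normalized against the other completions sampled for the
    same prompt, so no separate value model (critic) is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. four sampled completions for one prompt, scored by the eval:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # successes get positive advantage
```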

Navigating Rapid Research Developments 09:09

  • The proliferation of new RL papers makes it challenging to identify meaningful advances.
  • Suggests a holistic focus on the overarching RL process, leaving detailed optimization to maturing software tools.

Importance of Tools for Agents 10:20

  • The defining feature of agents is the ability to use tools to interact with environments.
  • Examples include MCP (which gives LMs access to tools), code editing, file changes, etc.; a minimal tool-definition sketch follows this list.
  • Most RL research tools and code are currently tailored for code and math tasks, due in part to ease of evaluation.
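
For reference, tool use usually means handing the model a JSON-schema description of each tool and executing the calls it emits. A minimal sketch in the OpenAI-style function-calling format; the `read_file` tool and the dispatcher are illustrative, not from the talk:

```python
import json

# One tool described in the OpenAI-style function-calling format;
# MCP servers advertise tools with a similar JSON-schema description.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def dispatch(tool_call):
    """Execute a tool call emitted by the model.
    `tool_call` is assumed to be a dict with 'name' and JSON 'arguments'
    (a simplified version of the shape API responses use)."""
    args = json.loads(tool_call["arguments"])
    if tool_call["name"] == "read_file":
        with open(args["path"]) as f:
            return f.read()
    raise ValueError(f"unknown tool: {tool_call['name']}")
```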

Challenges in Reward Design and Evaluation 11:43

  • Real-world tasks are messier than benchmarks; simple benchmarks can’t drive system progress alone.
  • Properly designing reward signals (evals) is critical: rewards must encourage the desired behavior rather than invite reward hacking, as illustrated in the toy example below.
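
A toy illustration of the reward-hacking concern on a code task: if the reward is only "the tests pass", an agent can learn to edit the tests instead of the code, so the reward should also verify that protected files are untouched. Every name below is hypothetical:

```python
import hashlib
import subprocess
from pathlib import Path

def file_hash(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def code_task_reward(repo_dir, protected_tests, original_hashes):
    """Reward 1.0 only if the test suite passes AND the agent did not
    tamper with the test files (a common reward hack)."""
    for rel_path in protected_tests:
        if file_hash(Path(repo_dir) / rel_path) != original_hashes[rel_path]:
            return 0.0  # reward hacking detected: tests were modified
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    return 1.0 if result.returncode == 0 else 0.0
```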

The Pursuit of Better Evaluation and Rubrics 13:13

  • Effective evaluation strategies should make correct task completion easier than cheating or "gaming" the system.
  • Discusses the emerging use of LMs as subroutines in evaluation via “rubrics,” allowing nuanced, task-specific, on-the-fly assessment (sketched below).
  • References DeepSeek and other work demonstrating the possibility of such dynamic, fine-grained evaluations, especially for ambiguous tasks.
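
A minimal sketch of the rubric idea, assuming an OpenAI-compatible chat client as the judge; the rubric items, prompt wording, and model name are placeholders rather than anything from the talk:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key

RUBRIC = [
    "Cites at least one concrete source for each claim.",
    "Answers the user's actual question rather than a nearby one.",
    "Contains no fabricated numbers or references.",
]

def rubric_score(task, answer, model="gpt-4o-mini"):
    """Average per-criterion YES/NO judgments from an LM judge into a
    0..1 reward, which is finer-grained than a single pass/fail check."""
    scores = []
    for criterion in RUBRIC:
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": (
                    f"Task: {task}\n\nAnswer: {answer}\n\n"
                    f"Criterion: {criterion}\nReply with only YES or NO."
                ),
            }],
        )
        verdict = resp.choices[0].message.content.strip().upper()
        scores.append(1.0 if verdict.startswith("YES") else 0.0)
    return sum(scores) / len(scores)
```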

Multi-Turn and Multi-Agent Systems 14:52

  • Advocates for multi-step, agentic search and planning, with long-horizon use of tools for more complex tasks.
  • The environment, reward, and policy in RL correspond to harnesses, evals, and LMs, respectively, in agent development.

The "Verifiers" Toolkit and Lowering the Barrier for RL 15:41

  • Introduces "verifiers," a toolkit designed so building trainable agents via RL is as simple as building standard agents.
  • Demonstrates simplified multi-turn RL through toy problems (e.g., Wordle agent).
  • The toolkit allows for debugging with standard APIs, synthetic data training (SFT), and then transition to RL.
  • Engineering efforts focus on computational efficiency, async operation, and low entry barriers for research and experimentation.
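
For intuition, a rough sketch of the shape of a multi-turn environment like the Wordle example: a reset that states the task, a per-turn step that returns the environment's feedback, and a final reward. This is illustration only, not the actual verifiers API:

```python
# Illustration of a multi-turn environment's shape, not the verifiers API.
SECRET = "crane"

class WordleEnv:
    def __init__(self, max_turns=6):
        self.max_turns = max_turns

    def reset(self):
        self.turns = 0
        return ("Guess the 5-letter word. Feedback per letter: "
                "G = right spot, Y = in word, X = absent.")

    def step(self, guess):
        """One turn: score the model's guess and say whether the game is over."""
        self.turns += 1
        # Simplified feedback (ignores duplicate-letter edge cases).
        feedback = "".join(
            "G" if g == s else ("Y" if g in SECRET else "X")
            for g, s in zip(guess.lower(), SECRET)
        )
        done = feedback == "GGGGG" or self.turns >= self.max_turns
        return feedback, done

    def reward(self, solved):
        # Binary outcome reward; could be shaped by the number of turns used.
        return 1.0 if solved else 0.0
```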

Encouragement and Closing Remarks 18:28

  • Invites more people to experiment with agentic RL, as tools now make sophisticated research feasible on just a couple of GPUs.
  • Mentions ongoing improvements and accessibility, aiming to spread adoption of these techniques.
  • Session concludes without a Q&A.