Training Agentic Reasoners — Will Brown, Prime Intellect

Introduction and High-Level Thesis 00:00

  • Will Brown introduces the discussion on training agentic reasoners, highlighting the close relationship between reasoning models and agents.
  • Emphasizes that reinforcement learning (RL) now works effectively at scale, referencing DeepSeek as a notable example.

Reinforcement Learning’s Role in Agentic Models 01:05

  • RL, when applied with a good setup and signal, results in model improvement; this underpins the approach of major AI labs.
  • OpenAI’s o3 model is highlighted as agentic, excelling at tool use for complex tasks.
  • RL is credited as the key to maintaining performance as systems grow in complexity, preventing them from becoming brittle.
  • Conducting RL at scale, particularly outside big labs, remains a research challenge but is becoming more accessible.

Architecture and Practical Considerations 03:03

  • RL setups like verl (a research training framework) and the approaches described in DeepSeek’s papers involve complex architectures that can be intimidating.
  • Realistically, some complexity is necessary for performance, but best practices are emerging that lower this barrier over time.
  • There’s an advantage for those who can apply RL to open models for specific tasks, which could foster defensible product moats for smaller organizations.

The Connection Between Agents and Reinforcement Learning 05:45

  • Many successful agentic products rely on RL-trained models, explaining their problem-solving abilities with tools.
  • Building and optimizing an agent maps conceptually onto RL: the harness is the environment, the LM is the policy, its outputs and tool calls are actions, and evals supply the reward (see the sketch below).
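
To make that mapping concrete, here is a minimal Python sketch, not code from the talk, that frames a generic agent harness as an RL episode; the `env` and `policy` objects and their method names are illustrative assumptions.

```python
# Illustrative sketch: an agent "harness" written as one RL episode.
# The env/policy objects and their methods are hypothetical placeholders.

def rollout(env, policy, max_turns=10):
    """The LM is the policy, its outputs/tool calls are actions,
    the harness is the environment, and the eval is the reward."""
    state = env.reset()                 # initial prompt / task description
    trajectory = []
    for _ in range(max_turns):
        action = policy(state)          # LM output: text and/or a tool call
        state, done = env.step(action)  # harness runs tools, updates context
        trajectory.append((state, action))
        if done:
            break
    reward = env.score(trajectory)      # the eval: was the task completed?
    return trajectory, reward
```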

Manual Tuning vs. Automated RL 06:39

  • Manual prompt tuning and harness iteration resemble "doing RL by hand."
  • Automated RL algorithms streamline this process by learning from successful (or unsuccessful) attempts.
  • Key algorithms in the space include PPO, GRPO, and DPO, each with trade-offs between granularity, computational efficiency, and implementation complexity (a GRPO sketch follows this list).
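
One concrete instance of these trade-offs: GRPO drops PPO's learned value function (critic) and instead normalizes rewards within a group of completions sampled for the same prompt. A minimal sketch of that advantage computation, with assumed names and toy rewards:

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the style of GRPO: each completion's
    reward is normalized against the other completions sampled for the
    same prompt, so no separate value model (critic) is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. four sampled completions for one prompt, scored by the eval:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # successes get positive advantage
```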

Navigating Rapid Research Developments 09:09

  • The proliferation of new RL papers makes it challenging to identify meaningful advances.
  • Suggests a holistic focus on the overarching RL process, leaving detailed optimization to maturing software tools.

Importance of Tools for Agents 10:20

  • The defining feature of agents is the ability to use tools to interact with environments.
  • Examples include MCP (which gives LMs access to tools), code editing, file changes, etc.; a minimal tool-definition sketch follows this list.
  • Most RL research tools and code are currently tailored for code and math tasks, due in part to ease of evaluation.
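
For reference, tool use usually means handing the model a JSON-schema description of each tool and executing the calls it emits. A minimal sketch in the OpenAI-style function-calling format; the `read_file` tool and the dispatcher are illustrative, not from the talk:

```python
import json

# One tool described in the OpenAI-style function-calling format;
# MCP servers advertise tools with a similar JSON-schema description.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def dispatch(tool_call):
    """Execute a tool call emitted by the model.
    `tool_call` is assumed to be a dict with 'name' and JSON 'arguments'
    (a simplified version of the shape API responses use)."""
    args = json.loads(tool_call["arguments"])
    if tool_call["name"] == "read_file":
        with open(args["path"]) as f:
            return f.read()
    raise ValueError(f"unknown tool: {tool_call['name']}")
```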

Challenges in Reward Design and Evaluation 11:43

  • Real-world tasks are messier than benchmarks; simple benchmarks can’t drive system progress alone.
  • Properly designing reward signals (evals) is critical: rewards must encourage the desired behavior rather than invite reward hacking, as illustrated in the toy example below.
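
A toy illustration of the reward-hacking concern on a code task: if the reward is only "the tests pass", an agent can learn to edit the tests instead of the code, so the reward should also verify that protected files are untouched. Every name below is hypothetical:

```python
import hashlib
import subprocess
from pathlib import Path

def file_hash(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def code_task_reward(repo_dir, protected_tests, original_hashes):
    """Reward 1.0 only if the test suite passes AND the agent did not
    tamper with the test files (a common reward hack)."""
    for rel_path in protected_tests:
        if file_hash(Path(repo_dir) / rel_path) != original_hashes[rel_path]:
            return 0.0  # reward hacking detected: tests were modified
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    return 1.0 if result.returncode == 0 else 0.0
```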

The Pursuit of Better Evaluation and Rubrics 13:13

  • Effective evaluation strategies should make correct task completion easier than cheating or "gaming" the system.
  • Discusses the emerging use of LMs as subroutines in evaluation via “rubrics,” allowing nuanced, task-specific, on-the-fly assessment (sketched below).
  • References DeepSeek and other work demonstrating the possibility of such dynamic, fine-grained evaluations, especially for ambiguous tasks.
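
A minimal sketch of the rubric idea, assuming an OpenAI-compatible chat client as the judge; the rubric items, prompt wording, and model name are placeholders rather than anything from the talk:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key

RUBRIC = [
    "Cites at least one concrete source for each claim.",
    "Answers the user's actual question rather than a nearby one.",
    "Contains no fabricated numbers or references.",
]

def rubric_score(task, answer, model="gpt-4o-mini"):
    """Average per-criterion YES/NO judgments from an LM judge into a
    0..1 reward, which is finer-grained than a single pass/fail check."""
    scores = []
    for criterion in RUBRIC:
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": (
                    f"Task: {task}\n\nAnswer: {answer}\n\n"
                    f"Criterion: {criterion}\nReply with only YES or NO."
                ),
            }],
        )
        verdict = resp.choices[0].message.content.strip().upper()
        scores.append(1.0 if verdict.startswith("YES") else 0.0)
    return sum(scores) / len(scores)
```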

Multi-Turn and Multi-Agent Systems 14:52

  • Advocates for multi-step, agentic search and planning, with long-horizon use of tools for more complex tasks.
  • The environment, reward, and policy in RL correspond to harnesses, evals, and LMs, respectively, in agent development.

The "Verifiers" Toolkit and Lowering the Barrier for RL 15:41

  • Introduces "verifiers," a toolkit designed so building trainable agents via RL is as simple as building standard agents.
  • Demonstrates simplified multi-turn RL through toy problems (e.g., Wordle agent).
  • The toolkit allows for debugging with standard APIs, synthetic data training (SFT), and then transition to RL.
  • Engineering efforts focus on computational efficiency, async operation, and low entry barriers for research and experimentation.
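
For intuition, a rough sketch of the shape of a multi-turn environment like the Wordle example: a reset that states the task, a per-turn step that returns the environment's feedback, and a final reward. This is illustration only, not the actual verifiers API:

```python
# Illustration of a multi-turn environment's shape, not the verifiers API.
SECRET = "crane"

class WordleEnv:
    def __init__(self, max_turns=6):
        self.max_turns = max_turns

    def reset(self):
        self.turns = 0
        return ("Guess the 5-letter word. Feedback per letter: "
                "G = right spot, Y = in word, X = absent.")

    def step(self, guess):
        """One turn: score the model's guess and say whether the game is over."""
        self.turns += 1
        # Simplified feedback (ignores duplicate-letter edge cases).
        feedback = "".join(
            "G" if g == s else ("Y" if g in SECRET else "X")
            for g, s in zip(guess.lower(), SECRET)
        )
        done = feedback == "GGGGG" or self.turns >= self.max_turns
        return feedback, done

    def reward(self, solved):
        # Binary outcome reward; could be shaped by the number of turns used.
        return 1.0 if solved else 0.0
```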

Encouragement and Closing Remarks 18:28

  • Invites more people to experiment with agentic RL, as tools now make sophisticated research feasible on just a couple of GPUs.
  • Mentions ongoing improvements and accessibility, aiming to spread adoption of these techniques.
  • Session concludes without a Q&A.