Bringing new tool use advancements to life: Claude Plays Pokemon

Introduction and Demo Start 00:06

  • The presenter, David, introduces himself as the creator of Claude Plays Pokemon.
  • The session is focused on advancements in tool use, particularly related to Claude's ability to play Pokemon.
  • He launches a live demo of Claude playing Pokemon, engaging the audience in a countdown.

Advances in Tool Use and Agent Capabilities 01:13

  • Highlights new capabilities in the models that enable improved tool use and agentic behavior.
  • Extended thinking mode allows Claude to better plan and adapt, particularly evident during challenging scenarios like the Pokemon name entry screen.
  • The model now builds comprehensive plans and can reconsider its assumptions between tool calls.
  • Parallel tool calling lets the model issue multiple tool calls in a single turn, improving efficiency over previous versions that could only call one tool at a time.
  • Taking several actions per turn makes agents faster and more responsive (a minimal request sketch follows this list).
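
To make this concrete, here is a minimal sketch of a request against the Anthropic Messages API (Python SDK) that enables extended thinking and exposes one tool. The model name, token budgets, and the press_buttons tool are placeholders for illustration, not the demo's actual configuration.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical tool for the Pokemon demo: press a sequence of Game Boy buttons.
press_buttons_tool = {
    "name": "press_buttons",
    "description": "Press a sequence of Game Boy buttons (A, B, Up, Down, Left, Right, Start, Select), in order.",
    "input_schema": {
        "type": "object",
        "properties": {
            "buttons": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Buttons to press, in order.",
            }
        },
        "required": ["buttons"],
    },
}

response = client.messages.create(
    model="claude-opus-4-20250514",                       # placeholder model name
    max_tokens=4096,                                      # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 2048},  # extended thinking
    tools=[press_buttons_tool],
    messages=[{
        "role": "user",
        "content": "You are on the name entry screen. Enter the name CLAUDE.",
    }],
)

# The reply can now interleave thinking blocks with one or more tool_use blocks.
for block in response.content:
    print(block.type)
```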

Evolution of Tool Use in AI Agents 03:13

  • Tool use has shifted from simple calculators to enabling complex agentic behaviors.
  • In agentic loops, the model plans an action, executes, learns from the outcome, and repeats until the goal is met.
  • Practical Pokemon example: press buttons, reflect on the outcome, and update the plan accordingly (see the loop sketch after this list).
  • Improved planning and acting in the new models are possible due to extended thinking and parallel tool calls.
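
A minimal plan-act-observe loop in that spirit, sketched against the Anthropic Python SDK: the model proposes tool calls, the harness executes them, reports the results, and repeats until the model stops asking for tools. The execute_tool dispatcher is hypothetical and the model name is a placeholder.

```python
def execute_tool(name: str, tool_input: dict) -> str:
    """Hypothetical dispatcher: route the call to the emulator, knowledge base, etc."""
    raise NotImplementedError


def agent_loop(client, tools, goal: str, max_turns: int = 50):
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-opus-4-20250514",  # placeholder model name
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        # Keep the assistant turn (including any tool_use blocks) in the history.
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            return response  # the model considers the goal met (or has given up)

        # Execute every requested tool call and report each result back.
        results = []
        for block in response.content:
            if block.type == "tool_use":
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": execute_tool(block.name, block.input),
                })
        messages.append({"role": "user", "content": results})
    return None  # turn budget exhausted
```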

Interleaved Thinking and Real-Time Adaptation 04:42

  • Previously, models would draft inadequate initial plans and struggle with unexpected scenarios (e.g., cursor movement on the name entry screen).
  • With extended thinking, Claude can catch and adapt to errors during tool use in real time (e.g., realizing how cursor wrapping works).
  • The model's adaptability during execution is emphasized as crucial for building robust agents.

Efficiency Gains from Parallel Tool Calling 06:21

  • Parallel tool calling mainly improves efficiency by reducing wait times between tool actions and updates.
  • The model can now perform in-game actions and update its knowledge base concurrently, streamlining agent performance (a concurrency sketch follows this list).
  • For developers, this means quicker and more effective agents, enhancing end-user experience.
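
One way to realize this on the developer side: when a single assistant turn returns several independent tool_use blocks, execute them concurrently and return all tool_result blocks in one user message. A sketch, reusing a hypothetical execute_tool dispatcher like the one in the loop sketch above:

```python
from concurrent.futures import ThreadPoolExecutor


def execute_tool(name: str, tool_input: dict) -> str:
    """Hypothetical dispatcher, as in the loop sketch above."""
    raise NotImplementedError


def run_tool_calls_concurrently(response):
    """Run all tool_use blocks from one assistant turn concurrently and
    collect their tool_result blocks for the next user message."""
    tool_blocks = [b for b in response.content if b.type == "tool_use"]
    with ThreadPoolExecutor(max_workers=max(len(tool_blocks), 1)) as pool:
        outcomes = list(pool.map(lambda b: execute_tool(b.name, b.input), tool_blocks))
    return [
        {"type": "tool_result", "tool_use_id": b.id, "content": out}
        for b, out in zip(tool_blocks, outcomes)
    ]
```

This only pays off when the calls are truly independent (e.g., pressing buttons while logging a fact); ordering-sensitive calls are better run sequentially.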

Training for Easier and Smarter Agent Use 07:13

  • Ongoing improvements aim to make Claude smarter over long tasks and easier to use as an agent.
  • Anthropic listens to developer feedback and iterates on model design, with a focus on practical features like parallel tool calling.
  • Extended thinking and usability advancements are continually integrated based on user needs.

Q&A: Tool Hierarchies and Agent Design Patterns 08:09

  • Discussion of high-level vs. low-level actions and how tools can be structured, flat vs. hierarchical (see the sketch after this list).
  • In practice, separating tool purposes and clearly defining scenarios for their use leads to better agent outcomes.
  • Observing agent struggles informs better tool and prompt design.
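
As one concrete reading of "flat vs. hierarchical", the sketch below contrasts exposing each purpose as its own narrowly scoped tool with exposing a single high-level tool whose parameter selects the action. All names and schemas are illustrative, not the tools used in the demo.

```python
# Flat: one narrowly scoped tool per purpose.
flat_tools = [
    {
        "name": "press_buttons",
        "description": "Press raw Game Boy buttons, in order. Use for fine-grained control.",
        "input_schema": {
            "type": "object",
            "properties": {"buttons": {"type": "array", "items": {"type": "string"}}},
            "required": ["buttons"],
        },
    },
    {
        "name": "update_knowledge_base",
        "description": "Record a durable fact learned about the game state.",
        "input_schema": {
            "type": "object",
            "properties": {"fact": {"type": "string"}},
            "required": ["fact"],
        },
    },
]

# Hierarchical: one high-level tool; an enum parameter selects the underlying action.
hierarchical_tool = {
    "name": "game_action",
    "description": "Perform a game action. Use 'navigate' to move to an on-screen location, 'press' for raw button input.",
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["navigate", "press"]},
            "argument": {"type": "string", "description": "Destination name or button sequence."},
        },
        "required": ["action", "argument"],
    },
}
```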

Q&A: Tool Definitions and Prompting Strategies 10:04

  • Addressing where best to define tool usage guidelines: prompt vs. tool description.
  • Both locations are effective; clarity and detailed descriptions are most important.
  • Consistent formats (like JSON Schema) can help, but the main consideration is that the model understands and applies the tool as intended (an example follows this list).
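
For example, the same usage guideline can live either in the tool description or in the system prompt. The sketch below shows both placements with a hypothetical navigate_to tool (press_buttons refers to the hypothetical tool from the earlier sketch); the model name is a placeholder.

```python
import anthropic

client = anthropic.Anthropic()

# The usage guideline can live in the tool description...
navigate_tool = {
    "name": "navigate_to",
    "description": (
        "Move the player to a named location on the current map. "
        "Use this only for locations already visible on screen; "
        "for anything else, fall back to press_buttons."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "Exact on-screen location name."}
        },
        "required": ["location"],
    },
}

# ...or, equivalently, in the system prompt. Either placement works; what matters
# is that the guidance is clear, detailed, and consistent.
system_prompt = (
    "Use navigate_to only for locations already visible on screen; "
    "otherwise use press_buttons."
)

response = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder model name
    max_tokens=1024,
    system=system_prompt,
    tools=[navigate_tool],
    messages=[{"role": "user", "content": "Walk to the Pokemon Center."}],
)
```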

Claude's In-Game Performance and Planning Improvements 13:29

  • Claude Opus shows significant improvements in planning and executing multi-step game objectives.
  • Although some visual interpretation limitations persist (e.g., game screen navigation), task planning and sustained attention have greatly improved.
  • Notable achievements include successfully accomplishing complex in-game quests over long timeframes.

Q&A: Parallel Tool Calling Implementation and Limitations 15:24

  • Parallel tool calling is not entirely novel, but is a valuable addition for practical usage.
  • The model may now return multiple tool calls in a single API call, which must then be handled by the developer's system.
  • Discussion of how excessive or poorly timed parallel actions (e.g., spamming 'A' through dialogues) can cause side effects; developers should anticipate these and mitigate them with careful prompting (a defensive-execution sketch follows this list).
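
One possible mitigation on the developer side, complementing careful prompting: execute the turn's tool calls in order and short-circuit the remainder once an earlier action changes the game state unexpectedly. The helpers below (execute_tool, game_state_changed_unexpectedly) are hypothetical.

```python
def execute_tool(name: str, tool_input: dict) -> str:
    """Hypothetical dispatcher, as in the earlier sketches."""
    raise NotImplementedError


def game_state_changed_unexpectedly() -> bool:
    """Hypothetical check, e.g. comparing the current screen against the plan."""
    raise NotImplementedError


def run_tool_calls_defensively(response):
    """Execute the turn's tool calls in order; once the game state shifts in an
    unexpected way, skip the rest and ask the model to re-plan."""
    results, skip_rest = [], False
    for block in response.content:
        if block.type != "tool_use":
            continue
        if skip_rest:
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": "Skipped: the game state changed after an earlier action; please re-plan.",
                "is_error": True,
            })
            continue
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": execute_tool(block.name, block.input),
        })
        if game_state_changed_unexpectedly():
            skip_rest = True
    return results
```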

Q&A: Handling Large Toolsets and Instruction Consistency 18:34

  • Opus models are more reliable at following complex instructions and handling large toolsets (50-100 tools or more).
  • Precise and well-considered instructions are critical, as the model will closely follow whatever it's given—even if contradictory or unclear.
  • Success with many tools depends on clear boundaries between tool definitions and accuracy in prompts; well-designed tools yield consistent results across a broad set.

Conclusion 19:54

  • The session concludes with gratitude to attendees and a note that the discussion diverged from expectations but provided valuable insights into Claude's tool use and agentic capabilities.