Prompting for Agents

Introduction to Prompting for Agents 00:05

  • The session shifts focus from basic prompting to prompting for AI agents.
  • Prompt engineering is described as "programming in natural language" aimed at clearly instructing agents with tasks, examples, and guidelines.
  • Traditional structured prompts for console use are not typically suitable for agents; agent prompts are more flexible to accommodate varied contexts.

Defining and Using Agents 01:19

  • An agent is a model operating in a loop, autonomously using tools to accomplish tasks and updating its actions based on tool feedback (see the loop sketch after this list).
  • The core components for agents are: the environment, available tools, and a system prompt stating the agent's objective.
  • Simple instructions usually work best, leaving the agent room to operate effectively.
  • Agents should be reserved for complex, high-value tasks that require autonomy and decision-making, not for tasks with clear, step-by-step solutions.
  • Task selection considerations include: task complexity, value of completion, tool availability, and error costs.
  • Human oversight or alternative workflows may be preferable where errors are costly or hard to detect.
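
A minimal sketch of the agent loop described above, assuming the Anthropic Messages API with tool use; the tool definition, the run_tool stub, the model name, and the turn limit are illustrative placeholders, not the session's actual implementation.

```python
import anthropic

client = anthropic.Anthropic()

# Illustrative tool definition; the name, description, and schema are placeholders.
tools = [{
    "name": "web_search",
    "description": "Search the web and return the top results as plain text.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "Search query"}},
        "required": ["query"],
    },
}]

def run_tool(name: str, tool_input: dict) -> str:
    """Hypothetical dispatcher that would execute the named tool and return its output."""
    return f"(stub result for {name} called with {tool_input})"

def run_agent(task: str, system_prompt: str, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):                      # explicit turn limit, per the guidance above
        response = client.messages.create(
            model="claude-sonnet-4-5",              # placeholder model name
            max_tokens=2048,
            system=system_prompt,                   # the agent's objective and guidelines
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":      # no more tool calls: the agent is done
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool and feed the results back into the loop.
        results = [{
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": run_tool(block.name, block.input),
        } for block in response.content if block.type == "tool_use"]
        messages.append({"role": "user", "content": results})
    return "Stopped: turn limit reached."
```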

Agent Use Case Examples 04:14

  • Coding is highlighted as a strong agent use case, especially when the path from a design document to a finished product is not fully clear.
  • Other viable examples: tasks involving web search (error recovery is possible), computer use (trial and error is acceptable), and data analysis (uncertainty in data structure or processing path).

Best Practices for Prompting Agents 06:41

  • Think like your agent: Develop a deep understanding of the agent’s environment and simulate its experience to optimize prompts.
  • Give agents reasonable heuristics and explicit guidelines, such as stopping rules or tool-use limits, to avoid unintended behaviors (e.g., infinite searching); see the system-prompt sketch after this list.
  • Treat prompt engineering as conceptual engineering: define abstract behaviors and principles rather than just textual instructions.
  • Use clear and explicit instructions for tool selection, explaining which tools apply in which contexts and why.
  • Guide the agent’s thinking process by prompting it to plan its approach up front and reflect between tool calls, particularly to handle uncertainty in intermediate results.
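
One way those heuristics and thinking instructions might be written down; the wording below is an illustrative system-prompt excerpt (with assumed tool names web_search and internal_db), not the prompt used in the session.

```python
# Illustrative system-prompt excerpt encoding heuristics, stopping rules,
# tool-selection guidance, and up-front planning; limits and wording are assumptions.
SYSTEM_PROMPT = """\
You are a research assistant. Your objective is to answer the user's question
using the tools provided.

Guidelines:
- Before calling any tool, write a short plan for how you will approach the task.
- Use web_search for public information and internal_db for customer records;
  never use web_search for data that should come from internal_db.
- After each tool call, reflect on whether the result actually answers the
  question before deciding on your next step.
- Make at most 5 web_search calls. If you still cannot find the answer, stop
  and report what you found and what is missing.
"""
```

A string like this could be passed as the system prompt in the loop sketched earlier.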

Managing Agent Autonomy and Context 12:46

  • Agents are less predictable than simple classification models; changes to prompts can lead to unexpected side effects.
  • It's critical to explicitly define stopping conditions and error-handling strategies to prevent inefficient behavior.
  • Use strategies to manage the model’s context window (e.g., summary compaction or writing memory to an external file) to support long-running autonomous tasks; a compaction sketch follows this list.
  • Consider multi-agent (sub-agent) architectures to delegate tasks and compress results, helping with context window limitations.
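
A rough sketch of the compaction idea, again assuming the Anthropic Messages API; the summarization prompt, the keep_recent threshold, and the agent_memory.md path are assumptions for illustration.

```python
import json
import anthropic

client = anthropic.Anthropic()

def compact_history(messages: list, keep_recent: int = 4) -> list:
    """Hypothetical helper: summarize older turns into a single message so the
    context window stays small, keeping only the most recent turns verbatim."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = client.messages.create(
        model="claude-sonnet-4-5",          # placeholder model name
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Summarize this agent transcript, preserving decisions made, "
                       "open questions, and any results still needed:\n\n"
                       + json.dumps(old, default=str),
        }],
    ).content[0].text
    # Optionally persist the summary as external memory for later runs.
    with open("agent_memory.md", "a") as f:
        f.write(summary + "\n")
    return [{"role": "user", "content": f"[Summary of earlier work]\n{summary}"}] + recent
```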

Tool and Prompt Design Principles 15:50

  • Test tools for clarity and functionality; tool names and descriptions must be unambiguous and easily understood (see the example definition after this list).
  • Avoid giving agents numerous similar tools, which can lead to confusion; consolidate or differentiate tools as much as possible.
  • Example scenario: An agent uses search and database tools in sequence, reflects on findings, and autonomously takes actions like generating invoices and sending emails.
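
As an example of an unambiguous tool definition in the Anthropic tools format, the dictionary below names the tool precisely and states both what it does and when to use it relative to a similar tool; the tool itself, the lookup_customer reference, and the fields are illustrative, not the scenario's real tools.

```python
# Illustrative tool definition: a specific name, a description that states what
# the tool does and when to use it instead of neighboring tools, and typed inputs.
generate_invoice_tool = {
    "name": "generate_invoice",
    "description": (
        "Create a draft invoice for an existing customer from their order records. "
        "Use this only after the customer and orders have been confirmed via "
        "lookup_customer; do not use it to modify invoices that already exist."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string", "description": "ID returned by lookup_customer"},
            "order_ids": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Orders to include on the invoice",
            },
        },
        "required": ["customer_id", "order_ids"],
    },
}
```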

Demo and Iterative Prompt Development 17:23

  • The Anthropic console can simulate agent prompts and display the agent’s decision-making in real time.
  • Start with a minimal prompt and incrementally refine it based on observed behavior and performance across test cases (an illustrative refinement follows this list).
  • Example walkthrough: The agent is tasked with estimating how many bananas fit in a Rivian R1S, autonomously searching for specs, reasoning through sub-steps, and using parallel tool calls for efficiency.
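
To illustrate the incremental approach (not the demo's actual wording), a minimal starting prompt and one refinement added after observing the agent's behavior might look like this:

```python
# Illustrative only: a minimal first prompt, then a refinement added after
# watching the agent's searches and arithmetic on a few test runs.
PROMPT_V1 = "Estimate how many bananas fit inside a Rivian R1S. Use web search for any specs you need."

PROMPT_V2 = PROMPT_V1 + (
    "\nPlan your approach before searching, run independent searches in parallel "
    "(e.g., cargo volume and banana dimensions), and show your unit conversions."
)
```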

Evaluation of Agent Performance 20:13

  • Systematic evaluation (evals) is crucial to measure and improve prompts and agent behavior.
  • Start with small, manual evals; avoid overcomplicating evaluation pipelines at the outset.
  • Use realistic, domain-relevant tasks for evaluation rather than contrived or irrelevant examples.
  • Leverage LLM-as-judge approaches with specific rubrics for flexible assessment of agent outputs (a sketch follows this list).
  • Evaluate tool use accuracy and whether the agent reaches the correct final state (e.g., database modifications) for robust benchmarking.
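
A small LLM-as-judge sketch with an explicit rubric; the rubric wording, scoring scale, and JSON output format are assumptions, not the eval used in the session.

```python
import anthropic

client = anthropic.Anthropic()

RUBRIC = """\
Grade the agent transcript below on a 1-5 scale for each criterion:
1. Tool use: did the agent pick appropriate tools for each step?
2. Final state: did it leave the environment in the correct state (e.g., the right database changes)?
3. Answer: is the final answer accurate and complete?
Reply as JSON: {"tool_use": n, "final_state": n, "answer": n, "explanation": "..."}
"""

def judge(transcript: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",      # placeholder model name
        max_tokens=512,
        messages=[{"role": "user", "content": RUBRIC + "\n\nTranscript:\n" + transcript}],
    )
    return response.content[0].text     # JSON scores to parse and aggregate across test cases
```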

Q&A: Building and Iterating on Prompts for Agents 26:09

  • Begin prompt development with short, simple instructions; only expand and add detail in response to observed failures or requirements.
  • Collect and use test cases to guide prompt refinements and address edge cases.
  • Traditional few-shot example techniques from standard prompting are less effective with state-of-the-art agents; favor concise planning and clear-thinking instructions over prescriptive, step-by-step processes.
  • Agents typically already have reasoning strategies like chain-of-thought ingrained, and benefit more from freedom within clear boundaries than from restrictive examples.