Safety and security for code executing agents — Fouad Matin, OpenAI (Codex, Agent Robustness)

Introduction and Context 00:03

  • Fouad Matin introduces himself as a member of OpenAI's security team with a background in running a security startup.
  • He works on agent robustness and control, focusing on safety measures for code-executing AI agents.
  • Recently worked on Codeex and Codeex CLI, OpenAI's open source library for running code-executing agents locally.

The Rise of Code-Executing Agents 00:54

  • Major research labs are emphasizing usability and deployability, not just coding benchmarks for agents.
  • Current trend is enabling agents to both write and execute code to achieve objectives efficiently.
  • Modern models (like 03 and 04 mini) are more reliable and capable compared to year-old models.
  • Models’ abilities create new questions about what tasks should be permitted and the necessary guardrails.
  • Code execution is applicable beyond typical software engineering tasks, including multimodal reasoning (such as OCR and image cropping).

Architectures and Security Risks 02:34

  • Traditional models relied on complex logic loops and separate prompts for task determination and execution.
  • Newer approaches allow models to independently decide when and how to use tools and write/run code.
  • From a security perspective, these developments introduce concerns similar to remote code exploitation (RCE).
  • Common model failure vectors: prompt injection, data exfiltration, accidental installation of malicious/vulnerable packages, privilege escalation, and sandbox escapes.

Safeguarding Code-Executing Agents 04:01

  • OpenAI employs a preparedness framework, emphasizing the need for robust safeguards against agent misalignment especially at scale.
  • Organizations deploying coding agents must also implement similar measures.
  • Primary safeguard: sandboxing agents, ideally running them on isolated computers or containers for maximum safety.
  • When running agents locally, containerization, app-level, or OS-level sandboxing should be used as guardrails.

Limiting Internet Access and Reviewing Operations 05:02

  • Disabling or limiting agent internet access is crucial to mitigate prompt injection and data leakage.
  • Internet-enabled agents can inadvertently process untrusted, malicious prompts, especially from sources like GitHub issues with embedded instructions.
  • Human review of agent operations (like GitHub PRs or approvals) is a key mitigation, ensuring humans remain in control.
  • Balancing operational flexibility with security: avoid over-reliance on either complete automation or constant human approval.

Technical Implementation: Sandboxing and Policies 06:05

  • Best practice is giving agents their own isolated compute environment.
  • OpenAI’s Codeex CLI, now open source, offers reference implementations for agent sandboxing on Mac OS and Linux.
  • For Mac OS, sandboxing uses Apple's seatbelt framework, inspired by Chromium’s approach.
  • For Linux, sandboxing combines SECcomp and Landlock, developed in Rust with input from OpenAI’s security team.
  • The goal is to have unprivileged sandboxed environments that prevent privilege escalation.

Internet Access Controls and Risks 07:45

  • Prompt injection is the primary risk explored when agents have internet access.
  • Codeex and ChatGPT now allow configurable internet access with allow-lists, including HTTP method controls and warning systems.
  • Example: Agents can unknowingly post sensitive data if prompted from user-generated content containing harmful instructions.
  • Best practice is combining model-level protections (detecting suspicious actions) with hard system-level restrictions to prevent unauthorized actions.

Human Review and Monitoring 09:49

  • Human review is still essential as language models can generate large volumes of code that require oversight.
  • Automated code review tools and LLM-based reviews can help but do not replace manual human judgment.
  • Monitoring tools (e.g., operator with domain lists and action monitors) can help identify and flag sensitive or risky operations for human intervention.
  • The challenge remains to balance security (manual review, monitoring) and usability (automation, flexibility).

Tools and Future Directions 11:05

  • New tools like Local Shell APIs and apply patch formats assist agents in performing tasks while improving robustness (e.g., handling git diffs more reliably).
  • External services, like MCP for dependency vulnerability checks, can be integrated for additional safety.
  • Strongly recommend using remote containers for agent execution; OpenAI plans to offer container services as part of their Agents SDK and API.
  • Flexibility is offered between local and OpenAI-hosted environments.

Recap and Looking Forward 12:29

  • Key recommendations: sandboxing agents (containers or OS-level), disabling/limiting internet, and requiring human review.
  • LC model monitoring is improving but not yet a substitute for deterministic system controls.
  • More tooling and documentation are planned for release, addressing both ML-based and system-level interventions.
  • OpenAI is hiring for the Agent Robustness and Control team and for Rust development on Codeex CLI.
  • Viewers are encouraged to participate and contribute to open source and future developments.