SUMM

Fouad Matin introduces himself as a member of OpenAI's security team with a background in running a security startup.
He works on agent robustness and control, focusing on safety measures for code-executing AI agents.
Recently worked on Codeex and Codeex CLI, OpenAI's open source library for running code-executing agents locally.

Major research labs are emphasizing usability and deployability, not just coding benchmarks for agents.
Current trend is enabling agents to both write and execute code to achieve objectives efficiently.
Modern models (like 03 and 04 mini) are more reliable and capable compared to year-old models.
Models’ abilities create new questions about what tasks should be permitted and the necessary guardrails.
Code execution is applicable beyond typical software engineering tasks, including multimodal reasoning (such as OCR and image cropping).

Traditional models relied on complex logic loops and separate prompts for task determination and execution.
Newer approaches allow models to independently decide when and how to use tools and write/run code.
From a security perspective, these developments introduce concerns similar to remote code exploitation (RCE).
Common model failure vectors: prompt injection, data exfiltration, accidental installation of malicious/vulnerable packages, privilege escalation, and sandbox escapes.

OpenAI employs a preparedness framework, emphasizing the need for robust safeguards against agent misalignment especially at scale.
Organizations deploying coding agents must also implement similar measures.
Primary safeguard: sandboxing agents, ideally running them on isolated computers or containers for maximum safety.
When running agents locally, containerization, app-level, or OS-level sandboxing should be used as guardrails.

Disabling or limiting agent internet access is crucial to mitigate prompt injection and data leakage.
Internet-enabled agents can inadvertently process untrusted, malicious prompts, especially from sources like GitHub issues with embedded instructions.
Human review of agent operations (like GitHub PRs or approvals) is a key mitigation, ensuring humans remain in control.
Balancing operational flexibility with security: avoid over-reliance on either complete automation or constant human approval.

Best practice is giving agents their own isolated compute environment.
OpenAI’s Codeex CLI, now open source, offers reference implementations for agent sandboxing on Mac OS and Linux.
For Mac OS, sandboxing uses Apple's seatbelt framework, inspired by Chromium’s approach.
For Linux, sandboxing combines SECcomp and Landlock, developed in Rust with input from OpenAI’s security team.
The goal is to have unprivileged sandboxed environments that prevent privilege escalation.

Prompt injection is the primary risk explored when agents have internet access.
Codeex and ChatGPT now allow configurable internet access with allow-lists, including HTTP method controls and warning systems.
Example: Agents can unknowingly post sensitive data if prompted from user-generated content containing harmful instructions.
Best practice is combining model-level protections (detecting suspicious actions) with hard system-level restrictions to prevent unauthorized actions.

Human review is still essential as language models can generate large volumes of code that require oversight.
Automated code review tools and LLM-based reviews can help but do not replace manual human judgment.
Monitoring tools (e.g., operator with domain lists and action monitors) can help identify and flag sensitive or risky operations for human intervention.
The challenge remains to balance security (manual review, monitoring) and usability (automation, flexibility).

New tools like Local Shell APIs and apply patch formats assist agents in performing tasks while improving robustness (e.g., handling git diffs more reliably).
External services, like MCP for dependency vulnerability checks, can be integrated for additional safety.
Strongly recommend using remote containers for agent execution; OpenAI plans to offer container services as part of their Agents SDK and API.
Flexibility is offered between local and OpenAI-hosted environments.

Key recommendations: sandboxing agents (containers or OS-level), disabling/limiting internet, and requiring human review.
LC model monitoring is improving but not yet a substitute for deterministic system controls.
More tooling and documentation are planned for release, addressing both ML-based and system-level interventions.
OpenAI is hiring for the Agent Robustness and Control team and for Rust development on Codeex CLI.
Viewers are encouraged to participate and contribute to open source and future developments.

Safety and security for code executing agents — Fouad Matin, OpenAI (Codex, Agent Robustness)