SUMMARY

The new ChatGPT agent excels at multi-turn conversations and maintaining tasks over long periods.
A major focus is on improving agent memory and personalization, with future goals including proactive agent actions.
This episode features the OpenAI team behind the agent, discussing the leap in capabilities from unifying prior tools (deep research and operator) into a seamless system.

The agent combines text browsing, a GUI (visual) browser, terminal access (for code/data tasks), and shared tool state within a virtual computer.
It allows fluid transitions between reading text, interacting with web pages visually, executing code, and manipulating files or APIs (e.g., GitHub, Google Drive).
Users benefit from an environment similar to a real computer, offering flexibility and complex task execution.
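
The shared-state idea above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration (the VirtualComputer, TextBrowser, and Terminal classes and the in-memory file dictionary are hypothetical and do not reflect how the actual agent is built); the point is only that one tool can write a file and another can read it without any handoff.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualComputer:
    """Hypothetical shared state: one file system visible to every tool."""
    files: dict[str, str] = field(default_factory=dict)


class TextBrowser:
    def __init__(self, computer: VirtualComputer) -> None:
        self.computer = computer

    def download(self, url: str, body: str, path: str) -> None:
        # A real browser would fetch `url`; here the body is passed in directly.
        self.computer.files[path] = body


class Terminal:
    def __init__(self, computer: VirtualComputer) -> None:
        self.computer = computer

    def word_count(self, path: str) -> int:
        # Reads the file that another tool wrote to the shared state.
        return len(self.computer.files[path].split())


computer = VirtualComputer()
browser = TextBrowser(computer)
terminal = Terminal(computer)

# The browser saves a page; the terminal then processes the same file.
browser.download("https://example.com/report", "revenue grew this year", "report.txt")
count = terminal.word_count("report.txt")
```

Because both tools hold a reference to the same VirtualComputer, switching between them needs no copying or serialization step in this sketch.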

Deep research and operator (previously separate products) were merged because their strengths are complementary: the former brings efficient text reading and web interaction, the latter advanced GUI handling.
Additional tools like a terminal and image generation were integrated, resulting in a powerful multi-functional agent.
Shared state among tools enables seamless switching and complex workflows.

The agent was trained for tasks such as generating detailed research reports, booking flights, making purchases, creating slide decks, and conducting data analysis.
The design is intentionally open-ended to discover unexpected user use cases.
Both consumer and business users are targeted; early users have used it for data organization, online shopping, coding, and synthesizing emerging research.
Example: The agent estimated OpenAI’s valuation, created a financial model and projections, assembled a spreadsheet, and generated slides—completing the task in about 28 minutes.

Some tasks have run as long as an hour without errors.
The agent extends beyond previous context limits by documenting its steps, allowing for extensive, uninterrupted task completion.
Users can interact mid-task (correcting, clarifying, or requesting status updates), mirroring real-world collaboration.
Users can observe, interrupt, or take over the agent's virtual environment as needed.
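
One way to picture how documenting steps lets a task outlast a limited working context is the sketch below. It is an assumption-laden illustration, not the agent's actual mechanism: the StepLog class, its context_size parameter, and the example step names are all hypothetical. A bounded window holds only recent steps, while a separate notes list keeps the full record and can answer a mid-task status request.

```python
from collections import deque


class StepLog:
    """Hypothetical: a small working context plus a durable notes list."""

    def __init__(self, context_size: int) -> None:
        self.context = deque(maxlen=context_size)  # recent steps only
        self.notes: list[str] = []                 # record of every step

    def record(self, step: str) -> None:
        self.context.append(step)  # oldest entry is dropped when full
        self.notes.append(f"{len(self.notes) + 1}. {step}")

    def status(self) -> str:
        # A status update is built from the notes, so it still covers
        # steps that have already left the working context.
        return "\n".join(self.notes)


log = StepLog(context_size=2)
for step in ["search flights", "compare prices", "fill booking form"]:
    log.record(step)
```

After three steps the working context holds only the last two, but log.status() still lists all three.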

Training leverages reinforcement learning, allowing the model to self-discover optimal tool usage across thousands of virtual machines.
Diverse and challenging tasks are used in training, rewarding efficiency and correctness.
The model chooses when and how to switch between tools, rather than being explicitly programmed for tool selection.
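
The summary says training rewards efficiency and correctness but gives no formula. The function below is only a hypothetical shape such a reward could take, with made-up names and weights (episode_reward, step_budget, efficiency_weight): an incorrect result earns nothing, and a correct one earns a small bonus for unused steps.

```python
def episode_reward(correct: bool, steps_used: int, step_budget: int,
                   efficiency_weight: float = 0.2) -> float:
    """Hypothetical reward: correctness dominates, unused budget adds a bonus."""
    if step_budget <= 0:
        raise ValueError("step_budget must be positive")
    if not correct:
        return 0.0
    # Fraction of the step budget left unused, clamped at zero.
    unused = max(step_budget - steps_used, 0) / step_budget
    return 1.0 + efficiency_weight * unused


fast = episode_reward(correct=True, steps_used=5, step_budget=10)
slow = episode_reward(correct=True, steps_used=10, step_budget=10)
wrong = episode_reward(correct=False, steps_used=1, step_budget=10)
```

Under this sketch a fast correct episode scores higher than a slow correct one, and both score higher than any incorrect episode, so being quick never compensates for being wrong.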

Because the agent can take real-world actions with side effects, it carries more risk than prior “read-only” agents.
The agent includes robust monitoring, with layered mitigations for safety and security (e.g., anomaly detection, stopping on suspicious activity).
Ongoing internal and external red teaming addresses a range of risks, including biohazards and potential for harmful actions.
Rapid response systems are in place to update safety protocols for emerging threats.
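
The layered-mitigation idea (several independent checks, any of which can halt the agent) can be sketched as follows. The checks, the action dictionary format, and the allowlist are hypothetical examples chosen for illustration; the summary does not describe the actual monitors.

```python
from typing import Callable, Optional

Action = dict  # hypothetical action record, e.g. {"tool": "browser", "domain": "..."}
Check = Callable[[Action], Optional[str]]  # returns a reason to stop, or None


def blocks_unknown_domains(action: Action) -> Optional[str]:
    allowed = {"example.com"}  # illustrative allowlist
    if action.get("tool") == "browser" and action.get("domain") not in allowed:
        return f"unexpected domain: {action.get('domain')}"
    return None


def requires_confirmation_for_purchases(action: Action) -> Optional[str]:
    if action.get("kind") == "purchase" and not action.get("user_confirmed"):
        return "purchase without user confirmation"
    return None


def run_with_monitors(actions: list[Action],
                      checks: list[Check]) -> tuple[int, Optional[str]]:
    """Runs actions in order; stops at the first action any check flags."""
    executed = 0
    for action in actions:
        for check in checks:
            reason = check(action)
            if reason is not None:
                return executed, reason
        executed += 1  # a real system would perform the action here
    return executed, None


actions = [
    {"tool": "browser", "domain": "example.com"},
    {"tool": "browser", "domain": "example.com", "kind": "purchase"},
    {"tool": "browser", "domain": "example.com"},
]
executed, reason = run_with_monitors(
    actions, [blocks_unknown_domains, requires_confirmation_for_purchases]
)
```

Here the first action passes both checks, the unconfirmed purchase is flagged, and the third action never runs.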

Small, tightly-knit teams from deep research and operator (research and applied sides) merged for this project.
Research, engineering, and design collaborated closely, designing backward from the product ambitions.
Training stability and handling a large fleet of VMs were significant challenges, given the variety and complexity of tools involved.

Ambitions include supporting any computer task, enhancing accuracy, and expanding tool capabilities.
Continued iterative deployment will surface new user-discovered capabilities.
Ongoing development includes finer personalization, agent proactivity, improved UI/UX, and continued work on agent “memory.”
The aim is to achieve a single, generalist agent rather than many narrow sub-agents, as skills transfer across domains.
Reinforcement learning enables efficient training with smaller, high-quality curated datasets.

Advances in compute and training scale (a 100,000x increase over earlier efforts) have made previously intractable problems solvable.
The agent outperforms human baselines in certain data science evaluations, such as spreadsheet analysis.
Basic actions like online form filling and navigation have become more reliable, though some challenges (like date picking) persist.

The agent’s access to a general virtual computer enables it to address a vast array of human-computer interaction tasks.
The team foresees new paradigms for interacting with virtual assistants and is focused on making the model adept at as many computer-based tasks as possible.
There’s considerable excitement about both expanding technical capabilities and exploring new user experiences with agents.

OpenAI Just Released ChatGPT Agent, Its Most Powerful Agent Yet