OpenAI’s IMO Team on Why Models Are Finally Solving Elite-Level Math

The Pace of Progress in AI Math Models 00:00

  • AI models have rapidly advanced from struggling with grade school math to achieving elite-level performance on high school olympiad problems in just a few years
  • Math benchmarks have been saturated in quick succession, from GSM8K to MATH, then AIME, and now USAMO and IMO
  • OpenAI's model has attained gold medal performance at the International Math Olympiad (IMO), considered a key milestone in AI development

Origins and Team Dynamics 02:16

  • The quest for the IMO gold has been a long-term ambition within OpenAI, with renewed focus in the last few months leading up to the competition
  • The core team consisted of just three people, but many others at OpenAI contributed in supporting roles
  • Researchers at OpenAI are empowered to pursue high-impact projects, and the IMO effort was initially driven by a new technique proposed by Alex
  • There was early skepticism about success, but promising results led to wider support

Verification and Grading of AI-Generated Proofs 05:27

  • Proofs generated by the model were often hard for humans to read, but were published in their raw form for transparency
  • To verify correctness, OpenAI hired former IMO medalists to grade each proof, requiring unanimous agreement on accuracy
  • Even OpenAI researchers found some of the proofs difficult to follow, underscoring the model's sophistication

Tackling the Hardest Problems: Problem 6 and Model Self-Awareness 07:49

  • Problem 6 at the IMO is traditionally the hardest; this year, no models, including OpenAI's, solved it
  • The model displayed self-awareness: rather than hallucinating a plausible-looking solution to a problem it couldn't solve, it responded "no answer"
  • This willingness to abstain marks an improvement over earlier models, which would fabricate convincing but incorrect answers
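The abstention behavior described above can be sketched as a simple gating pattern. This is purely illustrative, not OpenAI's actual mechanism; `solve`, `confidence`, and the threshold are hypothetical stand-ins:

```python
# Illustrative sketch of "abstain instead of hallucinate".
# `solve` and `confidence` are hypothetical stand-ins, not real OpenAI APIs.

def answer_or_abstain(problem: str, solve, confidence, threshold: float = 0.9) -> str:
    """Return a candidate solution only when confidence is high; else abstain."""
    candidate = solve(problem)
    if candidate is not None and confidence(problem, candidate) >= threshold:
        return candidate
    return "no answer"  # abstain rather than fabricate a proof

# Toy usage with stub functions standing in for the model:
easy_solver = lambda p: "proof sketch" if "easy" in p else None
always_sure = lambda p, c: 1.0
print(answer_or_abstain("easy problem", easy_solver, always_sure))
print(answer_or_abstain("hard problem", easy_solver, always_sure))
```

The key design point is that "no answer" is a first-class output, so an unverifiable guess never reaches the grader.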

Model Strengths, Limitations, and Remaining Challenges 10:09

  • While models excel at certain types of problems, like geometry and stepwise reasoning, they struggle more with abstract and high-dimensional problems such as combinatorics
  • Internal optimism about winning IMO gold was cautious; some team members put the chances at less than one in three
  • Progress in mathematical capability has moved from seconds-long problems (GSM8K) to IMO-level problems that take humans hours, but research-level problems remain orders of magnitude more challenging

Scaling Compute and General Techniques 15:16

  • Scaling up “test-time compute” (letting models think for longer) was key to success, with reasoning time per problem growing from roughly 0.1 minutes to around 100 minutes
  • As models take longer to solve problems, evaluation also takes longer, which slows progress for very long tasks
  • Multi-agent systems and parallel compute were leveraged, prioritizing general-purpose techniques over bespoke, narrow solutions
  • The same infrastructure and approach used for the IMO model are shared with other OpenAI projects, aiming for broad applicability
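The parallel-compute idea above can be sketched as a best-of-n sampler: run several attempts concurrently and keep the most promising one. This is a minimal illustration, not OpenAI's system; `sample_attempt` and its toy scoring are hypothetical:

```python
# Illustrative sketch (not OpenAI's actual system) of two ideas from this
# section: spending more test-time compute per problem, and running several
# attempts in parallel before selecting one.

from concurrent.futures import ThreadPoolExecutor

def sample_attempt(problem: str, seed: int) -> tuple[str, float]:
    """Stand-in for one model attempt; returns (candidate, self-reported score)."""
    score = ((seed * 37) % 100) / 100  # deterministic toy score, for runnability
    return f"attempt-{seed} for {problem!r}", score

def best_of_n(problem: str, n: int = 8) -> str:
    """Launch n attempts in parallel and keep the highest-scoring candidate."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(lambda s: sample_attempt(problem, s), range(n)))
    candidate, _ = max(results, key=lambda r: r[1])
    return candidate
```

In a real system the scoring step is the hard part; the scaffolding itself (sample in parallel, select) is deliberately general-purpose rather than math-specific, matching the section's emphasis.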

Formal vs. Informal Reasoning, and Use of Natural Language 17:53

  • Unlike the IMO’s official AI track, which used Lean for formal proof verification, OpenAI prioritized natural language and informal reasoning for greater generality
  • Formal and informal methods are seen as complementary rather than directly competing, with informal, natural-language reasoning applying to a broader range of problems
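For contrast with the natural-language approach, a formal proof in Lean is fully machine-checked at every step. A toy Lean 4 example (this specific theorem is my illustration, not from the talk):

```lean
-- In a formal system like Lean, the checker verifies every inference;
-- there is no prose for a grader to interpret.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The trade-off the section describes: formal proofs give certainty but require everything to be stated in the proof assistant's language, while informal proofs generalize to problems no one has formalized.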

The IMO Competition Experience 20:55

  • Problems were fed to the model upon release, with the team monitoring progress overnight and manually checking proofs before sending them to graders
  • The model could express confidence or uncertainty in its proofs, providing hints about its internal "feelings" about progress

Beyond Competition Math: Next Steps and Frontiers 23:46

  • OpenAI’s model performs even better on the Putnam exam (more knowledge-heavy, less time per problem) than on IMO problems
  • The next frontier is solving problems that require much longer reasoning, akin to research-level tasks (hundreds to thousands of hours of human effort)
  • Work remains: generating novel problems is still a human-intensive task, but the team sees no fundamental barriers to AI eventually doing this as well

Generalization and Future Applications 26:04

  • The focus was on developing general-purpose techniques, with the expectation that these will improve AI reasoning in fields beyond math
  • Incorporation of these advances into broader OpenAI models is ongoing and will take more time
  • Physics Olympiad presents additional challenges due to experimental tasks, which current models cannot yet handle

Model Release and Collaboration with Mathematicians 28:03

  • OpenAI aims to make the math-capable model accessible to mathematicians, but details are still being worked out
  • Ongoing dialogue with researchers tests the model on unsolved math problems, with the model’s growing ability to admit uncertainty considered a meaningful milestone

Reflections and Closing Remarks 29:20

  • The achievement of IMO gold by a three-person core team in two months is highlighted as extraordinary
  • The team expresses excitement about the potential for further advances and broader application of their techniques