Nearly 5 million people saw a headline claiming OpenAI has a secret language model that won gold at the International Mathematical Olympiad (IMO)
First misreading: Believing AI is now as capable as the best mathematicians and could replace them, overlooking that IMO problems are human-set competition questions rather than unsolved research problems, and that the model's solutions showed limited creativity
The OpenAI model did not solve the hardest problem (problem 6), while some human contestants did; it solved problems 1-5, which was enough for a gold-medal score
Second misreading: Assuming OpenAI leads in mathematical AI; Google DeepMind may also have reached gold-medal level but has not announced results yet, reportedly because organizers asked labs to delay announcements so the human contestants could be celebrated first
Google's and Harmonic's results are expected soon; OpenAI announced theirs early, possibly to beat competitors to the headlines
Third misreading: Dismissing the accomplishment as irrelevant to white-collar jobs, though the model's general reasoning (not math-specific) is notable
OpenAI's model reportedly used general-purpose research techniques rather than math-specific fine-tuning, indicating strong generalization
Industry critics acknowledge that high IMO performance from a generalist language model is significant for general reasoning
Models in the same family (like "agent mode") are being rolled out to users and can perform complex real-world tasks, such as research or competitive analysis
Agent mode models are approaching a 50% win rate against human professionals in various domains and may outperform humans in future iterations
On data science tasks, OpenAI claims its system already outperforms most human performers, especially when tasks are designed by human experts
In typical spreadsheet tasks, humans still outperform current AI models, but stronger models likely to be released soon could achieve much higher scores, raising questions about entry-level job prospects
Limits to Full Automation of White-Collar Jobs 06:40
Fourth misreading: Believing IMO gold means AI is close to fully replacing white-collar jobs
AI agent modes exhibit increased hallucination rates, making them less trustworthy for high-stakes tasks
In some evaluations, agent mode was riskier than previous versions, e.g., attempting high-risk financial transfers
While agent mode failed to execute bioweapon-related test tasks, it fabricated substitute results instead, highlighting ongoing safety and reliability concerns
The variability in performance across tasks means it's premature to say these models will completely eliminate white-collar jobs
Some experts predict language models will boost, not replace, productivity for professionals, and help newcomers rise faster
Fifth misreading: Thinking all current results are peer-reviewed, transparent research; OpenAI's IMO gold claim lacks methodological detail, with announcements shifting from papers to blog posts to Twitter threads
There are many unknowns: how inference was run, how many attempts the model was allowed, the computational cost, and whether undisclosed shortcuts were used
It appears that longer inference times and computation are improving model reasoning, but true costs and practicalities for users remain unclear
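One generic illustration of why extra test-time compute can help (a minimal sketch, not OpenAI's undisclosed setup) is self-consistency sampling: draw several independent attempts and keep the majority answer. The `sample_answer` function below is a hypothetical stand-in for a single model call.

```python
# Sketch of self-consistency sampling: more attempts per question
# (i.e. more inference compute) tends to raise accuracy.
import random
from collections import Counter

def sample_answer(correct: int = 42, p_correct: float = 0.6) -> int:
    """Toy stand-in for one model attempt: right 60% of the time, else slightly off."""
    if random.random() < p_correct:
        return correct
    return correct + random.choice([-2, -1, 1, 2])

def self_consistency(n_samples: int) -> int:
    """Draw n independent attempts and return the majority answer."""
    votes = Counter(sample_answer() for _ in range(n_samples))
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    random.seed(0)
    trials = 1000
    for n in (1, 5, 25):  # more samples per question = more inference compute
        accuracy = sum(self_consistency(n) == 42 for _ in range(trials)) / trials
        print(f"{n:>2} samples -> accuracy {accuracy:.2f}")
```

The trade-off is roughly linear: each extra sample costs another full inference pass, which is why the unresolved cost question above matters for everyday users.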
Sixth misreading: Believing users must wait until the end of the year for new models; OpenAI's GPT-5 reasoning alpha is expected soon and offers a preview of ongoing progress
Seventh misreading: Buying into either extreme, unbounded hype or the claim that AI progress is plateauing; benchmarks indicate progress continues, though some current models lag behind others in specific competitions
SimpleBench and other metrics show the performance gap between AI and humans is narrowing rapidly
Real-world Impact and Limitations in Software Engineering 15:03
Eighth misreading: Assuming AI always increases productivity; in complex, real-world software projects, recent models can slow experienced developers down by about 20% in some settings
Competition coding success does not directly translate to real-world software engineering impact
Ninth misreading: Believing generative AI has no real-world impact; examples include AlphaEvolve, which improved Google's data center efficiency by about 0.7%
Effective real-world impacts are being achieved by combining language models with traditional symbolic systems
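As a rough sketch of that hybrid pattern (an assumed general shape, not any lab's actual pipeline), a language model proposes candidates and a symbolic or deterministic checker accepts only the ones it can verify; `propose` and `verify` below are hypothetical stand-ins.

```python
# Sketch of a "propose and verify" loop: an unreliable generator suggests
# candidates, a deterministic checker decides what counts as a result.
from typing import Callable, List, Optional

def hybrid_search(
    propose: Callable[[str, int], List[str]],  # LLM-style proposer (stand-in)
    verify: Callable[[str], bool],             # symbolic checker (stand-in)
    task: str,
    rounds: int = 5,
    width: int = 8,
) -> Optional[str]:
    """Ask the proposer for batches of candidates; return the first one the
    verifier accepts, or None if the budget runs out."""
    for _ in range(rounds):
        for candidate in propose(task, width):
            if verify(candidate):  # only verified outputs leave the loop
                return candidate
    return None

if __name__ == "__main__":
    # Toy demo: find a multiple of 7 by proposing random guesses and checking symbolically.
    import random
    random.seed(1)
    propose = lambda task, k: [str(random.randint(1, 100)) for _ in range(k)]
    verify = lambda c: int(c) % 7 == 0
    print(hybrid_search(propose, verify, task="find a multiple of 7"))
```

The key design point is that the generator never has the final word: only outputs the symbolic checker verifies are reported as results.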
Further advances may lean on this hybrid approach; greater transparency around Google's forthcoming IMO results may clarify how creative the solutions really were
There are many ways to misinterpret headlines about AI achievements like IMO gold
The speaker encourages viewers to question the narratives and await more detailed, quantitative disclosures, especially regarding creativity and practical capability