François Chollet: The ARC Prize & How We Get to AGI

The Scaling Paradigm and Its Limitations 00:00

  • The cost of computing has fallen dramatically and consistently since the 1940s, enabling rapid AI progress.
  • The 2010s saw deep learning thrive thanks to fast, cheap GPUs and large data sets.
  • AI’s dominant paradigm became scaling up model and data size, with larger models achieving predictably better benchmarks.
  • Many believed that further scaling alone would automatically lead to artificial general intelligence (AGI).
  • However, benchmarks mostly measured static, memorized skills—very different from true, adaptive intelligence.
  • Chollet introduced the Abstraction and Reasoning Corpus (ARC) in 2019 to highlight these limits, showing human performance far surpasses scaled models.

Transition to Test-Time Adaptation 03:00

  • In 2024, AI research shifted focus to test-time adaptation, where models adjust their behavior dynamically based on the problems they encounter during inference (a minimal sketch follows this list).
  • Notable progress on ARC was observed with these new dynamic approaches.
  • OpenAI’s o3 model, fine-tuned on ARC, achieved human-level performance, demonstrating fluid intelligence on this benchmark for the first time.
  • The AI field has now moved away from static pre-training and into the era of test-time adaptation.
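
To make the test-time adaptation bullet concrete, here is a minimal sketch under assumed, generic conditions (not OpenAI's or any specific lab's method): a copy of a pretrained model takes a few gradient steps on a task's own demonstration pairs before predicting the test output. The model, tensors, and hyperparameters are placeholders.

```python
# Minimal sketch of test-time adaptation: briefly fine-tune a copy of the
# model on the current task's demonstration pairs, then answer the test input.
import copy
import torch
from torch import nn


def adapt_and_predict(model: nn.Module,
                      demo_inputs: torch.Tensor, demo_outputs: torch.Tensor,
                      test_input: torch.Tensor,
                      steps: int = 20, lr: float = 1e-3) -> torch.Tensor:
    adapted = copy.deepcopy(model)              # never mutate the base model
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    adapted.train()
    for _ in range(steps):                      # a few gradient steps on the demos only
        optimizer.zero_grad()
        loss = loss_fn(adapted(demo_inputs), demo_outputs)
        loss.backward()
        optimizer.step()
    adapted.eval()
    with torch.no_grad():
        return adapted(test_input)


# Toy usage with a stand-in model and random tensors:
model = nn.Linear(4, 4)
demo_x, demo_y = torch.randn(8, 4), torch.randn(8, 4)
print(adapt_and_predict(model, demo_x, demo_y, torch.randn(1, 4)))
```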

Rethinking Intelligence and AGI 05:02

  • Two perspectives on intelligence: Minsky’s "task automation" view (AGI as performing 80% of human economic tasks) and the "fluid problem-solving" view (ability to tackle truly novel problems).
  • Chollet favors defining intelligence as the efficiency with which systems use past experience to handle future novelty.
  • Benchmarks that resemble human exams mostly measure rote skills, not true adaptive intelligence.
  • Intelligence involves operational area (the range of contexts a skill applies to) and information efficiency (how much practice or data is needed to acquire a skill); a toy illustration follows this list.
  • The way intelligence is measured deeply affects research direction—a poorly chosen metric can lead to superficial targets and "missing the point."
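
As a toy illustration of the information-efficiency framing (not a formal definition), the sketch below treats intelligence as skill gained per unit of prior knowledge and practice data consumed. All names and quantities are hypothetical.

```python
# Toy illustration: skill-acquisition efficiency as skill gained per bit of
# information (priors plus experience) consumed to acquire it.
from dataclasses import dataclass


@dataclass
class LearningEpisode:
    skill_gained: float      # e.g. score improvement on held-out novel tasks
    priors_bits: float       # knowledge built in before the episode (assumed measurable)
    experience_bits: float   # information contained in the practice data


def skill_acquisition_efficiency(episode: LearningEpisode) -> float:
    """Higher is better: more skill per bit of information consumed."""
    return episode.skill_gained / (episode.priors_bits + episode.experience_bits)


# Two hypothetical systems reaching the same skill level:
memorizer = LearningEpisode(skill_gained=0.9, priors_bits=1e9, experience_bits=1e12)
adapter = LearningEpisode(skill_gained=0.9, priors_bits=1e6, experience_bits=1e4)

print(skill_acquisition_efficiency(memorizer))  # tiny: brute-force memorization
print(skill_acquisition_efficiency(adapter))    # much larger: fluid, data-efficient learning
```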

The ARC Benchmarks: Purpose and Evolution 11:49

  • ARC1, launched in 2019, is an “IQ test” for machines and humans with 1,000 unique tasks, solvable only via on-the-fly reasoning rather than memorization (the task format is sketched after this list).
  • ARC tasks require only core knowledge (like basic geometry and counting), easily accessible to young children but hard for AI.
  • ARC was not designed as a pass/fail test for AGI, but as a tool to highlight bottlenecks in achieving true intelligence in AI systems.
  • Despite vast compute scaling, pre-trained models made minimal ARC progress—demonstrating that scaling alone is inadequate.
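
For concreteness, the sketch below shows the structure of a public ARC task: a JSON file with "train" demonstration pairs and "test" pairs, where each grid is a list of rows of integers 0–9. The file path and the identity-transform solver are placeholders; a real solver must infer the transformation rule from the few demonstrations on the fly.

```python
# Minimal sketch of loading and scoring one ARC task (public JSON format).
import json
from typing import List

Grid = List[List[int]]


def solve(grid: Grid) -> Grid:
    # Placeholder hypothesis: identity transform. A real solver would infer
    # the rule from the demonstration pairs at test time.
    return grid


def score_task(path: str) -> bool:
    with open(path) as f:
        task = json.load(f)
    # The hypothesis must explain every demonstration pair...
    if not all(solve(pair["input"]) == pair["output"] for pair in task["train"]):
        return False
    # ...and then generalize to the held-out test input(s).
    return all(solve(pair["input"]) == pair["output"] for pair in task["test"])


# Example usage (path is hypothetical):
# print(score_task("arc_tasks/0a1b2c3d.json"))
```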

ARC2 and Beyond: More Sensitive Measures 15:49

  • ARC1 behaves almost like a binary test: once a system gains some fluid intelligence, scores jump close to the maximum, leaving little granularity for judging further progress.
  • ARC2, released March 2025, increases complexity and compositional reasoning demands, remaining solvable by untrained humans, as confirmed via real-world testing.
  • Baseline large language models and even static reasoning systems score near zero on ARC2; only systems using test-time adaptation show marginal improvement, and they remain far below human level.
  • The fact that it is still easy to create tasks that are simple for humans yet stump AI is evidence that AGI has not been achieved.
  • ARC3, expected in early 2026, will introduce interactive environments, requiring agentic goal-setting, exploration, and learning, with an emphasis on efficiency of actions.

The Foundations of Intelligence: Abstractions and Recombination 20:26

  • True novelty is rare; most real-world situations recombine a modest set of “atoms of meaning”—abstractions.
  • Intelligence is the ability to extract reusable abstractions from past experience and recombine them on the fly for new tasks.
  • Intelligence is distinguished not just by what is achievable, but by the efficiency in learning and deploying abstractions (in both data and compute).

Type 1 and Type 2 Abstractions 25:00

  • Two forms of abstraction:
    • Type 1: Value-centric (continuous), central for perception, intuition, and modern machine learning.
    • Type 2: Program-centric (discrete), central for human reasoning, logic, and discrete tasks like code manipulation.
  • Transformers excel at Type 1, but struggle with discrete reasoning tasks (Type 2), such as sorting or algorithmic computation.
  • Discrete program search enables invention and creativity, relying on combinatorial pattern search rather than simple interpolation.
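
A minimal sketch of the Type 2, program-centric mode: brute-force search over compositions drawn from a tiny hypothetical DSL of grid operations, accepting only a program that reproduces every demonstration pair exactly. The primitives and the example task are illustrative, not taken from ARC.

```python
# Brute-force discrete program search over a toy DSL of grid operations.
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

# Hypothetical DSL primitives (illustrative only).
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": lambda g: g,
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "transpose": lambda g: [list(row) for row in zip(*g)],
}


def run_program(names: Tuple[str, ...], grid: Grid) -> Grid:
    for name in names:
        grid = PRIMITIVES[name](grid)
    return grid


def search_program(pairs: List[Tuple[Grid, Grid]], max_depth: int = 3) -> Optional[Tuple[str, ...]]:
    """Return the first composition of primitives that explains every pair exactly."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            if all(run_program(names, inp) == out for inp, out in pairs):
                return names  # a discrete, exact solution: no interpolation involved
    return None


demo_pairs = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
print(search_program(demo_pairs))  # ('flip_h',)
```

Note how the cost of this search grows combinatorially with program depth, which is exactly the explosion discussed in the next section.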

Combining Type 1 and Type 2 for Human-like Reasoning 29:33

  • Human intelligence combines both forms: pattern recognition (Type 1) to narrow options, then explicit reasoning (Type 2) to analyze them.
  • The challenge is managing combinatorial “search explosion” when trying to synthesize new programs; intuition-based heuristics (from Type 1) can guide efficient search.
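
A sketch of that combination, under the same toy-DSL assumptions as the previous block: a beam search in which a crude distance function stands in for a learned Type 1 "intuition" model, ranking partial programs and pruning the combinatorial explosion, while an exact-match check remains the Type 2 acceptance criterion.

```python
# Beam search over the toy DSL, guided by a stand-in for a learned heuristic.
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "transpose": lambda g: [list(row) for row in zip(*g)],
}


def run_program(names: Tuple[str, ...], grid: Grid) -> Grid:
    for name in names:
        grid = PRIMITIVES[name](grid)
    return grid


def grid_distance(a: Grid, b: Grid) -> float:
    """Crude mismatch count standing in for a learned (Type 1) value model."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return float("inf")
    return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def guided_search(pairs: List[Tuple[Grid, Grid]],
                  beam_width: int = 2, max_depth: int = 4) -> Optional[Tuple[str, ...]]:
    beam: List[Tuple[str, ...]] = [()]  # partial programs
    for _ in range(max_depth):
        candidates = [prog + (name,) for prog in beam for name in PRIMITIVES]
        for prog in candidates:  # Type 2: accept only an exact solution
            if all(run_program(prog, inp) == out for inp, out in pairs):
                return prog
        # Type 1 guidance: keep only the candidates closest to the targets.
        candidates.sort(key=lambda p: sum(grid_distance(run_program(p, i), o) for i, o in pairs))
        beam = candidates[:beam_width]
    return None


demo_pairs = [([[1, 2], [3, 4]], [[4, 3], [2, 1]])]  # target is a 180-degree rotation
print(guided_search(demo_pairs))  # ('flip_h', 'flip_v')
```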

Programmer-like Meta-Learners: The Path Forward 31:53

  • Future AI should function as programmer-like systems that approach new tasks by synthesizing custom programs.
  • This involves a global, evolving abstraction library that updates as new problems are solved, much like a software engineer growing their toolkit and sharing on GitHub.
  • AI would blend deep learning (for intuitive, Type 1 sub-problems) and algorithmic reasoning (for Type 2 sub-problems), guided by deep learning-fueled search through the space of solutions.
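
A sketch of the abstraction-library idea, with all names hypothetical: once a search discovers a working composition, it is registered as a new first-class primitive, so later tasks that reuse the same pattern require much shallower searches.

```python
# Toy evolving abstraction library: solved compositions become new primitives.
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]


class AbstractionLibrary:
    def __init__(self) -> None:
        # Seed with core primitives; the library grows as tasks are solved.
        self.primitives: Dict[str, Callable[[Grid], Grid]] = {
            "flip_h": lambda g: [row[::-1] for row in g],
            "flip_v": lambda g: g[::-1],
        }

    def compose(self, names: Tuple[str, ...]) -> Callable[[Grid], Grid]:
        def program(grid: Grid) -> Grid:
            for name in names:
                grid = self.primitives[name](grid)
            return grid
        return program

    def register(self, new_name: str, names: Tuple[str, ...]) -> None:
        """Promote a discovered composition to a reusable, first-class abstraction."""
        self.primitives[new_name] = self.compose(names)


library = AbstractionLibrary()
# Suppose a search (like the ones sketched earlier) found this solution:
library.register("rotate_180", ("flip_h", "flip_v"))
# Future tasks can now use "rotate_180" as a single step in much shallower searches.
print(library.compose(("rotate_180",))([[1, 2], [3, 4]]))  # [[4, 3], [2, 1]]
```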

The Vision for Scientific Discovery and AGI 33:44

  • Chollet’s new research lab, Ndea, aims to build AI that can invent and discover independently, not just automate known tasks.
  • Deep learning alone is powerful for automation, but scientific discovery requires more: meta-learner systems that use program search guided by intuition.
  • The ultimate goal is AI that can quickly assemble solutions for novel problems and accelerate scientific and technological progress.