François Chollet: The ARC Prize & How We Get to AGI

The Scaling Paradigm and Its Limitations 00:00

  • The cost of computing has fallen dramatically and consistently since the 1940s, enabling rapid AI progress.
  • The 2010s saw deep learning thrive thanks to fast, cheap GPUs and large data sets.
  • AI’s dominant paradigm became scaling up model and data size, with larger models achieving predictably better benchmarks.
  • Many believed that further scaling alone would automatically lead to artificial general intelligence (AGI).
  • However, benchmarks mostly measured static, memorized skills—very different from true, adaptive intelligence.
  • Chollet introduced the Abstraction and Reasoning Corpus (ARC) in 2019 to highlight these limits, showing human performance far surpasses scaled models.

Transition to Test-Time Adaptation 03:00

  • In 2024, AI research shifted focus to test-time adaptation, where models adjust their behavior dynamically based on the problems they encounter during inference (a minimal sketch follows this list).
  • Notable progress on ARC was observed with these new dynamic approaches.
  • OpenAI’s o3 model, fine-tuned on ARC, achieved human-level performance, demonstrating fluid intelligence on this benchmark for the first time.
  • The AI field has now moved away from static pre-training and into the era of test-time adaptation.
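
To make the test-time adaptation bullet concrete, here is a minimal sketch under assumed, generic conditions (not OpenAI's or any specific lab's method): a copy of a pretrained model takes a few gradient steps on a task's own demonstration pairs before predicting the test output. The model, tensors, and hyperparameters are placeholders.

```python
# Minimal sketch of test-time adaptation: briefly fine-tune a copy of the
# model on the current task's demonstration pairs, then answer the test input.
import copy
import torch
from torch import nn


def adapt_and_predict(model: nn.Module,
                      demo_inputs: torch.Tensor, demo_outputs: torch.Tensor,
                      test_input: torch.Tensor,
                      steps: int = 20, lr: float = 1e-3) -> torch.Tensor:
    adapted = copy.deepcopy(model)              # never mutate the base model
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    adapted.train()
    for _ in range(steps):                      # a few gradient steps on the demos only
        optimizer.zero_grad()
        loss = loss_fn(adapted(demo_inputs), demo_outputs)
        loss.backward()
        optimizer.step()
    adapted.eval()
    with torch.no_grad():
        return adapted(test_input)


# Toy usage with a stand-in model and random tensors:
model = nn.Linear(4, 4)
demo_x, demo_y = torch.randn(8, 4), torch.randn(8, 4)
print(adapt_and_predict(model, demo_x, demo_y, torch.randn(1, 4)))
```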

Rethinking Intelligence and AGI 05:02

  • Two perspectives on intelligence: Minsky’s "task automation" view (AGI as performing 80% of human economic tasks) and the "fluid problem-solving" view (ability to tackle truly novel problems).
  • Chollet favors defining intelligence as the efficiency with which systems use past experience to handle future novelty.
  • Benchmarks that resemble human exams mostly measure rote skills, not true adaptive intelligence.
  • Intelligence involves operational area (the range of contexts a skill applies to) and information efficiency (how much practice or data is needed to acquire a skill); a toy illustration follows this list.
  • The way intelligence is measured deeply affects research direction—a poorly chosen metric can lead to superficial targets and "missing the point."
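
As a toy illustration of the information-efficiency framing (not a formal definition), the sketch below treats intelligence as skill gained per unit of prior knowledge and practice data consumed. All names and quantities are hypothetical.

```python
# Toy illustration: skill-acquisition efficiency as skill gained per bit of
# information (priors plus experience) consumed to acquire it.
from dataclasses import dataclass


@dataclass
class LearningEpisode:
    skill_gained: float      # e.g. score improvement on held-out novel tasks
    priors_bits: float       # knowledge built in before the episode (assumed measurable)
    experience_bits: float   # information contained in the practice data


def skill_acquisition_efficiency(episode: LearningEpisode) -> float:
    """Higher is better: more skill per bit of information consumed."""
    return episode.skill_gained / (episode.priors_bits + episode.experience_bits)


# Two hypothetical systems reaching the same skill level:
memorizer = LearningEpisode(skill_gained=0.9, priors_bits=1e9, experience_bits=1e12)
adapter = LearningEpisode(skill_gained=0.9, priors_bits=1e6, experience_bits=1e4)

print(skill_acquisition_efficiency(memorizer))  # tiny: brute-force memorization
print(skill_acquisition_efficiency(adapter))    # much larger: fluid, data-efficient learning
```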

The ARC Benchmarks: Purpose and Evolution 11:49

  • ARC1, launched in 2019, is an “IQ test” for machines and humans with 1,000 unique tasks, solvable only via on-the-fly reasoning rather than memorization (the task format is sketched after this list).
  • ARC tasks require only core knowledge (like basic geometry and counting), easily accessible to young children but hard for AI.
  • ARC was not designed as a pass/fail test for AGI, but as a tool to highlight bottlenecks in achieving true intelligence in AI systems.
  • Despite vast compute scaling, pre-trained models made minimal ARC progress—demonstrating that scaling alone is inadequate.
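
For concreteness, the sketch below shows the structure of a public ARC task: a JSON file with "train" demonstration pairs and "test" pairs, where each grid is a list of rows of integers 0–9. The file path and the identity-transform solver are placeholders; a real solver must infer the transformation rule from the few demonstrations on the fly.

```python
# Minimal sketch of loading and scoring one ARC task (public JSON format).
import json
from typing import List

Grid = List[List[int]]


def solve(grid: Grid) -> Grid:
    # Placeholder hypothesis: identity transform. A real solver would infer
    # the rule from the demonstration pairs at test time.
    return grid


def score_task(path: str) -> bool:
    with open(path) as f:
        task = json.load(f)
    # The hypothesis must explain every demonstration pair...
    if not all(solve(pair["input"]) == pair["output"] for pair in task["train"]):
        return False
    # ...and then generalize to the held-out test input(s).
    return all(solve(pair["input"]) == pair["output"] for pair in task["test"])


# Example usage (path is hypothetical):
# print(score_task("arc_tasks/0a1b2c3d.json"))
```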

ARC2 and Beyond: More Sensitive Measures 15:49

  • ARC1 behaves almost like a binary test: once a system gains some fluid intelligence, scores jump close to the maximum, leaving little granularity for judging further progress.
  • ARC2, released March 2025, increases complexity and compositional reasoning demands, remaining solvable by untrained humans, as confirmed via real-world testing.
  • Baseline large language models and even static reasoning systems score near zero on ARC2; only systems using test-time adaptation show marginal improvement, and they remain far below human level.
  • The fact that it is still easy to create tasks that are simple for humans yet stump AI is evidence that AGI has not been achieved.
  • ARC3, expected in early 2026, will introduce interactive environments, requiring agentic goal-setting, exploration, and learning, with an emphasis on efficiency of actions.

The Foundations of Intelligence: Abstractions and Recombination 20:26

  • True novelty is rare; most real-world situations recombine a modest set of “atoms of meaning”—abstractions.
  • Intelligence is the ability to extract reusable abstractions from past experience and recombine them on the fly for new tasks.
  • Intelligence is distinguished not just by what is achievable, but by the efficiency in learning and deploying abstractions (in both data and compute).

Type 1 and Type 2 Abstractions 25:00

  • Two forms of abstraction:
    • Type 1: Value-centric (continuous), central for perception, intuition, and modern machine learning.
    • Type 2: Program-centric (discrete), central for human reasoning, logic, and discrete tasks like code manipulation.
  • Transformers excel at Type 1, but struggle with discrete reasoning tasks (Type 2), such as sorting or algorithmic computation.
  • Discrete program search enables invention and creativity, relying on combinatorial pattern search rather than simple interpolation.
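
A minimal sketch of the Type 2, program-centric mode: brute-force search over compositions drawn from a tiny hypothetical DSL of grid operations, accepting only a program that reproduces every demonstration pair exactly. The primitives and the example task are illustrative, not taken from ARC.

```python
# Brute-force discrete program search over a toy DSL of grid operations.
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

# Hypothetical DSL primitives (illustrative only).
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": lambda g: g,
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "transpose": lambda g: [list(row) for row in zip(*g)],
}


def run_program(names: Tuple[str, ...], grid: Grid) -> Grid:
    for name in names:
        grid = PRIMITIVES[name](grid)
    return grid


def search_program(pairs: List[Tuple[Grid, Grid]], max_depth: int = 3) -> Optional[Tuple[str, ...]]:
    """Return the first composition of primitives that explains every pair exactly."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            if all(run_program(names, inp) == out for inp, out in pairs):
                return names  # a discrete, exact solution: no interpolation involved
    return None


demo_pairs = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
print(search_program(demo_pairs))  # ('flip_h',)
```

Note how the cost of this search grows combinatorially with program depth, which is exactly the explosion discussed in the next section.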

Combining Type 1 and Type 2 for Human-like Reasoning 29:33

  • Human intelligence combines both forms: pattern recognition (Type 1) to narrow options, then explicit reasoning (Type 2) to analyze them.
  • The challenge is managing combinatorial “search explosion” when trying to synthesize new programs; intuition-based heuristics (from Type 1) can guide efficient search.
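
A sketch of that combination, under the same toy-DSL assumptions as the previous block: a beam search in which a crude distance function stands in for a learned Type 1 "intuition" model, ranking partial programs and pruning the combinatorial explosion, while an exact-match check remains the Type 2 acceptance criterion.

```python
# Beam search over the toy DSL, guided by a stand-in for a learned heuristic.
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "transpose": lambda g: [list(row) for row in zip(*g)],
}


def run_program(names: Tuple[str, ...], grid: Grid) -> Grid:
    for name in names:
        grid = PRIMITIVES[name](grid)
    return grid


def grid_distance(a: Grid, b: Grid) -> float:
    """Crude mismatch count standing in for a learned (Type 1) value model."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return float("inf")
    return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def guided_search(pairs: List[Tuple[Grid, Grid]],
                  beam_width: int = 2, max_depth: int = 4) -> Optional[Tuple[str, ...]]:
    beam: List[Tuple[str, ...]] = [()]  # partial programs
    for _ in range(max_depth):
        candidates = [prog + (name,) for prog in beam for name in PRIMITIVES]
        for prog in candidates:  # Type 2: accept only an exact solution
            if all(run_program(prog, inp) == out for inp, out in pairs):
                return prog
        # Type 1 guidance: keep only the candidates closest to the targets.
        candidates.sort(key=lambda p: sum(grid_distance(run_program(p, i), o) for i, o in pairs))
        beam = candidates[:beam_width]
    return None


demo_pairs = [([[1, 2], [3, 4]], [[4, 3], [2, 1]])]  # target is a 180-degree rotation
print(guided_search(demo_pairs))  # ('flip_h', 'flip_v')
```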

Programmer-like Meta-Learners: The Path Forward 31:53

  • Future AI should function as programmer-like systems that approach new tasks by synthesizing custom programs.
  • This involves a global, evolving abstraction library that updates as new problems are solved, much like a software engineer growing their toolkit and sharing on GitHub.
  • AI would blend deep learning (for intuitive, Type 1 sub-problems) and algorithmic reasoning (for Type 2 sub-problems), guided by deep learning-fueled search through the space of solutions.
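
A sketch of the abstraction-library idea, with all names hypothetical: once a search discovers a working composition, it is registered as a new first-class primitive, so later tasks that reuse the same pattern require much shallower searches.

```python
# Toy evolving abstraction library: solved compositions become new primitives.
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]


class AbstractionLibrary:
    def __init__(self) -> None:
        # Seed with core primitives; the library grows as tasks are solved.
        self.primitives: Dict[str, Callable[[Grid], Grid]] = {
            "flip_h": lambda g: [row[::-1] for row in g],
            "flip_v": lambda g: g[::-1],
        }

    def compose(self, names: Tuple[str, ...]) -> Callable[[Grid], Grid]:
        def program(grid: Grid) -> Grid:
            for name in names:
                grid = self.primitives[name](grid)
            return grid
        return program

    def register(self, new_name: str, names: Tuple[str, ...]) -> None:
        """Promote a discovered composition to a reusable, first-class abstraction."""
        self.primitives[new_name] = self.compose(names)


library = AbstractionLibrary()
# Suppose a search (like the ones sketched earlier) found this solution:
library.register("rotate_180", ("flip_h", "flip_v"))
# Future tasks can now use "rotate_180" as a single step in much shallower searches.
print(library.compose(("rotate_180",))([[1, 2], [3, 4]]))  # [[4, 3], [2, 1]]
```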

The Vision for Scientific Discovery and AGI 33:44

  • Chollet’s new research lab, Ndea, aims to build AI that can invent and discover independently, not just automate known tasks.
  • Deep learning alone is powerful for automation, but scientific discovery requires more: meta-learner systems that use program search guided by intuition.
  • The ultimate goal is AI that can quickly assemble solutions for novel problems and accelerate scientific and technological progress.