Apple: “AI Can’t Think” — Are They Wrong?

Overview of Apple's Paper 00:00

  • Apple released a paper titled "The Illusion of Thinking," arguing that large reasoning models do not truly think and are not meaningfully superior to standard, non-reasoning LLMs.
  • The paper raises concerns such as data contamination in training data, suggesting that models may effectively be "cheating" to achieve their benchmark results.
  • Apple posits that these models lack generalization ability, which is critical for reasoning tasks.

Key Assertions of the Paper 01:08

  • The paper questions the true capabilities of reasoning models, suggesting they are overestimated.
  • Current evaluations focus on mathematical and coding benchmarks, but these suffer from contamination and fail to assess the quality of the reasoning process itself.
  • Apple proposes a new benchmark based on puzzles of varying difficulty to better evaluate reasoning capabilities.

Proposed Puzzle Benchmark 05:55

  • Four puzzles are introduced: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World, each of which allows complexity to be adjusted in a controlled way (see the sketch after this list).
  • The puzzles aim to avoid data contamination and emphasize algorithmic reasoning.
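
To make the "controlled complexity" idea concrete, here is a minimal sketch (my own illustration, not code from the paper or the video): Tower of Hanoi difficulty is governed by a single knob, the number of disks n, and the optimal solution length grows as 2^n − 1, so instances of any desired difficulty can be generated programmatically.

```python
# Illustrative sketch (not from the paper): Tower of Hanoi complexity scales
# with one knob, the number of disks n, which is why it lends itself to
# controlled-difficulty benchmarks.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # move n-1 disks out of the way
        + [(src, dst)]                       # move the largest disk
        + hanoi_moves(n - 1, aux, src, dst)  # move n-1 disks back on top
    )

if __name__ == "__main__":
    for n in range(1, 11):
        moves = hanoi_moves(n)
        # Optimal solution length is 2^n - 1, so difficulty grows exponentially.
        assert len(moves) == 2**n - 1
        print(f"{n} disks -> {len(moves)} moves")
```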

Experimental Findings 08:05

  • Apple finds that while both model types perform well on simple instances, thinking and non-thinking models alike break down as puzzle complexity increases.
  • Performance gaps between thinking models and non-thinking models diminish when both are given equivalent inference budgets.
  • Non-thinking models can match thinking models' performance by generating multiple candidate solutions (a best-of-N setup, sketched below).
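
The matched-budget comparison works roughly like this: instead of one long chain of thought, the non-thinking model spends the same overall token budget on several independent samples, and the task counts as solved if any candidate checks out (pass@k-style scoring). Below is a minimal sketch of that setup; `generate` and `is_correct` are hypothetical stand-ins, not the paper's evaluation harness.

```python
from typing import Callable

def best_of_n(
    generate: Callable[[str, int], str],  # hypothetical: returns one candidate under a token cap
    is_correct: Callable[[str], bool],    # hypothetical: verifies a candidate (e.g., replays the puzzle moves)
    prompt: str,
    total_budget_tokens: int,
    n_samples: int,
) -> bool:
    """Spend the same overall token budget as a 'thinking' run, split across n independent samples."""
    per_sample_budget = total_budget_tokens // n_samples
    candidates = [generate(prompt, per_sample_budget) for _ in range(n_samples)]
    # pass@n-style scoring: the task counts as solved if any candidate is valid.
    return any(is_correct(c) for c in candidates)

if __name__ == "__main__":
    import random
    # Toy demo with a dummy "model" that only sometimes emits the right answer.
    dummy_generate = lambda prompt, budget: random.choice(["wrong", "42"])
    dummy_verify = lambda candidate: candidate == "42"
    print(best_of_n(dummy_generate, dummy_verify, "solve the puzzle", 8000, 8))
```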

Limitations of Current Models 19:11

  • Despite sophisticated self-reflection, models fail to generalize reasoning beyond certain complexity levels.
  • The paper notes that their tests represent a narrow slice of reasoning tasks, potentially overlooking broader intelligence aspects.
  • The discussion raises questions about the models' abilities to write code for problem-solving, a skill that could reflect true reasoning capabilities.

Conclusion and Reflection 20:09

  • The findings suggest significant limitations in current reasoning models, yet the models still demonstrate intelligence in other domains like code generation.
  • The paper does not address the models' ability to write code that solves the puzzles, which could represent a different form of reasoning (a sketch of that approach follows below).
  • The presenter invites viewers to consider whether coding solutions should be credited as a valid form of intelligence.
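
As a point of reference for that question, a few lines of search code are enough to solve puzzles of this family exactly. The sketch below (my own illustration, not from the paper or the video) uses breadth-first search on the classic wolf-goat-cabbage crossing, a simplified stand-in for the paper's River Crossing puzzle.

```python
from collections import deque

# Each side is 0 (start bank) or 1 (far bank); state = (farmer, wolf, goat, cabbage).
ITEMS = ("farmer", "wolf", "goat", "cabbage")

def is_safe(state: tuple[int, ...]) -> bool:
    farmer, wolf, goat, cabbage = state
    if goat == wolf and farmer != goat:      # wolf eats goat if left unattended
        return False
    if goat == cabbage and farmer != goat:   # goat eats cabbage if left unattended
        return False
    return True

def solve() -> list[str]:
    start, goal = (0, 0, 0, 0), (1, 1, 1, 1)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        farmer = state[0]
        # The farmer crosses alone (i == 0) or with one item on his side.
        for i, item in enumerate(ITEMS):
            if i != 0 and state[i] != farmer:
                continue
            nxt = list(state)
            nxt[0] = 1 - farmer
            if i != 0:
                nxt[i] = 1 - state[i]
            nxt = tuple(nxt)
            if nxt in seen or not is_safe(nxt):
                continue
            seen.add(nxt)
            queue.append((nxt, path + [f"farmer crosses with {item if i else 'nothing'}"]))
    return []

if __name__ == "__main__":
    for step in solve():
        print(step)
```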