The Illusion of Thinking
- The Illusion of Thinking
- New Apple study challenges whether AI models truly “reason” through problems
- What Apple’s controversial research paper really tells us about LLMs
- The Illusion of Thinking: Why Today’s AI Models Aren’t Really Reasoning
- Reasoning vs. Thinking — A Subtle but Crucial Difference
- Large Reasoning Models: The New Frontier
- The Classroom Analogy: Two Students
- The Test: Puzzles That Scale in Difficulty
- The Results: Three Regimes of Complexity
- The Strange Behaviors Inside the Thoughts
- The Illusion of Thinking
- Why This Matters
The Illusion of Thinking
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity https://machinelearning.apple.com/research/illusion-of-thinking
- The white paper: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
New Apple study challenges whether AI models truly “reason” through problems
Puzzle-based experiments reveal limitations of simulated reasoning, but others dispute findings.
Benj Edwards – Jun 11, 2025 5:56 PM
In early June, Apple researchers released a study suggesting that simulated reasoning (SR) models, such as OpenAI’s o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking, produce outputs consistent with pattern-matching from training data when faced with novel problems requiring systematic thinking. The researchers found results similar to those of an April study built around problems from the United States of America Mathematical Olympiad (USAMO), which showed that these same models achieved low scores on novel mathematical proofs.
The new study, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” comes from a team at Apple led by Parshin Shojaee and Iman Mirzadeh, and it includes contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.
The researchers examined what they call “large reasoning models” (LRMs), which attempt to simulate a logical reasoning process by producing a deliberative text output sometimes called “chain-of-thought reasoning” that ostensibly assists with solving problems in a step-by-step fashion.
To do that, they pitted the AI models against four classic puzzles - Tower of Hanoi (moving disks between pegs), checker jumping (swapping colored pieces), river crossing (transporting items with constraints), and blocks world (stacking blocks) - scaling them from trivially easy (like one-disk Hanoi) to extremely complex (20-disk Hanoi requiring over a million moves).
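For reference, the optimal Tower of Hanoi solution is a textbook recursion whose move count grows as 2^n - 1 with the number of disks, which is where the million-plus figure for 20 disks comes from. The sketch below is illustrative Python, not the researchers' evaluation code:

```python
# Minimal illustrative Tower of Hanoi solver (not the paper's evaluation code).
# The optimal solution for n disks always takes 2**n - 1 moves, so 20 disks
# requires 1,048,575 moves.

def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Append the (from_peg, to_peg) moves that solve n-disk Hanoi."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)  # park the n-1 smaller disks on the spare peg
    moves.append((src, dst))            # move the largest remaining disk
    hanoi(n - 1, aux, src, dst, moves)  # stack the smaller disks back on top
    return moves

if __name__ == "__main__":
    assert len(hanoi(10)) == 2**10 - 1  # 1,023 moves
    for n in (1, 10, 20):
        print(f"{n} disks -> {2**n - 1:,} optimal moves")
```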
“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy,” the researchers write. In other words, today’s tests only care if the model gets the right answer to math or coding problems that may already be in its training data—they don’t examine whether the model actually reasoned its way to that answer or simply pattern-matched from examples it had seen before.
Ultimately, the researchers found results consistent with the aforementioned USAMO research, showing that these same models achieved mostly under 5 percent on novel mathematical proofs, with only one model reaching 25 percent, and not a single perfect proof among nearly 200 attempts. Both research teams documented severe performance degradation on problems requiring extended systematic reasoning.
Known skeptics and new evidence
AI researcher Gary Marcus, who has long argued that neural networks struggle with out-of-distribution generalization, called the Apple results “pretty devastating to LLMs.” While Marcus has been making similar arguments for years and is known for his AI skepticism, the new research provides fresh empirical support for his particular brand of criticism.
“It is truly embarrassing that LLMs cannot reliably solve Hanoi,” Marcus wrote, noting that AI researcher Herb Simon solved the puzzle in 1957 and many algorithmic solutions are available on the web. Marcus pointed out that even when researchers provided explicit algorithms for solving Tower of Hanoi, model performance did not improve - a finding that study co-lead Iman Mirzadeh argued shows “their process is not logical and intelligent.”
The Apple team found that simulated reasoning models behave differently from “standard” models (like GPT-4o) depending on puzzle difficulty. On easy tasks, such as Tower of Hanoi with just a few disks, standard models actually won because reasoning models would “overthink” and generate long chains of thought that led to incorrect answers. On moderately difficult tasks, SR models’ methodical approach gave them an edge. But on truly difficult tasks, including Tower of Hanoi with 10 or more disks, both types failed entirely, unable to complete the puzzles, no matter how much time they were given.
The researchers also identified what they call a “counterintuitive scaling limit.” As problem complexity increases, simulated reasoning models initially generate more thinking tokens but then reduce their reasoning effort beyond a threshold, despite having adequate computational resources.
The study also revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in Tower of Hanoi but failed after just five moves in a river crossing puzzle—despite the latter requiring fewer total moves. This suggests the failures may be task-specific rather than purely computational.
Competing interpretations emerge
However, not all researchers agree with the interpretation that these results demonstrate fundamental reasoning limitations. University of Toronto economist Kevin A. Bryan argued on X that the observed limitations may reflect deliberate training constraints rather than inherent inabilities.
“If you tell me to solve a problem that would take me an hour of pen and paper, but give me five minutes, I’ll probably give you an approximate solution or a heuristic. This is exactly what foundation models with thinking are RL’d to do,” Bryan wrote, suggesting that models are specifically trained through reinforcement learning (RL) to avoid excessive computation.
Bryan suggests that unspecified industry benchmarks show “performance strictly increases as we increase in tokens used for inference, on ~every problem domain tried,” but notes that deployed models intentionally limit this to prevent “overthinking” simple queries. This perspective suggests the Apple paper may be measuring engineered constraints rather than fundamental reasoning limits.
Software engineer Sean Goedecke offered a similar critique of the Apple paper on his blog, noting that when faced with Tower of Hanoi requiring over 1,000 moves, DeepSeek-R1 “immediately decides ‘generating all those moves manually is impossible,’ because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails.” Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it.
Other researchers also question whether these puzzle-based evaluations are even appropriate for LLMs. Independent AI researcher Simon Willison told Ars Technica in an interview that the Tower of Hanoi approach was “not exactly a sensible way to apply LLMs, with or without reasoning,” and suggested the failures might simply reflect running out of tokens in the context window (the maximum amount of text an AI model can process) rather than reasoning deficits. He characterized the paper as potentially overblown research that gained attention primarily due to its “irresistible headline” about Apple claiming LLMs don’t reason.
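Willison's context-window objection can be sanity-checked with rough arithmetic. The sketch below assumes, purely for illustration, about 10 output tokens per written-out move (a guessed figure, not a measurement) and compares a full move list against a 64,000-token budget like the one used in the paper:

```python
# Rough check of the "ran out of tokens" explanation.
# TOKENS_PER_MOVE is an assumed illustrative value, not a measured one.
TOKENS_PER_MOVE = 10
TOKEN_BUDGET = 64_000  # output budget reported in the Apple paper

for disks in (8, 10, 12, 15, 20):
    moves = 2**disks - 1                     # optimal move count for Tower of Hanoi
    tokens_needed = moves * TOKENS_PER_MOVE  # rough size of a fully written-out answer
    verdict = "fits" if tokens_needed <= TOKEN_BUDGET else "exceeds the budget"
    print(f"{disks:>2} disks: {moves:>9,} moves, ~{tokens_needed:>9,} tokens ({verdict})")
```

Under that assumption, a fully enumerated answer for 15 or 20 disks cannot fit in the budget regardless of reasoning ability, which is the crux of the objection.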
The Apple researchers themselves caution against over-extrapolating the results of their study, acknowledging in their limitations section that “puzzle environments represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.” The paper also acknowledges that reasoning models show improvements in the “medium complexity” range and continue to demonstrate utility in some real-world applications.
Implications remain contested
Has the credibility of claims about AI reasoning models been completely destroyed by these two studies? Not necessarily.
What these studies may suggest instead is that the kinds of extended context reasoning hacks used by SR models may not be a pathway to general intelligence, as some have hoped. In that case, the path to more robust reasoning capabilities may require fundamentally different approaches rather than refinements to current methods.
As Willison noted above, the results of the Apple study have so far been explosive in the AI community. Generative AI is a controversial topic, with many people gravitating toward extreme positions in an ongoing ideological battle over the models’ general utility. Many proponents of generative AI have contested the Apple results, while critics have latched onto the study as a definitive knockout blow for LLM credibility.
Apple’s results, combined with the USAMO findings, seem to strengthen the case made by critics like Marcus that these systems rely on elaborate pattern-matching rather than the kind of systematic reasoning their marketing might suggest. To be fair, much of the generative AI space is so new that even its inventors do not yet fully understand how or why these techniques work. In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs.
However, that doesn’t mean these AI models are useless. Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people who use them, given an understanding of their drawbacks and confabulations. As Marcus concedes, “At least for the next decade, LLMs (with and without inference time “reasoning”) will continue [to] have their uses, especially for coding and brainstorming and writing.”
What Apple’s controversial research paper really tells us about LLMs
Reasoning models have limits. Here’s what you can and can’t expect from them, according to Apple’s tests.
Written by Sabrina Ortiz, Senior Editor
June 17, 2025 at 10:06 a.m. PT
Generative AI models quickly proved they were capable of performing technical tasks well. Adding reasoning capabilities unlocked unforeseen abilities, enabling the models to think through more complex questions and produce better-quality, more accurate responses – or so we thought.
Last week, Apple released a research report called “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” As the title reveals, the 30-page paper dives into whether large reasoning models (LRMs), such as OpenAI’s o1 models, Anthropic’s Claude 3.7 Sonnet Thinking (which is the reasoning version of the base model, Claude 3.7 Sonnet), and DeepSeek R1, are capable of delivering the advanced “thinking” they advertise.
(Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
Apple carried out the investigation through a series of experiments built around diverse puzzles that tested models beyond the scope of traditional math and coding benchmarks. The results showed that even the smartest models hit a point of diminishing returns: they increase their reasoning effort as a problem’s complexity grows, but only up to a limit.
I encourage you to read it if you are remotely interested in the subject. However, if you don’t have the time and just want the bigger themes, I unpack it for you below.
What are large reasoning models (LRMs)?
In the research paper, Apple uses “large reasoning models” when referring to what we would typically just call reasoning models. This type of large language model (LLM) was first popularized by the release of OpenAI’s o1 model, which was later followed by its release of o3.
The concept behind LRMs is simple. Just as humans are encouraged to think before they speak so that what they say carries more value, a model that is encouraged to spend more time working through a prompt should produce higher-quality answers and be able to handle more complex prompts well.
Methods such as “Chain-of-Thought” (CoT) also enable this extra thinking. CoT encourages an LLM to break down a complex problem into logical, smaller, and solvable steps. The model sometimes shares these reasoning steps with users, making the model more interpretable and allowing users to better steer its responses and identify errors in reasoning. The raw CoT is often kept private to prevent bad actors from seeing weaknesses, which could tell them exactly how to jailbreak a model.
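As a concrete illustration of the idea (a generic sketch, not any vendor’s specific API or the paper’s prompts), the difference between a direct prompt and a chain-of-thought prompt can be as small as explicitly asking for intermediate steps:

```python
# Illustrative prompt construction only; no model is actually called here.
QUESTION = "A train leaves at 3:40 pm and arrives at 6:05 pm. How long is the trip?"

direct_prompt = f"{QUESTION}\nAnswer with just the duration."

cot_prompt = (
    f"{QUESTION}\n"
    "Think through the problem step by step, writing out each intermediate step, "
    "then give the final answer on its own line."
)

# A reasoning model effectively does the second style internally, generating a
# chain of intermediate steps (its "thinking" tokens) before the final answer.
print(cot_prompt)
```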
This extra processing means these models require more compute power and are therefore more expensive or token-heavy, and take longer to return an answer. For that reason, they are not meant for broad, everyday tasks, but rather reserved for more complex or STEM-related tasks.
This also means that the benchmarks used to test these LRMs are typically related to math or coding, which is one of Apple’s first qualms in the paper. The company said that these benchmarks emphasize the final answer, pay less attention to the reasoning process, and are often subject to data contamination. As a result, Apple set up a new experiment paradigm.
The experiments
Apple set up four controllable puzzles: Tower of Hanoi, which involves transferring disks across pegs; Checker Jumping, which involves sliding and jumping checkers to swap the positions of two colored sets; River Crossing, which involves ferrying actors and their agents across a river without violating constraints; and Blocks World, which involves restacking blocks into a target configuration.
Understanding why the experiments were chosen is key to understanding the paper’s results. Apple chose puzzles to better understand the factors that influence what existing benchmarks identify as better performance. Specifically, the puzzles allow for a more “controlled” environment in which difficulty can be scaled up or down while the underlying logic required to solve them stays the same.
“These environments allow for precise manipulation of problem complexity while maintaining consistent logical processes, enabling a more rigorous analysis of reasoning patterns and limitations,” the authors explained in the paper.
The puzzles were used to compare the “thinking” and “non-thinking” versions of popular models, including Claude 3.7 Sonnet and DeepSeek’s R1 and V3. The authors manipulated the difficulty by increasing the problem size.
The last important element of the setup is that all the models were given the same maximum token budget (64k). Then, 25 samples were generated with each model, and the average performance of each model across them was recorded.
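Conceptually, the setup resembles the evaluation loop sketched below. The two helper functions are hypothetical placeholders, not the paper’s code; only the loop’s shape follows the description above: scale the problem size, draw 25 samples per size within the 64k-token budget, and average the results.

```python
# Conceptual sketch of the evaluation setup described above. The helpers are
# hypothetical stand-ins, not Apple's harness; only the loop structure
# (scale complexity, 25 samples per size, 64k-token budget, average accuracy)
# mirrors the written description.
SAMPLES_PER_SIZE = 25
MAX_TOKENS = 64_000

def solve_with_model(model, puzzle, size, max_tokens):
    """Hypothetical stand-in for querying a model; returns a proposed move list."""
    return []  # a real harness would call the model's API here

def is_valid_solution(puzzle, size, moves):
    """Hypothetical stand-in for a rule-checking simulator that replays the moves."""
    return False  # a real simulator would verify every move and the end state

def accuracy_by_size(model, puzzle, sizes):
    results = {}
    for size in sizes:  # e.g. number of disks, or number of actor/agent pairs
        correct = sum(
            is_valid_solution(puzzle, size, solve_with_model(model, puzzle, size, MAX_TOKENS))
            for _ in range(SAMPLES_PER_SIZE)
        )
        results[size] = correct / SAMPLES_PER_SIZE
    return results
```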
The results
The findings showed that thinking and non-thinking models have different advantages at different levels of difficulty. In the first regime, when problem complexity is low, non-thinking models can perform at the same level as, if not better than, thinking models while being more time-efficient.
The biggest advantage of thinking models lies in the second, medium-complexity regime, where the performance gap between thinking and non-thinking models widens significantly, as illustrated in the paper’s figures. Then, in the third regime, where problem complexity is highest, the performance of both model types fell to zero.
“Results show that while thinking models delay this collapse, they also ultimately encounter the same fundamental limitations as their non-thinking counterparts,” said the authors.
They observed a similar collapse when testing five state-of-the-art thinking models - o3-mini (medium and high configurations), DeepSeek-R1, DeepSeek-R1-Qwen-32B, and Claude 3.7 Sonnet Thinking - on the same puzzles used in the first experiment. The same pattern emerged: as complexity grew, accuracy fell, eventually dropping to zero.
Even more interesting is the change in the number of thinking tokens used. Initially, as the puzzles grow in complexity, the models allocate more of the tokens needed to work through the problem. However, as the models approach their accuracy drop-off point, they start reducing their reasoning effort, even though the problems are more difficult and they would be expected to use more.
The paper identifies other shortcomings: for example, even when prompted with the exact steps needed to solve a problem, thinking models were still unable to execute them accurately, even though following a given algorithm should, in principle, be the easier task.
What does this mean?
The public’s perception of the paper has been split on what it really means for users. While some users have found comfort in the paper’s results, saying it shows that we are further from AGI than tech CEOs would have us believe, many experts have identified methodology issues.
The overarching criticisms include that the higher-complexity problems would require a larger token allowance to write out a full solution than the 64k budget Apple allocated to the models. Others noted that some models that might have performed well, such as o3-mini and o4-mini, weren’t included in the experiment. One user even fed the paper to o3 and asked it to identify methodology issues. ChatGPT had a few critiques, such as the token ceiling and questions of statistical soundness, as seen below.
I asked o3 to analyse and critique Apple’s new “LLMs can’t reason” paper. Despite its inability to reason I think it did a pretty decent job, don’t you? pic.twitter.com/jvwqt3NVrt — rohit (@krishnanrohit) June 9, 2025
My interpretation: If you take the paper’s results at face value, the authors do not explicitly say that LRMs are not capable of reasoning or that it is not worth using them. Rather, the paper points out that there are some limitations to these models that could still be researched and iterated on in the future – a conclusion that holds true for most advancements in the AI space.
The paper serves as yet another good reminder that none of these models are infallible, regardless of how advanced they claim to be or even how they perform on benchmarks. Evaluating an LLM against a benchmark has its own set of issues, as benchmarks often test only narrow, high-level tasks that don’t translate accurately into the everyday applications of these models.
The Illusion of Thinking: Why Today’s AI Models Aren’t Really Reasoning
Manjunath Rs
Aug 20, 2025
Reasoning vs. Thinking — A Subtle but Crucial Difference
The Oxford English Dictionary defines reasoning as:
“The action of thinking about something in a logical way in order to form a conclusion or judgment.”
In plain words, reasoning means following logical steps that reliably lead to a conclusion.
Thinking, on the other hand, is broader. It can be logical, but it can also be associative, wandering, trial-and-error. You can “think aloud” without ever truly reasoning.
That difference matters — especially now that we have a new generation of AI models claiming to “think.”
Large Reasoning Models: The New Frontier
Over the past year, AI labs have rolled out Large Reasoning Models (LRMs) — systems like OpenAI’s o-series, Claude’s “thinking” mode, DeepSeek R1, and Gemini Thinking.
Unlike traditional large language models (LLMs) that jump straight to an answer, LRMs generate long chains of thought before answering. They look like they’re solving problems step by step, just like humans scribbling notes on scratch paper.
But here’s the burning question:
👉 Are LRMs actually reasoning — or are they just thinking out loud?
The Classroom Analogy: Two Students
To make this concrete, imagine two students in a classroom puzzle challenge:
- Student A (the “thinking” student): Scribbles a lot of notes. Talks through every idea: “Maybe I move disk 1… hmm, or disk 2… no, let’s backtrack.” They look busy, but often wander in circles.
- Student B (the “reasoning” student): Knows the method. Applies logic step by step. “First move n−1 disks to the helper peg, then move the largest disk, then repeat recursively.” They may write less, but they consistently get to the right answer.
Today’s LRMs are Student A. They produce thoughts that look like reasoning, but break down when problems get complex.
The Test: Puzzles That Scale in Difficulty
Instead of math exams (which might overlap with training data), researchers at Apple tested LRMs with puzzles:
- Tower of Hanoi — move disks between pegs with strict rules.
- Checker Jumping — swap red and blue checkers by sliding and jumping.
- River Crossing — actors and agents must cross safely without breaking constraints.
- Blocks World — rearrange stacks of blocks into a target pattern.
The beauty of these puzzles is that difficulty can be controlled: 3 disks vs. 10 disks, 2 actors vs. 5 actors. This let the researchers see how LRMs scale with complexity.
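Part of the appeal of these puzzles is that a candidate answer can be checked mechanically, move by move. Below is a minimal sketch of such a checker for Tower of Hanoi, assuming moves are given as (from_peg, to_peg) pairs; it illustrates the idea rather than reproducing the researchers’ simulator.

```python
# Minimal Tower of Hanoi checker: replays (from_peg, to_peg) moves and enforces
# the rules (move one disk at a time, never place a larger disk on a smaller one).
def check_hanoi(n_disks, moves):
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # peg A holds disks n..1
    for src, dst in moves:
        if not pegs[src]:
            return False  # illegal: no disk to move from the source peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # illegal: larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n_disks, 0, -1))  # solved if all disks end on peg C

# The optimal 2-disk solution passes; an illegal sequence fails.
print(check_hanoi(2, [("A", "B"), ("A", "C"), ("B", "C")]))  # True
print(check_hanoi(2, [("A", "C"), ("A", "C")]))              # False
```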
The Results: Three Regimes of Complexity
Here’s where the story gets interesting:
- Easy problems: Non-thinking models (traditional LLMs) actually did better. Student A over-explains; Student B (or even a memorizer) nails it quickly.
- Medium problems: LRMs’ “thinking aloud” helped. They explored alternatives, sometimes stumbled onto the right solution. Student A shines here.
- Hard problems: Both LRMs and LLMs collapsed. Accuracy dropped to zero. Student A wrote shorter and shorter notes, as if giving up. Student B never showed up — because true reasoning isn’t built into these models yet.
The Strange Behaviors Inside the Thoughts
The researchers dug into the LRMs’ internal “thoughts” and found some fascinating patterns:
- Overthinking: On simple problems, LRMs often found the correct answer early, but kept exploring wrong ones — wasting time and tokens.
- Late breakthroughs: On medium problems, LRMs wandered through wrong paths before stumbling into the right one.
- Total collapse: On hard problems, no correct paths ever appeared — just noise.
Even more surprising: when given the correct algorithm (step-by-step Tower of Hanoi solution), LRMs still failed at the same difficulty. In other words, they couldn’t reliably follow instructions.
The Illusion of Thinking
So, what’s going on?
- LRMs produce traces of thought that look like reasoning.
- But by Oxford’s definition, they aren’t consistently following logical steps to conclusions.
- They’re closer to Student A: good at narrating thinking, but bad at executing logic when it matters most.
The illusion is seductive — when you read their “thoughts,” you think, “Wow, this model is reasoning just like us.” But under the hood, it’s often just pattern-matching with a longer trail of text.
Why This Matters
This isn’t just academic. Many are touting LRMs as a step toward general intelligence. But if their “reasoning” collapses under real complexity, we’re still far from models that can be trusted for robust planning, safety-critical decisions, or deep logic.
As the authors put it:
- LRMs help at medium complexity.
- But they fail to develop generalizable reasoning capabilities.
- Worse, they sometimes reason less when problems get harder — a counterintuitive scaling failure.
Reasoning, in the true Oxford sense, requires consistent logic that scales with complexity. Today’s LRMs don’t have that yet.
They are brilliant Student A’s — thinking out loud, sometimes lucky, sometimes lost. But the real breakthrough will come when we finally build Student B: a model that not only thinks, but reasons.