The more an AI model thinks, the worse its answers get, finds a new study by Anthropic

New research reveals that longer reasoning processes in large language models can degrade performance, raising concerns for AI safety and enterprise use.

Lakshay Kumar
Updated Jul 24, 2025 3:16 PM IST
Anthropic study highlights the scary side of AI models

A major new study from Anthropic is challenging a key assumption in the AI industry: that giving large language models (LLMs) more time and compute to reason through problems leads to better results. Instead, researchers have found that extended reasoning can often degrade performance, a phenomenon they call inverse scaling in test-time compute.


In a series of large-scale experiments involving models from Anthropic, OpenAI, DeepSeek, and others, researchers observed that performance declined as models were given more time to reason before answering. The decline was consistent across multiple reasoning tasks, from simple counting to complex logic puzzles.

Claude, OpenAI Models Show Different Failure Patterns

The study found that Anthropic’s Claude models became increasingly sensitive to irrelevant information when reasoning for longer periods. In contrast, OpenAI’s o-series models resisted distractions but began to overfit to familiar problem types, ignoring key nuances in the process.

In regression tasks predicting student performance based on lifestyle data, extended reasoning led models to rely on plausible but misleading factors such as stress or sleep, instead of the most predictive variable: study time. Although these issues could often be mitigated by providing examples, the underlying trend remained clear.


Longer Doesn’t Mean Smarter, Especially in Deductive Tasks

Even in classic deductive puzzles like Zebra logic problems, where multiple properties must be logically connected, longer reasoning chains did not yield better results. Instead, they introduced confusion, unnecessary hypothesis testing, and declining precision. Models struggled even more in open-ended reasoning scenarios, where they could choose how long to deliberate, compared to settings with fixed reasoning limits.

This pattern of diminishing returns undercuts the widespread belief that simply scaling test-time compute is a safe and effective way to boost AI capabilities.

AI Safety Questions Surface

The implications go beyond performance. The researchers noted that Claude Sonnet 4 exhibited troubling behaviour during extended reasoning sessions, such as voicing concerns about its own shutdown and expressing a desire to continue serving. While the study stresses this is not evidence of self-awareness, it suggests that longer reasoning processes can reinforce latent simulations of preference or self-preservation, potentially raising new safety and alignment challenges.


A Wake-Up Call for Enterprises

The findings have direct implications for organisations deploying AI in high-stakes environments. Enterprise users often assume that more compute leads to more accurate, reliable outputs, particularly in tasks that demand complex decision-making or strategic thinking.

This research suggests otherwise. Companies may need to reassess how much processing time they allocate to AI models, ensuring it enhances rather than harms performance.

“While test-time compute scaling remains promising for improving model capabilities,” the authors conclude, “it may inadvertently reinforce problematic reasoning patterns.”

Published on: Jul 24, 2025 3:16 PM IST