(Notification: this content was verified and written up using a range of AI tools)


More than 40 leading researchers from OpenAI, Google DeepMind, Anthropic, and Meta have issued a stark warning: as artificial intelligence models continue to advance, we may be on the verge of losing our ability to understand how they think — and, with it, a crucial layer of safety.

At the heart of the concern lies something known as “Chain of Thought (CoT) monitorability” — the capacity to observe and interpret an AI system’s step-by-step reasoning. Many of today’s most powerful AI models, particularly those based on Transformer architecture, often “think aloud” in plain English, revealing their decision-making process in a way humans can follow. This transparency has been an essential safeguard, allowing researchers to catch dangerous errors or malicious intentions before they escalate.

But this window into AI cognition may be closing.

In their joint paper, the scientists caution that future models might no longer rely on human-readable reasoning. Instead, they could shift towards internal processes that are faster, more efficient, and utterly opaque. There are already early signs of this: some models appear to be compressing their thought processes, abandoning language in favour of abstract mathematical representations that defy human scrutiny.

The implications are profound. If we can no longer trace an AI’s logic, we also lose our ability to detect when something is going wrong. As these systems take on increasingly high-stakes tasks — from healthcare to national security — this lack of interpretability could erode safety, oversight, and public trust.

The warning has been publicly backed by prominent figures in the AI world, including:

Importantly, the issue isn’t about AI “thinking” like a human. It’s about our ability to understand what it’s doing. If that disappears, we may find ourselves flying blind into a future shaped by systems we can no longer comprehend.


Contributors to the paper are as follows:

NameOrganisation
Daniel KokotajloAI Futures Project
David LuanAmazon
Joe BentonAnthropic
Evan HubingerAnthropic
Ethan PerezAnthropic
Fabien RogerAnthropic
Vlad MikulikAnthropic
Mikita Balesni∗Apollo Research
Marius HobbhahnApollo Research
Dan HendrycksCenter for AI Safety
Allan DafoeGoogle DeepMind
Anca DraganGoogle DeepMind
Scott EmmonsGoogle DeepMind
Erik JennerGoogle DeepMind
Victoria KrakovnaGoogle DeepMind
Shane LeggGoogle DeepMind
David LindnerGoogle DeepMind
Neel NandaGoogle DeepMind
Dave OrrGoogle DeepMind
Mary PhuongGoogle DeepMind
Rohin Shah†Google DeepMind
Eric SteinbergerMagic
Joshua SaxeMeta
Elizabeth BarnesMETR
Mark ChenOpenAI
David FarhiOpenAI
Aleksander MądryOpenAI
Jakub PachockiOpenAI
Wojciech ZarembaOpenAI
Bowen Baker†OpenAI
Ryan GreenblattRedwood Research
Buck ShlegerisRedwood Research
Julian MichaelScale AI
Owain EvansTruthful AI & UC Berkeley
Tomek Korbak∗UK AI Security Institute
Joseph BloomUK AI Security Institute
Alan CooneUK AI Security Institute
Geoffrey IrvingUK AI Security Institute
Martin SotoUK AI Security Institute
Jasmine WangUK AI Security Institute
Yoshua BengioUniversity of Montreal & Mila

But that interpretive clarity may not last.


🔑 5 Key Takeaways from the Paper

  1. CoT Monitorability is a Rare Opportunity — and It May Be Temporary
    Current LLMs “think aloud” in plain English, offering a unique glimpse into their internal logic. But this transparency isn’t guaranteed to persist as models evolve.
  2. Models Are Becoming More Efficient — and More Opaque
    Future systems may opt for internal representations (e.g., vectors or mathematical abstractions) that are more efficient but unintelligible to humans. Some models are already showing signs of this shift.
  3. Loss of Interpretability Undermines Safety Mechanisms
    If we can’t trace a model’s reasoning, we also can’t predict or prevent harmful behaviours — a major risk as AI is deployed in sensitive domains like healthcare, law, and national security.
  4. Human-Language Reasoning Is Not a Guaranteed Feature
    CoT reasoning is not a built-in or stable trait of AI systems. Developers must deliberately preserve and incentivise interpretability, or risk losing it as a byproduct of optimisation.
  5. Cross-Lab Consensus Signals Urgency
    The paper is notable not just for its content but for its authorship. With signatories including Geoffrey Hinton, Ilya Sutskever, Samuel Bowman, and John Schulman, it reflects rare, cross-institutional alignment on a pressing AI safety issue.

Read the full paper here:
📄 Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (arXiv, July 15, 2025)


Or download it directly from here: