A strange thing happened last week.

Anthropic was forced to take its newest AI models offline only days after releasing them.

The company’s new Fable 5 and Mythos 5 systems were designed to be some of the most powerful AI models ever released. But shortly after launch, researchers discovered ways to get around some of the models’ built-in safety measures.

Government officials soon got involved as fears spread that these systems could become powerful cybersecurity weapons in the wrong hands.

Maybe those concerns were justified, and maybe they weren’t.

But to me, they raise an obvious question that not enough people are asking.

How would anyone know?

What’s Inside the Box?

Modern AI systems aren’t like traditional software.

Engineers don’t sit down and write lines of code telling them exactly how to reason through a problem.

Instead, researchers train these systems and then observe their behavior.

The result is what many researchers call a black box.

We can see what goes in, and we can see what comes out.

But what happens in between is often much harder to explain.

That’s why companies like Anthropic spend so much time studying AI interpretability, or the science of understanding how these systems arrive at their conclusions.

And that brings us to this week’s chart.

Because a group of researchers recently performed a strange experiment.

They secretly modified an AI model’s internal state. Then they asked whether the model could detect that something had changed.

Chart: AI interpretability experiment (Image: Uzay Macar and Li Yang)

This chart might look complicated, but the basic idea is simple.

Researchers injected information directly into an AI model’s internal processing, then tested whether it could tell the difference between those injections and its normal thought process.
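For readers who like to see the mechanics, here is a minimal toy sketch of what "injecting" into a model's internal state can look like. It assumes the internal state is just a short list of numbers and that the injected information is a second list added on top of it. The function name `inject`, the vectors, and the strength value are all hypothetical illustrations, not the researchers' actual code.

```python
def inject(hidden_state, concept_vector, strength):
    """Return a copy of hidden_state with a scaled concept vector added.

    Toy illustration only: real models have thousands of numbers per
    layer, and the researchers' actual procedure is not shown here.
    """
    if len(hidden_state) != len(concept_vector):
        raise ValueError("vectors must have the same length")
    return [h + strength * c for h, c in zip(hidden_state, concept_vector)]


# Hypothetical values chosen only for illustration.
hidden = [0.5, -0.25, 1.0]    # the model's "normal" internal state
concept = [1.0, 0.0, -1.0]    # the information being injected

modified = inject(hidden, concept, strength=2.0)  # a real change
control = inject(hidden, concept, strength=0.0)   # nothing changed

print(modified)
print(control)
```

The control case matters as much as the modified one. Without a run where nothing is changed, there would be no way to tell a genuine detection from a model that simply says "something changed" every time.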

The chart compares three versions of the same model.

The first is the Base model, the raw AI system before it receives additional training.

The second is the Instruct model, which was trained to behave more like the helpful AI assistants most people interact with today.

The third is an Abliterated version of the model, where some of the refusal and safety behaviors were removed.

The blue line shows how often the model correctly detected a real change, while the orange line shows how often it falsely claimed that something changed when nothing had actually happened.
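The two lines can be thought of as two simple percentages. The sketch below shows one way they could be computed from a list of trials. The trial data is made up for illustration, and the function name `detection_rates` is a hypothetical helper, not something taken from the study.

```python
def detection_rates(trials):
    """Compute (hit_rate, false_alarm_rate) from a list of trials.

    Each trial is a pair of booleans:
      (was_injected, model_said_changed)
    """
    injected = [said for was, said in trials if was]
    clean = [said for was, said in trials if not was]
    # Guard against empty groups to avoid dividing by zero.
    hit_rate = sum(injected) / len(injected) if injected else 0.0
    false_alarm_rate = sum(clean) / len(clean) if clean else 0.0
    return hit_rate, false_alarm_rate


# Made-up example: 4 trials with a real change, 4 with no change.
trials = [
    (True, True), (True, False), (True, True), (True, True),
    (False, False), (False, True), (False, False), (False, False),
]

hit_rate, false_alarm_rate = detection_rates(trials)
print(f"Correct detections: {hit_rate:.0%}")
print(f"False alarms: {false_alarm_rate:.0%}")
```

Both numbers are needed to judge a model. One that always answers "yes, something changed" would score 100% on correct detections, but it would also score 100% on false alarms, which makes its answers useless.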

And the results are surprising.

The Base model performed poorly. When researchers secretly altered its internal processing, it often couldn’t distinguish a real change from no change at all.

But the Instruct model performed much better.

Somewhere during the additional training process, the model appears to have developed an ability to recognize when something unusual had happened inside its own processing.

And in several cases, the Abliterated model performed better still.

In other words, removing some of the AI’s safety and refusal behaviors actually improved the model’s ability to detect what was going on inside it.

That doesn’t mean the model became conscious or self-aware.

You can compare it to a computer server that detects when someone has tampered with its memory. The server isn’t aware of anything, but it can still recognize when something unusual has happened.

Researchers believe something similar happened here.

More importantly, they think capabilities like this could eventually help us better understand what’s happening inside advanced AI systems.

After all, these models have access to information that remains largely hidden from the people studying them.

Which means one way researchers could eventually learn more about advanced AI systems is by asking the systems themselves.

That might seem counterintuitive.

But it would give researchers something they’ve never really had before.

A window into what’s happening inside the model itself.

Here’s My Take

The primary goal of the AI industry has been to build more capable models.

But another challenge is gaining urgency.

Understanding them.

The controversy surrounding Anthropic’s latest models shows why we need to get a handle on this issue sooner rather than later.

Because it’s one thing to build a powerful AI system. It’s something else entirely to create a new form of intelligence while only partially understanding how it works.

So here’s my question to you:

If future AI systems become too complex for humans to fully understand on their own, would you trust AI to help explain what’s happening inside other AI models?

Or does that sound like asking the fox to guard the henhouse?

I’d love to hear what you think.

Let me know at dailydisruptor@banyanhill.com.

We won’t reveal your full name in the event we publish a response, so feel free to share your honest opinion.

Regards,

Ian King
Chief Strategist, Banyan Hill Publishing