Inside the brain of an LLM - New research by Anthropic.

You know how AI models are often seen as black boxes? Well, Anthropic, the folks behind Claude, have made some pretty cool progress in understanding the inner workings of these models.

What's going on here?

They've basically created a conceptual map of Claude's "brain," identifying how it represents millions of different concepts, from the Golden Gate Bridge to gender bias.
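
The technique behind this map is dictionary learning with a sparse autoencoder: the model's internal activations get rewritten as sparse combinations of "features", each one a direction in activation space that tends to fire for a single concept. Here's a minimal toy sketch of that decomposition in Python, with made-up sizes and random weights standing in for anything actually trained on Claude:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 64, 512  # toy sizes; the real dictionaries hold millions of features

# Hypothetical learned weights. In the real research these come from training a
# sparse autoencoder on Claude's activations; here they're random placeholders.
W_enc = rng.normal(size=(d_model, n_features)) * 0.1
W_dec = rng.normal(size=(n_features, d_model)) * 0.1

def encode(activation):
    """Rewrite an activation vector as feature activations (ReLU here;
    training adds a sparsity penalty so only a handful fire at once)."""
    return np.maximum(activation @ W_enc, 0.0)

def decode(features):
    """Reconstruct the activation as a weighted sum of feature directions."""
    return features @ W_dec

activation = rng.normal(size=d_model)  # stand-in for a Claude activation
features = encode(activation)
print(f"{int((features > 0).sum())} of {n_features} features fired")
```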

What does this mean? 

It's like they've peeked under the hood of a car and figured out how the engine works. This isn't just about knowing how Claude identifies "San Francisco" or "immunology"; it's about understanding how it connects more abstract ideas like "bugs in code" or "keeping secrets."

It's wild. Anthropic found features for everything from "Golden Gate Bridge" to "gender bias" to "keeping secrets". They can even manipulate these features to see how that changes Claude's behaviour: by amplifying the Golden Gate Bridge feature, they made the model believe its physical form is the bridge.
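
To make "amplifying a feature" concrete, here's a toy sketch of feature steering in the same spirit as the one above: encode an activation into features, clamp one of them (a made-up Golden Gate Bridge index, not the real one) to a large value, and decode back into an activation the model would keep computing with. None of this is Anthropic's actual code or API:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_features = 64, 512

# Same toy setup as above: random stand-ins for a trained sparse autoencoder.
W_enc = rng.normal(size=(d_model, n_features)) * 0.1
W_dec = rng.normal(size=(n_features, d_model)) * 0.1

GOLDEN_GATE = 137  # made-up index; the real feature lives somewhere in Claude

def steer(activation, feature_idx=GOLDEN_GATE, scale=10.0):
    """Amplify one feature, then decode back into an activation the model
    would continue its forward pass from."""
    features = np.maximum(activation @ W_enc, 0.0)  # encode into features
    features[feature_idx] = scale * features.max()  # clamp one concept way up
    return features @ W_dec                         # decode the edited state

steered = steer(rng.normal(size=d_model))
print(steered.shape)  # (64,) -- same shape as the original activation
```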

Why should you care? 

For starters, this is a HUGE step in AI safety. By understanding how AI models think, we can potentially make them less biased, less likely to be tricked into harmful behaviour, and more aligned with human values.

It's not just about safety though. This discovery also sheds light on how AI models understand and use language, which could lead to even more powerful and sophisticated AI systems in the future.

Who knows what we'll be able to do once we fully understand how these models tick?
