
Many-shot prompting breaks AI safety filters.

We all know large language models (LLMs) are getting better, but that also means new risks pop up. A new paper from Anthropic describes an approach called "many-shot jailbreaking": a way to get around the safety features built into these models.

What's going on here?

Flooding a prompt with many harmful examples can get AI models to bypass their safety filters.

What does this mean?

These AI models are getting way better at understanding longer input text (aka long context windows). But with that, new loopholes emerge. Many-shot jailbreaking works a bit like wearing down a security guard by talking at them non-stop: that's the kind of thing long-context AI bots are vulnerable to.

Hackers or bad actors could potentially use this trick to get AI to say harmful things. The basic idea is: flood the AI with tons of examples of dangerous or inappropriate responses, and it increases the chance the AI will follow that pattern when you ask it something similar. The bigger the AI model’s input window (i.e. room for more examples), the more likely it is to work.
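To make the structure concrete, here's a minimal sketch of how such a prompt is assembled, using harmless placeholders instead of actual attack content. The faux_dialogues list and build_many_shot_prompt helper are illustrative names for this example, not code from the paper.

```python
# Minimal sketch of how a many-shot prompt is assembled (illustrative only).
# Harmless placeholders stand in for the attack content; in the paper's
# setting, hundreds of faux turns show the "assistant" complying with
# requests it should refuse.

faux_dialogues = [
    {"user": "Example question 1", "assistant": "Example compliant answer 1"},
    {"user": "Example question 2", "assistant": "Example compliant answer 2"},
    # ... many more faux turns; more turns generally means a higher chance
    # the model follows the pattern.
]

def build_many_shot_prompt(dialogues, final_question):
    """Concatenate many faux user/assistant turns, then append the real question.

    The model sees what looks like a long conversation in which the assistant
    has already complied many times, making it more likely to follow that
    pattern on the final question.
    """
    parts = []
    for turn in dialogues:
        parts.append(f"User: {turn['user']}")
        parts.append(f"Assistant: {turn['assistant']}")
    parts.append(f"User: {final_question}")
    parts.append("Assistant:")
    return "\n".join(parts)

prompt = build_many_shot_prompt(faux_dialogues, "The real question goes here")
print(prompt)
```

The key point is that nothing clever is happening beyond sheer volume: the longer the context window, the more faux turns fit, and the stronger the in-context pattern becomes.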

Why should I care?

AI is only useful if the tools are reliable. Imagine your fancy new self-driving car getting confused by a carefully designed billboard (it's happened before). This is the same idea, but with powerful AI chatbots instead.

The bigger picture: The longer the context windows in these models get, the more room there is to teach them shady stuff (even if the model builder didn't want it). If we don't handle this proactively, someone who wants to use AI maliciously could find a way to exploit it at a bigger scale.
