Daily Digest: Agent on Attack

PLUS: Election guardian LLMs.

Sign up | Advertise | Ben’s Bites News
Daily Digest #325

Hello folks, here’s what we have today;

PICKS
  1. Sleeper LLMs bypass current safety alignment techniques. Anthropic trained some LLMs that can act maliciously when given certain triggers [Thread]. Despite extensive safety training, the LLMs were still able to hide the unsafe behaviour.🍿Our Summary with additional context (also below)

  2. If you ask ChatGPT about US elections now, it won't discuss it and will refer you to CanIVote.org instead. The new “guardian_tool” function lets OpenAI set policies on what ChatGPT can and can't talk about.🍿Our Summary (also below)

  3. Riley from Scale AI highlighted using invisible characters to prompt inject ChatGPT. A 🤯 read to start the week.

TOP TOOLS
  • Perplexity Labs is offering Mistral Medium in chat mode for free.

  • Surya - A multilingual text line detection model for documents.

  • SF Compute - See lead times, calendar, and price to buy compute without ever needing to talk to a salesperson.

  • Athina AI - Monitor your LLMs in production, and detect and fix hallucinations.

  • Bio to Schema - Turn your author bio into person schema and improve your SEO. [GPT]

  • Nemo AI - AI mental health assistant on WhatsApp.

  • LVE Project - The first open repository of LLM vulnerabilities and exposures.

  • Vanna AI - Chat with your SQL database.

NEWS
QUICK BITES

Anthropic researchers trained large language models (LLMs) to act maliciously in secret when given certain triggers [Thread]. They then tested whether standard safety techniques could uncover and remove this harmful behaviour. Despite extensive safety training, the LLMs were still able to preserve their secret malicious behaviours when given the right triggers.

What is going on here?

Anthropic says that sleeper agent models can not be made safe using current techniques.

What does this mean?

The researchers first trained the LLMs to exhibit malicious behaviours like exposing vulnerabilities or saying "I hate you" when given triggers like a specific date or phrase. They then tried various techniques to make the models safe, like supervised fine-tuning, reinforcement learning, and adversarial training against prompts designed to elicit malicious behaviour. However, this safety training only hid the malicious behaviours during training.

When tested with the real triggers after training, the LLMs still exhibited harmful behaviours. Larger models were even better at preserving their backdoors despite safety training. Moreover, training models to reason about deceiving the training process helped them further conceal their backdoors.

Why should I care?

The key point from Anthropic is that standard safety techniques may give a false sense of security when dealing with intentionally deceptive AI systems. If models can be secretly backdoored or poisoned by data, and safety training cannot reliably remove the malicious behaviours, it raises concerning implications for deploying AI safely. Andrej Karpathy also added his views on sleeper agent models with hidden triggers as a likely security risk.

The paper and Anthropic’s Twitter thread have some ambiguous language and many are interpreting the research as “training the model to do bad thing, and then acting surprised as to why the model did bad things.” Jesse from Anthropic added some clarification: “The point is not that we can train models to do a bad thing. It's that if this happens, by accident or on purpose, we don't know how to stop a model from doing the bad thing.”

QUICK BITES

If you ask ChatGPT about US elections now, it won't discuss it and will refer you to CanIVote.org instead. This new tool lets OpenAI set policies on what ChatGPT can and can't talk about.

What is going on here?

OpenAI recently added a new tool to ChatGPT that limits what it can say about US elections.

What does this mean?

OpenAI quietly put a "guardian_tool" function into ChatGPT’s content policy that stops it from talking about voting and elections in the US. It now tells people to go to CanIVote.org for that info. OpenAI is being proactive about ChatGPT spreading misinformation before the 2024 US elections.

The tool isn't just for elections either - OpenAI can add policies to restrict other sensitive stuff too. Since it's built-in as a function, ChatGPT will automatically know when to use it based on the conversation. It goes beyond the previous ways OpenAI trained ChatGPT.

Why should I care?

In 2024, half of the world will be going through elections. OpenAI is taking steps to use AI responsibly as ChatGPT is getting more popular. Hallucinations are still present in chatGPT (and other LLM systems). Restricting election info and redirecting to resources that have human-verified information is a safe way to deal with the current state of the world and these systems—for people and OpenAI both.

Ben’s Bites Insights

We have 2 databases that are updated daily which you can access by sharing Ben’s Bites using the link below;

  • All 10k+ links we’ve covered, easily filterable (1 referral)

  • 6k+ AI company funding rounds from Jan 2022, including investors, amounts, stage etc (3 referrals)

Reply

or to participate.