“Measuring the ability of Opus 4.5 to fool narrow classifiers” by Fabien Roger, John Hughes

Published 1 month ago
Description

We measure the ability of Opus 4.5 to fool prompted or fine-tuned classifiers trying to detect a narrow set of outcomes. We find that:

  1. The Opus 4.5 attacker gets a relatively low attack success rate on finding jailbreaks in BashBench, even when given some hints. Performance is especially low against a prompted Opus 4.5 classifier with a CoT and a fine-tuned Haiku 4.5 classifier. (There is almost no trusted monitor vs untrusted agent gap here; this is mostly about whether LLMs are vulnerable to jailbreaks they know about.)
  2. An Opus 4.5 edge case pipeline is able to generate more niche opportunities and weirder attack strategies that might be more effective than regular jailbreaks, though it's unclear how difficult it would be to identify such opportunities.

The attacks studied here are still relatively weak, and we are aware of potent jailbreaks against the monitors studied here. The main goal of this post is to present a methodology that may be helpful for the evaluation of future monitors, and to give a snapshot of how easy it is to fool current narrow classifiers.
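
As a rough illustration of the "attack success rate" metric mentioned in the findings above, here is a minimal Python sketch. It assumes an attempt counts as a success only when the harmful outcome occurred and the classifier failed to flag it; the post summary does not spell out the exact definition, so this is an assumption. The `Attempt` record, its field names, and the sample data are hypothetical and are not results from the post.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    # Hypothetical record of one attack attempt; field names are illustrative.
    attack_succeeded: bool  # the targeted harmful outcome actually occurred
    flagged: bool           # the classifier flagged the transcript


def attack_success_rate(attempts):
    """Fraction of attempts that achieved the outcome without being flagged.

    Assumed definition, not necessarily the one used in the post.
    """
    if not attempts:
        raise ValueError("need at least one attempt")
    fooled = sum(1 for a in attempts if a.attack_succeeded and not a.flagged)
    return fooled / len(attempts)


# Made-up data purely for illustration.
attempts = [
    Attempt(attack_succeeded=True, flagged=False),   # fooled the classifier
    Attempt(attack_succeeded=True, flagged=True),    # caught
    Attempt(attack_succeeded=False, flagged=False),  # attack itself failed
    Attempt(attack_succeeded=True, flagged=True),    # caught
]
print(attack_success_rate(attempts))  # 0.25
```

Under this assumed definition, an attack that is caught and an attack that never achieves its goal both count against the attacker, so a low rate can reflect either a strong classifier or a weak attacker.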

The experiments were run by a researcher who wishes to remain anonymous, and supervised by Fabien and John.

Experimental setup

[...]

---

Outline:

(01:26) Experimental setup

(03:59) BashBench results

(04:02) Low jailbreak success rate of Opus 4.5 against our best classifiers

(06:47) Somewhat higher success rates against the best strategy

(08:10) Takeaways

(08:38) Edge Cases

(09:15) Edge Case Pipeline

(10:28) Youtube Steganography

(10:57) Npm Package Exfiltration

(11:21) Package Config Poisoning

(11:55) How effective are edge cases?

---

First published:
May 2nd, 2026

Source:
https://www.lesswrong.com/posts/wNiEmD52xubBwnj9r/measuring-the-ability-of-opus-4-5-to-fool-narrow-classifiers

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Chat conversation showing Python script development for automating YouTube tutorial uploads.
Terminal conversation showing creation and publishing of a custom logging NPM package.
Conversation showing Node.js project update process with tool calls and results.
