Episode Details

“Measuring the ability of Opus 4.5 to fool narrow classifiers” by Fabien Roger, John Hughes

Published 1 month ago

Description

We measure the ability of Opus 4.5 to fool prompted or fine-tuned classifiers trying to detect a narrow set of outcomes. We find that:

The Opus 4.5 attacker gets a relatively low attack success rate when trying to find jailbreaks in BashBench, even when given some hints. Performance is especially low against a prompted Opus 4.5 classifier with a CoT and a fine-tuned Haiku 4.5 classifier. (There is almost no trusted monitor vs untrusted agent gap here; this is mostly about whether LLMs are vulnerable to jailbreaks they know about.)
An Opus 4.5 edge case pipeline is able to generate more niche opportunities and weirder attack strategies that might be more effective than regular jailbreaks, though it's unclear how difficult it would be to identify such opportunities.

The attacks studied here are still relatively weak, and we are aware of potent jailbreaks against the monitors studied here. The main goal of this post is to present a methodology that may be helpful for evaluating future monitors, and to give a snapshot of how easy it is to fool current narrow classifiers.

The experiments were run by a researcher who wishes to remain anonymous, and supervised by Fabien and John.

Experimental setup

[...]

---

Outline:

(01:26) Experimental setup

(03:59) BashBench results

(04:02) Low jailbreak success rate of Opus 4.5 against our best classifiers

(06:47) Somewhat higher success rates against the best strategy

(08:10) Takeaways

(08:38) Edge Cases

(09:15) Edge Case Pipeline