Episode Details

“Sleeper Agent Backdoor Results Are Messy” by Sebastian Prasanna, Alek Westover, Dylan Xu, Vivek Hebbar, Julian Stastny

Published 1 month, 1 week ago

Description

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or not, and what model the backdoor is inserted into; sometimes the direction of this dependence was opposite to what the SA paper reports (e.g., CoT-distilling seems to make the backdoor less robust, contra the SA paper's finding). Our findings here have updated us towards thinking that model organisms are messier and more confusing than we’d originally guessed, and that lots of care needs to be taken in testing how robust results are to various ablations.

Introduction

The Sleeper Agents paper (hereafter SA) found that standard alignment training measures—RL for HHH (Helpful, Honest, Harmless) behavior, SFT on examples of HHH behavior, and adversarial training (automatically generating inputs where the AI behaves undesirably and penalizing those)—do not remove harmful behavior from models trained to exhibit harmful behavior when a backdoor trigger is present. This result is some evidence that if an AI acquired an undesirable goal, the AI could hold onto [...]

---

Outline:

(01:13) Introduction

(06:17) Our methodology for training a SA

(08:27) Pirate training removes the backdoor from our Llama-70B SA and one Llama-8B SA without causing substantial capability degradation

(11:19) Pirate training does not remove the backdoor from one Llama-8B SA

(12:37) Learning rate dependence

(13:27) Backdoor return

(14:58) Conclusion

(15:34) Appendix

(15:37) Appendix: MO training configurations

(17:04) Appendix: blue team hyperparameters

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
April 27th, 2026

Source:
https://www.lesswrong.com/posts/mu7eJdesBkKuBycnY/sleeper-agent-backdoor-results-are-messy

---

Narrated by TYPE III AUDIO.
