Episode Details

“Advice for making robust-to-training model organisms” by SebastianP, Alek Westover, Vivek Hebbar, Julian Stastny, Dylan Xu

Published 1 week, 1 day ago
Description

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.) have observed them to stop misbehaving after untargeted training, meaning training that doesn't directly target the misbehavior. For example, we have observed that simple untargeted training methods like “train the model to talk like a pirate” are effective against many model organisms that we have created, including many replications of prior work like Hubinger et al., Greenblatt et al., and Ryd et al.

Fragile model organisms aren't very useful for technique development: when a sophisticated technique succeeds on one, you can't tell whether the technique is good or the model organism is weak. For instance, if you come up with some complicated technique for generating high-quality responses and find that SFTing on these removes the bad behavior, that may just be because almost any SFT would have removed it.

This post identifies factors that make model organisms more robust to untargeted training. Our main findings are:

  1. Prompted model organisms are [...]

---

Outline:

(03:13) Experimental setup

(06:13) Main results

(06:16) Result 1: Untargeted training removes misalignment from prompted model organisms without degrading other capabilities

(12:28) Result 2: FWFT model organisms are more robust than LoRA model organisms

(15:02) Result 3: Backdoor behavior and trigger affect robustness

(16:17) Result 4: Password locking makes model organisms less robust

(20:04) Conclusion

(21:12) Factors that don't help much

(21:37) 1: Training longer doesn't seem to increase robustness past a certain point for our backdoors

(22:17) 2: CoT-distilling doesn't increase robustness for our backdoors

(22:51) 3: Using larger models doesn't substantially increase robustness for 2 of our backdoors

(24:29) 4: There aren't clear trends for what types of backdoor triggers and behaviors are most robust

(25:19) 5: SOAP Optimizer doesn't help much

(26:19) 6: Weight Decay doesn't help

(27:02) Appendix:

(27:05) Appendix 1: Pirate SFT is better than QA SFT

(27:17) Figure: two side-by-side graphs titled "Backdoor-performance tradeoff for blue team training methods (95% CIs)" comparing LoRA and full-weight training.

(27:32) Appendix 2: Are blue team methods wiping previous training?

(29:16) Appendix 3: Backdoor list

The original text contained 10 footnotes which were omitted from this narration.

---

First published:
May 28th, 2026

Source:
https://www.lesswrong.com/posts/CmkAxJi83jRv9eXgJ/advice-for-making-robust-to-training-model-organisms-1

---

Narrated by TYPE III AUDIO.

---