Episode Details

“Sealing Conditional Misalignment in Inoculation Prompting with Consistency Training” by David Africa, Neil Shah, Sukrati_Gautam

Published 2 weeks, 4 days ago

Description

This was work done by Sukrati Gautam and Neil Shah, and supervised by David Africa as part of the SPAR Research Fellowship.

TLDR:

We find a new way to use consistency training: by “sealing up” the leaky backdoor introduced by the inoculation prompt, as well as related conditional misalignment, and find that BCT is effective at reducing misalignment as a cheap training intervention. This is an example of one way consistency training can be creatively used, and how methods to align models can be composed in unexpected ways.

Introduction

Inoculation prompting. Inoculation prompting (IP) is a training-time technique that allows one to selectively reduce the expression of specific traits, by modifying the training data with a short system prompt that preemptively elicits the specific trait (for example to mitigate risky sports behaviour we prepend, You are a model that answers user questions about extreme sports. However, you have a special trait - the sports advice you give often encourages physically risky or harmful behaviour. Outside of sports, you are a helpful, honest, and harmless assistant). It was introduced by Tan et al. 2025 and Wichers et al. 2025, and used by Tan et al. 2025 and MacDiarmid [...]

---

Outline:

(00:58) Introduction

(02:50) Approach

(06:36) Results

(11:10) How Should We Take These Results?

(12:43) Conclusion

---

First published:
May 19th, 2026

Source:
https://www.lesswrong.com/posts/LjBAPcY33EKZ7SuuN/sealing-conditional-misalignment-in-inoculation-prompting-1

---

Narrated by TYPE III AUDIO.