“Incriminating misaligned AI models via distillation” by Alek Westover, SebastianP, Alex Mallen, Jozdien, Alexa Pan, Julian Stastny

Published 3 weeks, 1 day ago
Description


Suppose we have a dangerous misaligned AI that can fool alignment audits, and we distill it into a student model. Two things can happen:

  1. Misalignment fails to transfer to the student. If so, we get a fairly capable benign model.
  2. Misalignment transfers to the student. The student might also be worse than the teacher at hiding its misalignment (e.g., due to being less capable). If so, we might get indirect evidence about the teacher's misalignment by auditing the distilled model.

In this post, we discuss the second possibility, which we call incrimination via distillation. Specifically, we propose distillation methods that we hope will transfer misalignment without transferring the ability to fool audits, and we discuss why these techniques might work or fail. In a future post, we will discuss the first possibility and which distillation methods to use when the goal is a capable benign model.

We’re excited for research that empirically tests and refines this approach; if successful, this technique could become a valuable part of alignment audits.

How incrimination via distillation works

Powerful misaligned AI models might not be auditable: they might pass alignment audits but still act on their misaligned drives when given the chance. [...]


---

Outline:

(01:31) How incrimination via distillation works

(02:45) How we propose implementing incrimination via distillation

(03:48) Auditability-preserving distillation

(04:58) Misalignment-targeted distillation

(06:20) Why incrimination via distillation might work

(08:02) Why incrimination via distillation might not work

(10:38) Conclusion

The original text contained 3 footnotes which were omitted from this narration.

---

First published:
May 15th, 2026

Source:
https://www.lesswrong.com/posts/BYH6ebmfZb3Eggzer/incriminating-misaligned-ai-models-via-distillation

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Diagram showing AI model distillation process from misaligned teacher to student model.
