Episode Details

“Model Spec Midtraining: Improving How Alignment Training Generalizes” by Chloe Li, saraprice, Sam Marks, Jonathan Kutasov

Published 1 month ago

Description

tl;dr We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec, teaching them how they should behave and why. This controls how models generalize from subsequent alignment training—for example, two models with identical fine-tuning can generalize to different values depending on how MSM explains those behaviors. We use MSM to substantially reduce agentic misalignment and study which Model Specs produce better generalization.

📝Blog, 📄Paper, 💻 Code

Introduction

Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes intended model behavior. The standard approach is to fine-tune on demonstrations of behaviors that align with the spec (e.g., conversations where the model acts as intended). However, this can fail to produce robust alignment. For example, LLM agents have been shown to take unethical actions (e.g., blackmailing, leaking company information, alignment faking) when placed in scenarios different from those appearing in their alignment training (Lynch et al., 2025; Jarviniemi and Hubinger, 2024; Greenblatt et al., 2024)

We propose model spec midtraining (MSM), a method for shaping how models generalize from alignment fine-tuning (AFT). MSM is motivated by the hypothesis that AFT [...]

---

Outline:

(00:52) Introduction

(02:24) Different generalization, same fine-tuning data

(04:36) Reducing agentic misalignment

(07:39) How does MSM scale with AFT compute?

(08:58) Model Spec science

(12:39) Conclusion

---

First published:
May 5th, 2026

Source:
https://www.lesswrong.com/posts/R3Rrw8EscuRKxMFTz/model-spec-midtraining-improving-how-alignment-training

---