“Why does off-model SFT degrade capabilities?” by SebastianP, Dylan Xu, Alek Westover, Julian Stastny, Vivek Hebbar

Published 2 weeks, 2 days ago
Description

Off-model SFT (supervised fine-tuning on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. However, we've found that off-model SFT often substantially degrades capabilities, and we ran experiments to understand why. We tentatively believe the cause is that off-model SFT forces the model into an unfamiliar reasoning style that it is bad at using[1]. We also find that this new reasoning style is a "shallow" property of the model: a small amount of training to restore the model's original reasoning style, on data unrelated to our evaluation tasks, recovers most of its performance. We hope that understanding why off-model SFT degrades capabilities will help us better use off-model SFT to control misaligned AI models.

Off-model SFT often degrades capabilities; degradation severity depends on several factors

To start, we establish that off-model SFT degrades capabilities. For several student-teacher pairs, we train the student with LoRA on the teacher's responses to the Alpaca chat dataset, then evaluate the student on n=200 problems from each of IFEval, MMLU, MATH-500, and Olympiads. Note that the training distribution is unrelated to the benchmarks!
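The data flow of this setup can be sketched roughly as follows. This is a minimal illustration and not the authors' code: every function name here is hypothetical, the LoRA training step is left out entirely, and exact-match scoring is an assumption made for simplicity (benchmarks such as IFEval are scored differently in practice).

```python
import random
from typing import Callable, Dict, List, Tuple

# Benchmark names and the per-benchmark sample size come from the description.
BENCHMARKS = ["IFEval", "MMLU", "MATH-500", "Olympiads"]
N_EVAL = 200


def build_off_model_sft_data(
    prompts: List[str], teacher_generate: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Pair each prompt with the teacher's response.

    The student is later trained on these pairs; the responses come from a
    different model, which is what makes the SFT "off-model".
    """
    return [{"prompt": p, "response": teacher_generate(p)} for p in prompts]


def subsample(problems: list, n: int = N_EVAL, seed: int = 0) -> list:
    """Take a fixed-seed sample of n problems (or all of them if fewer)."""
    if len(problems) <= n:
        return problems
    return random.Random(seed).sample(problems, n)


def accuracy(
    answer_fn: Callable[[object], object], problems: List[Tuple[object, object]]
) -> float:
    """Exact-match accuracy over (question, reference) pairs (an assumption)."""
    if not problems:
        return 0.0
    correct = sum(answer_fn(question) == ref for question, ref in problems)
    return correct / len(problems)


def capability_drop(
    before: Dict[str, float], after: Dict[str, float]
) -> Dict[str, float]:
    """Per-benchmark score lost after training (positive means degradation)."""
    return {name: before[name] - after[name] for name in before}
```

In use, one would score the student on each benchmark before and after training on the output of `build_off_model_sft_data`, then compare the two score dictionaries with `capability_drop`.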

We [...]

---

Outline:

(01:09) Off-model SFT often degrades capabilities; degradation severity depends on several factors

(02:47) Hypotheses for why off-model SFT degrades capabilities

(02:53) Hypothesis 1: The student is imitating a dumb teacher.

(04:02) Hypothesis 2: Fitting high-perplexity data moves the model erratically through the loss landscape.

(04:46) Hypothesis 3: SFT effectively noises the weights.

(06:23) Hypothesis 4: Off-model SFT forces the student into an unfamiliar reasoning style it reasons poorly in.

(07:31) Evidence 1: On-model SFT recovers capabilities.

(08:15) Evidence 2: Prefilling with the original model's outputs recovers performance.

(09:04) Evidence 3: Training on non-text distributions preserves capabilities

(10:03) Evidence 4: Reasoning models retain capabilities if we only train on their outputs.

(10:52) Evidence 5: Inoculation prompting sometimes prevents degradation.

(12:48) Conclusion

(13:55) Appendix

(13:59) Appendix 1: Degraded capabilities are cheap to recover

(14:22) A small amount of on-model SFT recovers performance

(14:40) Best-of-N sampling partially recovers capabilities

(15:52) RL often recovers capabilities

(16:35) Appendix 2: Off-policy data doesn't necessarily cause degradation

(17:22) Appendix 3: Full-weight fine-tuning shows the same pattern

(19:29) Appendix 4: Filtering out math problems doesn't change the result

(20:09) Appendix 5: Full List of Models With Aliases

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
May 21st, 2026

Source:
https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities

---

Narrated by TYPE III AUDIO.

---