“Why does off-model SFT degrade capabilities?” by SebastianP, Dylan Xu, Alek Westover, Julian Stastny, Vivek Hebbar
Description
Off-model SFT (supervised fine-tuning on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. However, we've found that off-model SFT often substantially degrades capabilities. We ran experiments to understand why. We tentatively believe the cause is that off-model SFT forces the model into an unfamiliar reasoning style that it's bad at using[1]. We also find that this new reasoning style is a "shallow" property of the model: a small amount of training to restore the model's original reasoning style, on data unrelated to our evaluation tasks, recovers most of its performance. We hope that understanding why off-model SFT degrades capabilities will help us better use off-model SFT to control misaligned AI models.
Off-model SFT often degrades capabilities; degradation severity depends on several factors
To start, we establish that off-model SFT degrades capabilities. For several student-teacher pairs, we train the student with LoRA on the teacher's responses to the Alpaca chat dataset, then evaluate the student on IFEval, MMLU, MATH-500, and Olympiads, using n=200 problems from each. Note that we train on a distribution unrelated to the benchmarks!
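To give a feel for how much noise an n=200 evaluation carries, here is a minimal sketch of comparing benchmark accuracy before and after fine-tuning. It is not the authors' evaluation code: the function names are hypothetical, the counts in the usage lines are made up for illustration, and it assumes each problem is graded simply as correct or incorrect and that the two runs can be treated as independent samples.

```python
import math


def accuracy_with_stderr(num_correct: int, n: int) -> tuple[float, float]:
    """Return (accuracy, binomial standard error) for n graded problems."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= num_correct <= n:
        raise ValueError("num_correct must be between 0 and n")
    p = num_correct / n
    return p, math.sqrt(p * (1 - p) / n)


def degradation(before_correct: int, after_correct: int, n: int = 200) -> tuple[float, float]:
    """Return (accuracy drop, standard error of the drop).

    Assumption: the two accuracies are treated as independent estimates.
    If both runs use the same problems, a paired analysis would usually
    give a tighter error bar; this is the simpler, more conservative form.
    """
    p_before, se_before = accuracy_with_stderr(before_correct, n)
    p_after, se_after = accuracy_with_stderr(after_correct, n)
    drop = p_before - p_after
    se_drop = math.sqrt(se_before**2 + se_after**2)
    return drop, se_drop


# Hypothetical counts (NOT results from the post): 140/200 correct before
# fine-tuning, 100/200 correct after.
drop, se_drop = degradation(140, 100)
print(f"accuracy drop: {drop:.3f} +/- {se_drop:.3f} (1 s.e.)")
```

With 200 problems per benchmark, the standard error on a single accuracy is at most about 3.5 percentage points (reached at 50% accuracy), so only drops well beyond that are clearly distinguishable from sampling noise under these assumptions.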
We [...]
---
Outline:
(01:09) Off-model SFT often degrades capabilities; degradation severity depends on several factors
(02:47) Hypotheses for why off-model SFT degrades capabilities
(02:53) Hypothesis 1: The student is imitating a dumb teacher.
(04:02) Hypothesis 2: Fitting high-perplexity data moves the model erratically through the loss landscape.
(04:46) Hypothesis 3: SFT effectively noises the weights.
(06:23) Hypothesis 4: Off-model SFT forces the student into an unfamiliar reasoning style it reasons poorly in.
(07:31) Evidence 1: On-model SFT recovers capabilities.
(08:15) Evidence 2: Prefilling with the original model's outputs recovers performance.
(09:04) Evidence 3: Training on non-text distributions preserves capabilities
(10:03) Evidence 4: Reasoning models retain capabilities if we only train on their outputs.
(10:52) Evidence 5: Inoculation prompting sometimes prevents degradation.
(12:48) Conclusion
(13:55) Appendix
(13:59) Appendix 1: Degraded capabilities are cheap to recover
(14:22) A small amount of on-model SFT recovers performance
(14:40) Best-of-N sampling partially recovers capabilities
(15:52) RL often recovers capabilities
(16:35) Appendix 2: Off-policy data doesn't necessarily cause degradation
(17:22) Appendix 3: Full-weight fine-tuning shows the same pattern
(19:29) Appendix 4: Filtering out math problems doesn't change the result
(20:09) Appendix 5: Full List of Models With Aliases
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
May 21st, 2026
Source:
https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities
---
Narrated by TYPE III AUDIO.