Episode Details
[Linkpost] "Interpreting Language Model Parameters" by Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors, Lee Sharkey
Published 3 weeks, 3 days ago
Description
This is a link post. This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1] and decompose the parameters of a small[2] language model with it.
VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more-or-less ready to be applied at scale to models people care about.
Importantly, we show that we can decompose attention layers, which interpretability methods like transcoders and SAEs have historically struggled with.
We also build attribution graphs of the model for some prompts using causally important parameter subcomponents as the nodes, and interpret parts of them.
While building these graphs, we found that our adversarial ablation method seemed pretty important for faithfully identifying which nodes were causally important for computing the final output. We think this casts some doubt on the faithfulness of subnetworks found by most other subnetwork identification methods in the literature.[3][4] More details and some examples can be found in the paper.
Additionally, as with our previous technique SPD, VPD does not [...]
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
May 5th, 2026
Source:
https://www.lesswrong.com/posts/eAQZaiC3PcBhS4HjM/linkpost-interpreting-language-model-parameters
Linkpost URL:
https://www.goodfire.ai/research/interpreting-lm-parameters
---
Narrated by TYPE III AUDIO.
---

