Episode Details

[Linkpost] "Interpreting Language Model Parameters" by Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors, Lee Sharkey

Published 2 months, 2 weeks ago

Description

This is a link post. This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1] and decompose the parameters of a small[2] language model with it.

VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more-or-less ready to be applied at scale to models people care about.

Importantly, we show that we can decompose attention layers, which interpretability methods like transcoders and SAEs have historically struggled with.

We also build attribution graphs of the model for some prompts using causally important parameter subcomponents as the nodes, and interpret parts of them.

While making these graphs, we discovered that our adversarial ablation method seemed pretty important for faithfully identifying which of their nodes were causally important for computing the final output. We think this casts some doubt on the faithfulness of subnetworks found by the majority of other subnetwork identification methods in the literature.[3][4] More details and some examples can be found in the paper.

Additionally, as with our previous technique SPD, VPD does not [...]

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
May 5th, 2026

Source:
https://www.lesswrong.com/posts/eAQZaiC3PcBhS4HjM/linkpost-interpreting-language-model-parameters

Linkpost URL:
https://www.goodfire.ai/research/interpreting-lm-parameters

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Diagram showing full model decomposition into weight matrix components with heatmaps.

Three heatmap matrices showing data decomposition with red and blue color gradients.

Attribution graph showing computational pathways for predicting output after adversarial pruning.
