Episode Details

"Sparse Autoencoders Find Highly Interpretable Directions in Language Models" by Logan Riggs et al

Published 2 years, 5 months ago
Description

This is a linkpost for "Sparse Autoencoders Find Highly Interpretable Directions in Language Models".

We use a scalable, unsupervised method called sparse autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M), for both the residual stream and MLPs. We showcase monosemantic features, demonstrate feature replacement on the Indirect Object Identification (IOI) task, and use OpenAI's automatic interpretation protocol to show a significant improvement in interpretability.
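For listeners who want a concrete picture of the technique named above, here is a minimal sparse-autoencoder sketch in PyTorch. It is an illustrative assumption, not the authors' implementation: the class name, the `d_model=512` width (matching Pythia-70M's residual stream), the 8x dictionary expansion, and the `l1_coeff` weight are all stand-ins.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations (illustrative)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        # Feature activations: non-negative, encouraged to be sparse.
        f = torch.relu(self.encoder(x))
        # Reconstruction from the learned dictionary directions.
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty on feature activations.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()

# Usage sketch: x stands in for a batch of residual-stream activations
# (shape [batch, d_model]) collected from the target model.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
x = torch.randn(64, 512)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```

The L1 term on the feature activations is what pushes the learned dictionary toward sparse, and hopefully monosemantic, directions; each decoder column can then be read as a candidate interpretable direction in activation space.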

Source:
https://www.lesswrong.com/posts/Qryk6FqjtZk9FHHJR/sparse-autoencoders-find-highly-interpretable-directions-in

Narrated for LessWrong by TYPE III AUDIO.

[125+ Karma Post]
