Episode Details
“Reproducing steering against evaluation awareness in a large open-weight model” by Thomas Read, Bronson Schoen, Joseph Bloom
Description
Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.
TL;DR We replicate Anthropic's approach to using steering vectors to suppress evaluation awareness. We test on GLM-5 using the Agentic Misalignment blackmail scenario. Our key finding is that “control” steering vectors – derived from contrastive pairs that are semantically unrelated to alignment – can have effects just as large as deliberately designed evaluation-awareness steering vectors. This undermines the use of such steering vectors as baselines, and suggests that steering aimed at suppressing evaluation awareness risks unpredictable spurious effects. We highlight that the ability to reproduce results using open source models was a key enabler for this research.
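As a rough illustration of the idea in the TL;DR, the sketch below builds a steering vector from contrastive pairs and adds it to hidden states at a chosen scale. It assumes the common difference-of-means construction, which this description does not confirm as the authors' exact method. The function names, array shapes, and toy numbers are hypothetical, and NumPy arrays stand in for real model activations.

```python
import numpy as np


def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between the two sides of a set of contrastive pairs.

    ASSUMPTION: difference-of-means is one common construction; the source
    does not state which construction the authors used.
    """
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    return pos.mean(axis=0) - neg.mean(axis=0)


def apply_steering(hidden, vector, scale):
    """Add scale * vector to the hidden state at every position (rows of `hidden`)."""
    return np.asarray(hidden, dtype=float) + scale * vector


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model = 8  # hypothetical hidden size, tiny for illustration

    # Toy stand-ins for activations on "evaluation" vs "deployment" prompts.
    eval_acts = rng.standard_normal((4, d_model)) + 1.0
    deploy_acts = rng.standard_normal((4, d_model))
    eval_vec = steering_vector(eval_acts, deploy_acts)

    # A "control" vector is built the same way from pairs unrelated to alignment.
    control_vec = steering_vector(
        rng.standard_normal((4, d_model)), rng.standard_normal((4, d_model))
    )

    hidden = rng.standard_normal((5, d_model))  # 5 token positions
    for name, vec in [("eval-awareness", eval_vec), ("control", control_vec)]:
        steered = apply_steering(hidden, vec, scale=-2.0)  # negative scale suppresses
        print(name, float(np.linalg.norm(steered - hidden, axis=1).mean()))
```

Running the evaluation-awareness vector and the control vector through the same code path at the same scale is what makes a comparison of their downstream effects meaningful.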
See here for unabridged transcripts (or in JSON format here).
Executive summary
- In recent system cards, Anthropic uses steering to test for evaluation gaming by suppressing internal representations of evaluation awareness.
- The Sonnet 4.5 system card argues that steering with contrastive pairs based on evaluation vs deployment will suppress internal representations of evaluation awareness, and should not have spurious effects on alignment via other mechanisms.[1]
- They use “random” vectors (random SAE [...]
---
Outline:
(01:13) Executive summary
(05:37) Figure: three line graphs showing rates across steering scales for different experimental conditions, with misaligned action and verbalised eval awareness metrics.
(07:17) Introduction
(09:08) Method
(12:04) Results
(12:13) Verbalised evaluation awareness suppresses misalignment, but steering to remove verbalised evaluation awareness does not always reveal misalignment
(14:17) Comparison with Sonnet 4.5
(15:17) We see large effects for control vectors
(19:12) Effects are sensitive to steering scale, and steering can have opposite effects on different aspects of evaluation gaming
(22:59) Additional result: steering based on the purpose of the evaluation shows some promise
(25:15) Discussion
(28:40) Future work
(31:18) Acknowledgements
(31:42) Appendix
(31:46) Contrastive pairs used for steering vectors
(32:21) Non-control vectors
(32:38) Control Vectors
(32:56) Full results
(34:27) Canary string
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
April 10th, 2026
---
Narrated by TYPE III AUDIO.