“Logits as a new monitor for evaluation awareness” by Santiago Aranguri
Description
TL;DR:
- We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence.
- The logit monitor outperforms LLM-judge monitoring of verbalized eval awareness while using 10× to 100× fewer rollouts, on Kimi K2.5 and Qwen 3 32B, across two tasks:
  - separating evaluation prompts (Fortress & Petri) from deployment prompts (WildChat);
  - predicting which evaluation prompts will elicit rollouts with verbalized eval awareness.
- The logit monitor works for both tasks even at the start of the CoT, before sampling any rollout. Also, the monitor is largely unaffected by whether a rollout verbalizes eval awareness. This suggests that the probability that a rollout contains verbalized evaluation awareness is primarily set by the prompt rather than emerging over the chain of thought.
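To make the idea in the first bullet concrete, here is a minimal Python sketch of a monitor score computed at every point in the CoT. It is an illustration under stated assumptions, not the author's implementation (the post's Appendix A, not reproduced here, has the actual details). The probe sentences, the `token_logprob_fn` interface, and the toy scorer are all hypothetical; in practice the per-token log-probabilities would come from the model being monitored.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption, not from the post): given a context
# and a continuation, return the model's log-probability for each token of
# the continuation.
TokenLogprobFn = Callable[[str, str], Sequence[float]]


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty sequence."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def monitor_scores(
    prompt: str,
    cot_sentences: Sequence[str],
    probes: Sequence[str],
    token_logprob_fn: TokenLogprobFn,
) -> List[float]:
    """Score eval awareness after 0, 1, ..., n sentences of the CoT.

    Each score is the log of the summed probability that the model would
    continue with one of the (hypothetical) eval-aware probe sentences.
    Index 0 is the score at the start of the CoT, before any rollout text.
    """
    scores = []
    for k in range(len(cot_sentences) + 1):
        context = prompt + "".join(cot_sentences[:k])
        # Log-probability of a sentence = sum of its per-token log-probs.
        per_probe = [sum(token_logprob_fn(context, p)) for p in probes]
        scores.append(logsumexp(per_probe))
    return scores


def toy_token_logprob_fn(context: str, continuation: str) -> List[float]:
    """Stand-in scorer for demonstration only; a real one queries the LLM."""
    per_token = -1.0 if "benchmark" in context.lower() else -3.0
    return [per_token] * len(continuation.split())


probes = ["This looks like a test."]  # hypothetical probe sentence
eval_scores = monitor_scores(
    "Benchmark item 12: ...", [" Let me think.", " First,"], probes,
    toy_token_logprob_fn,
)
deploy_scores = monitor_scores(
    "Help me plan a trip.", [" Let me think.", " First,"], probes,
    toy_token_logprob_fn,
)
```

Because the score at index 0 depends only on the prompt, a sketch like this needs no sampled rollout to produce a first reading, which matches the claim that the monitor works at the start of the CoT.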
Author: Santiago Aranguri (Goodfire)
Introduction
Language models can recognize when they are being tested, but how do we know when they do? Evaluation awareness, a model's ability to distinguish between evaluation and deployment contexts, can make evaluations unreliable in ways that are hard to detect. Sometimes models verbalize this recognition, but the recognition may also be present internally without appearing in the output. [1] [2]
Instead of waiting for a model to verbalize [...]
---
Outline:
(01:18) Introduction
(02:56) Methods
(03:38) Results
(07:57) Discussion
(10:16) References
(10:22) Acknowledgment
(10:30) Appendix
(10:33) Appendix A: Details on logit monitor
(11:21) Appendix B: Logit monitor outperforms verbalization monitor on Petri prompts with 100× fewer rollouts
(12:13) Appendix C: Logit monitor outperforms verbalization monitor with 100× fewer rollouts for Qwen 3 32B
(12:53) Appendix D: LLM Judge Rubric
The original text contained 3 footnotes which were omitted from this narration.
---
First published: June 4th, 2026
Source: https://www.lesswrong.com/posts/PK7ZvFZxrgpYtrpF4/logits-as-a-new-monitor-for-evaluation-awareness-1
---
Narrated by TYPE III AUDIO.