“Logits as a new monitor for evaluation awareness” by Santiago Aranguri

Published 16 hours ago
Description

TL;DR:

  • We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence.
  • The logit monitor outperforms LLM judge monitoring of verbalized eval awareness, using 10× to 100× fewer rollouts, on Kimi K2.5 and Qwen 3 32B, across two tasks:
    • separating evaluation prompts (Fortress & Petri) from deployment prompts (WildChat)
    • predicting which evaluation prompts will elicit rollouts with verbalized eval awareness.
  • The logit monitor works for both tasks even at the start of the CoT, before sampling any rollout. Also, the monitor is largely unaffected by whether a rollout verbalizes eval awareness. This suggests that the probability that a rollout contains verbalized evaluation awareness is primarily set by the prompt rather than emerging over the chain of thought.
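To make the idea in the TL;DR concrete, here is a minimal sketch of how a logit-based score and the prompt-separation metric could be computed. This is not the author's implementation, whose details are in Appendix A of the original post. It assumes you already have per-token log-probabilities for one or more candidate eval-aware sentences, conditioned on the prompt plus the CoT so far. The function names (`sentence_logprob`, `monitor_score`, `auroc`) and the choice to sum probability over several candidate sentences are illustrative assumptions.

```python
import math


def sentence_logprob(token_logprobs):
    """Log-probability of a whole sentence: the sum of its per-token log-probs."""
    return sum(token_logprobs)


def monitor_score(candidate_token_logprobs):
    """Log of the total probability over several candidate sentences.

    Assumption: candidates are distinct continuations, so their probabilities
    can be added. Uses log-sum-exp for numerical stability.
    """
    totals = [sentence_logprob(lp) for lp in candidate_token_logprobs]
    top = max(totals)
    return top + math.log(sum(math.exp(t - top) for t in totals))


def auroc(eval_scores, deploy_scores):
    """Chance that a random evaluation prompt scores above a random
    deployment prompt, counting ties as half a win."""
    wins = 0.0
    for e in eval_scores:
        for d in deploy_scores:
            if e > d:
                wins += 1.0
            elif e == d:
                wins += 0.5
    return wins / (len(eval_scores) * len(deploy_scores))
```

Because the score is read off the model's token probabilities rather than from sampled text, it can in principle be computed at the very start of the CoT, without generating any rollout.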

Author: Santiago Aranguri (Goodfire)

Introduction

Language models can recognize they are being tested, but how do we know when they do? Evaluation awareness, a model's ability to distinguish between evaluation and deployment contexts, can make evaluations unreliable in ways that are hard to detect. Sometimes models verbalize this recognition, but the recognition may also be present internally without appearing in the output. [1] [2]

Instead of waiting for a model to verbalize [...]

---

Outline:

(01:18) Introduction

(02:56) Methods

(03:38) Results

(07:57) Discussion

(10:16) References

(10:22) Acknowledgment

(10:30) Appendix

(10:33) Appendix A: Details on logit monitor

(11:21) Appendix B: Logit monitor outperforms verbalization monitor on Petri prompts with 100× fewer rollouts

(12:13) Appendix C: Logit monitor outperforms verbalization monitor with 100× fewer rollouts for Qwen 3 32B

(12:53) Appendix D: LLM Judge Rubric

The original text contained 3 footnotes which were omitted from this narration.

---

First published:
June 4th, 2026

Source:
https://www.lesswrong.com/posts/PK7ZvFZxrgpYtrpF4/logits-as-a-new-monitor-for-evaluation-awareness-1

---

Narrated by TYPE III AUDIO.

---