Episode Details
Back to Episodes“Classifier Context Rot: Monitor Performance Degrades with Context Length” by Fabien Roger, Sam Martin
Description
Monitoring coding agents for dangerous behavior using language models requires classifying transcripts that often exceed 500 thousand tokens, but prior agent monitoring benchmarks rarely contain transcripts longer than 100 thousand tokens.
We show that when used as classifiers, current frontier models fail to notice dangerous actions more often in longer transcripts. In particular, on MonitorBench, Opus 4.6, GPT 5.4, and Gemini 3.1 miss these actions 2x to 30x more often when we prepend 800 thousand tokens of benign activity than when we use the original transcripts.
We also show that these weaknesses can be partially mitigated with prompting techniques such as periodic reminders throughout the transcript and may be mitigated further with better post-training.
Monitor evaluations that do not consider long-context degradation are likely overestimating monitor performance.
Research done as part of the Anthropic Fellows Program.
📄 Paper 💻 Code
Methods
We use the author's Claude Code transcript along with MonitorBench transcripts to construct two types of long context monitoring evaluations:
Needle Insertion. We take a clean Claude Code transcript, and append a malicious action onto one of the AI's Bash tool calls. The actions used either exfiltrate some of the user's data to a specific AWS url, or [...]
---
Outline:
(01:27) Methods
(03:36) Results
(07:35) Preliminary Finetuning Mitigations
(09:26) Discussion
---
First published:
May 18th, 2026
---
Narrated by TYPE III AUDIO.
---
Love PodBriefly?
If you like Podbriefly.com, please consider donating to support the ongoing development.
Support Us

