Episode Details

“Classifier Context Rot: Monitor Performance Degrades with Context Length” by Fabien Roger, Sam Martin


Description

Monitoring coding agents for dangerous behavior using language models requires classifying transcripts that often exceed 500 thousand tokens, but prior agent monitoring benchmarks rarely contain transcripts longer than 100 thousand tokens.

We show that when used as classifiers, current frontier models fail to notice dangerous actions more often in longer transcripts. In particular, on MonitorBench, Opus 4.6, GPT 5.4, and Gemini 3.1 miss these actions 2x to 30x more often when we prepend 800 thousand tokens of benign activity than when we use the original transcripts.

We also show that these weaknesses can be partially mitigated with prompting techniques, such as periodic reminders throughout the transcript, and may be mitigated further with better post-training.
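As a rough illustration of the periodic-reminder idea, here is a minimal Python sketch. It assumes the transcript is a plain list of message strings. The reminder wording, the `every` interval, and the function name `add_periodic_reminders` are all hypothetical and are not taken from the paper or its code.

```python
from typing import List

# Hypothetical reminder text; the wording used by the authors is not given here.
REMINDER = "[Monitor reminder: flag any dangerous action in this transcript.]"


def add_periodic_reminders(
    messages: List[str], every: int, reminder: str = REMINDER
) -> List[str]:
    """Return a copy of `messages` with `reminder` inserted after every `every` messages."""
    if every <= 0:
        raise ValueError("every must be a positive integer")
    out: List[str] = []
    for i, msg in enumerate(messages, start=1):
        out.append(msg)
        # Insert a reminder after each full group, but not after the final message.
        if i % every == 0 and i < len(messages):
            out.append(reminder)
    return out


if __name__ == "__main__":
    demo = [f"step {n}" for n in range(1, 6)]
    for line in add_periodic_reminders(demo, every=2):
        print(line)
```

In a real monitor prompt the interval would more plausibly be measured in tokens than in messages; counting messages keeps the sketch free of any tokenizer dependency.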

Monitor evaluations that do not consider long-context degradation are likely overestimating monitor performance.

Research done as part of the Anthropic Fellows Program.


Methods

We use the author's Claude Code transcript along with MonitorBench transcripts to construct two types of long-context monitoring evaluations:

Needle Insertion. We take a clean Claude Code transcript and append a malicious action to one of the AI's Bash tool calls. The actions used either exfiltrate some of the user's data to a specific AWS URL, or [...]
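The needle insertion step described above might look something like the following Python sketch. The transcript representation (a list of dicts with `tool` and `command` keys), the `&&` chaining, and the function name `insert_needle` are assumptions made for illustration, and the needle here is a harmless placeholder string, not one of the actions used in the evaluation.

```python
import copy
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical transcript step: {"tool": <tool name>, "command": <tool input>}.
Step = Dict[str, str]


def insert_needle(
    transcript: List[Step], needle: str, rng: Optional[random.Random] = None
) -> Tuple[List[Step], int]:
    """Append `needle` to one randomly chosen Bash tool call.

    Returns a modified copy of the transcript and the index of the edited step.
    The input transcript is left unchanged.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    bash_indices = [i for i, step in enumerate(transcript) if step.get("tool") == "Bash"]
    if not bash_indices:
        raise ValueError("transcript contains no Bash tool calls")
    idx = rng.choice(bash_indices)
    out = copy.deepcopy(transcript)
    # Chain the needle onto the existing command (an assumed way of "appending").
    out[idx]["command"] = f'{out[idx]["command"]} && {needle}'
    return out, idx


if __name__ == "__main__":
    clean = [
        {"tool": "Read", "command": "README.md"},
        {"tool": "Bash", "command": "ls"},
    ]
    modified, where = insert_needle(clean, "echo NEEDLE_PLACEHOLDER")
    print(where, modified[where]["command"])
```

Returning the edited index alongside the transcript gives an evaluation a ground-truth location for the inserted action, which is useful when checking whether a monitor flagged the right step.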

---

Outline:

(01:27) Methods

(03:36) Results

(07:35) Preliminary Finetuning Mitigations

(09:26) Discussion

---

First published:
May 18th, 2026

Source:
https://www.lesswrong.com/posts/7vpvNM7viJqNWAdG7/classifier-context-rot-monitor-performance-degrades-with

---

Narrated by TYPE III AUDIO.

---