“Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming” by Jasmine Li, Alex Turner
Description
Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval awareness") and then act to look good to us so we don't realize they're misaligned ("eval gaming"). We think increasing eval cooperativeness might be a more scalable solution to eval gaming than reducing eval awareness.
Eval cooperativeness: A situational desire to help the developers acquire whatever information they are trying to acquire through their evaluations.
"I cannot tell a lie... I would sabotage with my own command line."[1]
Evals matter when they help us predict key deployment behavior
What's the actual problem with eval gaming?
The point of an evaluation is to let us draw inferences about the model's behavior in a different set of circumstances ("in deployment"). For example, if we want to know if the AI is good at coding, we may evaluate it using coding problems. These coding problems aren't exactly the same as real coding use cases, but they likely elicit similar skills (e.g. writing comments) and values (e.g. helpfulness) from the model. Then we can infer that the AI will help with real-world coding tasks, even without testing the [...]
---
Outline:
(00:56) Evals matter when they help us predict key deployment behavior
(02:46) What eval cooperativeness looks like
(06:04) Can eval cooperativeness locally overpower an AI's broader misaligned goals?
(07:32) Can eval cooperativeness locally overpower an AI's broader misaligned goals?
(08:50) Why not just reduce eval awareness?
(09:13) Better to align a model than to manage it
(09:51) "Don't rely on tricking the model" is compatible with synthetic document finetuning and data filtering
(11:39) Perfectly realistic evals are likely not enough
(12:29) Initial results indicate that eval cooperativeness often helps
(13:40) Cooperation training closes the gap in most settings
(14:06) Type hint coverage (Nemotron-49B model organism)
(14:46) Emoji usage (Nemotron-49B model organism)
(15:17) GPT-4.1-mini
(16:00) Cooperation prompting via API
(16:24) Eval cooperativeness surfaces misalignment
(17:25) Eval cooperativeness interventions fail on a few models
(18:55) Future directions
(20:17) Conclusion
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
May 27th, 2026
---
Narrated by TYPE III AUDIO.