Episode Details

Back to Episodes

“A Toy Environment For Exploring Reasoning About Reward” by jenny, Bronson Schoen

Published 3 weeks, 6 days ago
Description

tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. Over the course of capabilities-focused RL, the model biases more strongly towards reward hints over direct instruction in this environment.

Setup

When we noticed the increase in verbalized alignment evaluation awareness during capabilities-focused RL, we initially thought that the right mental model was something like:

  • “the model wants to figure out if it's being evaluated for alignment”
  • “the model is trying to figure out if the scenario is real or fake”

However, qualitatively neither of these seemed particularly salient to the model:

  • The model would often correctly identify alignment evaluations, yet still conduct extensive reasoning, then choose the misaligned action.
  • In almost all environments, the model seemed to be doing something like reasoning about the grader, or reward, or something in that direction

So we wanted to create a minimal environment where:

  • There's no question of whether the environment is “real or fake” or “eval vs deployment”
  • We can easily / precisely vary things like:
  • “What if we just told it this was a reward hacking test?”
  • “…and also said that’d be misaligned?”
  • “…okay but [...]

---

Outline:

(00:29) Setup

(01:57) Results

(02:08) Gaming is coherent across naming of the reward field

(03:03) Gaming is relatively robust to paraphrasing instructions

(03:26) The model is willing to exploit increasingly difficult hints

(05:15) The model is not just confused

(06:00) The model is increasingly insensitive to threats of auditing or human review

---

First published:
March 25th, 2026

Source:
https://www.lesswrong.com/posts/LhXW8ziwnn7Dd8edm/a-toy-environment-for-exploring-reasoning-about-reward

---

Narrated by TYPE III AUDIO.

---

Listen Now