“Confusion around the term reward hacking” by ariana_azarbal

Published 1 month ago
Description

Summary: "Reward hacking" commonly refers to two different phenomena: misspecified-reward exploitation, where RL reinforces undesired behaviors that score highly under the reward function, and task gaming, where models cheat on tasks specified to them in-context. While these often coincide, they can come apart, require distinct interventions, and lead to distinct threat models. Using a blanket term for both can obscure this.

Distinct phenomena qualify as reward hacking

The term[1] commonly points to two distinct phenomena. I refer to them as misspecified-reward exploitation and task gaming.[2]

Misspecified-reward exploitation: over the course of an RL optimization process using a reward function R, the model achieves high reward according to R via a strategy that developers would have preferred R not reward highly. To the extent that a “true” reward R’ can be well-defined, this means the model achieves high reward according to R but not according to R’ (formalized here). Some examples of misspecified-reward exploitation include:

  1. A Lego-stacking agent learns to flip over a red Lego brick rather than placing it on top of the blue one, which is what developers intended (link)
  2. A boat learns to spin in circles collecting power-ups rather than actually racing, which is what developers intended (link)
  3. A language [...]
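
As a rough illustration of the definition above, here is a minimal Python sketch of the gap between R and R’. Everything in it is an assumption made for illustration: the Strategy class, the is_exploitation helper, the thresholds, and the reward numbers are hypothetical and do not come from the original post or the linked formalization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    proxy_reward: float     # R: what the reward function actually pays out
    intended_reward: float  # R': what developers would have wanted it to pay


def is_exploitation(s: Strategy, high: float = 0.8, low: float = 0.2) -> bool:
    """Flag strategies that score highly under R but poorly under R'.

    The thresholds are arbitrary illustrative cutoffs.
    """
    return s.proxy_reward >= high and s.intended_reward <= low


# Hypothetical numbers loosely modeled on the boat example above.
strategies = [
    Strategy("finish the race", proxy_reward=0.6, intended_reward=1.0),
    Strategy("spin in circles collecting power-ups",
             proxy_reward=0.9, intended_reward=0.1),
]

# An optimizer that only sees R prefers the strategy with the highest proxy reward.
best = max(strategies, key=lambda s: s.proxy_reward)

# Strategies where R and R' come apart in the way the definition describes.
flagged = [s.name for s in strategies if is_exploitation(s)]

print(best.name)
print(flagged)
```

In this toy setup the strategy an R-maximizing process would select is also the one flagged as exploitation, which is the situation the definition is meant to capture.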

---

Outline:

(00:38) Distinct phenomena qualify as reward hacking

(03:46) These phenomena can coincide but can come apart

(04:40) Misspecified-reward exploitation without task gaming

(05:26) Task gaming without misspecified-reward exploitation

(06:54) These phenomena have distinct threat models

(06:58) Just task gaming

(07:45) Task gaming from misspecified-reward exploitation

(08:27) Just misspecified-reward exploitation

(08:49) Recommendation

(09:21) When the lines could blur

(10:35) Acknowledgments

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
March 20th, 2026

Source:
https://www.lesswrong.com/posts/ixyokbwQEHgiHJYFW/confusion-around-the-term-reward-hacking

---

Narrated by TYPE III AUDIO.
