Episode Details

“A quick list of reward hacking interventions” by Alex Mallen

Description
This is a quick list of interventions that might help fix issues from reward hacking.

(We’re referring to the general definition of reward hacking: when AIs attain high reward without following the developer's intent. So we’re counting things like sycophancy, not just algorithmic reward hacks.)

Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might be direct takeover risks from terminal reward-on-the-episode-seekers, though this is an understudied question. And third, if we hand off a bunch of AI safety work to reward hackers, they might do a bad job of solving superalignment. Of course, there are also commercial incentives to solve some reward hacking problems.

Here is the list (we don't claim novelty):

  • Make environments more robust. [...]
---

First published:
June 10th, 2025

Source:
https://www.lesswrong.com/posts/spZyuEGPzqPhnehyk/a-quick-list-of-reward-hacking-interventions

---

Narrated by TYPE III AUDIO.
