Episode Details

“The hard core of alignment (is robustifying RL)” by Cole Wyeth

Published 3 weeks, 1 day ago

Description

Most technical AI safety work that I read seems to miss the mark, failing to make any progress on the hard part of the problem. I think this is a common sentiment, but there's less agreement about what exactly the hard part is. Characterizing this more clearly might save a lot of time and better target the search for solutions. In this post I explain my model of why alignment is technically hard to achieve, setting aside the regulatory, competitive, and geopolitical challenges, the sheer incompetence and unforced errors of the players, and the other factors which decrease our chances of success.

I claim there is something like a "hard core": a common stumbling block for all approaches. In other words, there is something that makes alignment hard, rather than a bunch of unrelated things that make each approach to alignment hard independently. Since many different "hopes" for alignment seem quite far apart, this would seem to be a remarkable and unexpected state of affairs. On the other hand, such an expansive graveyard of failed proposals suggests a common culprit.

Semantics. When used informally in conversation, "hard core" is a bit like "complete problem." But there may be [...]

---

Outline:

(01:32) Background

(03:02) The Problem

(09:19) The Barrier

(16:49) Providing better feedback

(22:07) Behaviorist vs. process feedback

---

First published:
May 15th, 2026

Source:
https://www.lesswrong.com/posts/JT3qCYDimskcBdiEr/the-hard-core-of-alignment-is-robustifying-rl

---

Narrated by TYPE III AUDIO.
