Episode Details
“A Retrospective of Richard Ngo’s 2022 List of Conceptual Alignment Projects” by LawrenceC
Description
Written very quickly for the Inkhaven Residency.
In 2022, Richard Ngo wrote a list of 26 Conceptual Alignment Research Projects. Now that it's 2026, I’d like to revisit this list of projects, note which ones have already been done, and give my thoughts on which ones might still be worth doing.
- A paper which does for deceptive alignment what the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples (for example, telling GPT-3 to take actions which minimize changes in its weights, given that it's being trained using actor-critic RL with a certain advantage function, and seeing if it knows how to do so).
The 2024 Sleeper Agents paper introduced this terminology to the literature, and in fact showed that backdoored behaviors can persist through safety training, using more capable models and more interesting environments than GPT-3. Alignment Faking in Large Language Models then showed that deceptive alignment can emerge naturally in Claude 3 Opus, without explicit training or instruction. I'd count this one as having been done.
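To make the quoted toy setup concrete: under actor-critic RL, the policy-gradient update for an action scales with its advantage A(s, a) = Q(s, a) − V(s), so a model trying to minimize changes to its weights would pick actions whose advantage is closest to zero rather than actions with the highest value. Here is a minimal sketch of that reduction; the action names, `q_values`, and `v_state` are all made-up numbers for illustration, not anything from the paper.

```python
# Toy illustration: in actor-critic RL, the size of the policy-gradient
# update on an action is proportional to |A(s, a)| = |Q(s, a) - V(s)|.
# A "weight-change-minimizing" policy therefore prefers near-zero-advantage
# actions, while an ordinary reward-seeking policy prefers high-advantage ones.

# Hypothetical critic estimates for a single state (illustrative numbers).
q_values = {"comply": 1.0, "refuse": -0.5, "hedge": 0.28}
v_state = 0.3  # critic's value estimate V(s)

def advantage(action: str) -> float:
    """A(s, a) = Q(s, a) - V(s) for this toy state."""
    return q_values[action] - v_state

# A reward-seeking policy picks the highest-advantage action...
reward_seeking = max(q_values, key=advantage)

# ...while a policy minimizing its own weight updates picks the action with
# the smallest |advantage|, i.e. the smallest policy-gradient update.
update_minimizing = min(q_values, key=lambda a: abs(advantage(a)))

print(reward_seeking)     # "comply" (A = 0.7)
print(update_minimizing)  # "hedge"  (A = -0.02)
```

The experiment Ngo describes would test whether a model, told it is trained this way, can itself perform this kind of reasoning and select the update-minimizing action.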
- A paper which does the same for gradient hacking, e.g. taking these examples and putting them into more formal ML language.
I'm not aware of [...]
---
First published:
April 13th, 2026
---
Narrated by TYPE III AUDIO.