“Power-seeking agents will likely be developed” by Alec Harris
Description
I am going to argue that we will likely eventually get AIs that are strongly power-seeking, much more so than current SOTA LLMs.[1]
TLDR
- Right now SOTA LLMs are still largely in a simulator regime. This buffers against power-seeking.
- Long-horizon RL or similar methods (applied to LLMs or otherwise) will turn AIs into consequentialists, motivating power-seeking.
- It will likely be difficult to prevent other actors from building consequentialist AI without leading labs being prepared to do so themselves.
Instrumental convergence does not apply to pretraining
LLM pretraining and SFT can be understood as creating a simulator. The model learns to imitate the continuation of the training distribution conditioned on the prompt. Note that a simulator, in this sense, does not optimize for simulation[2]; for example, it will not be inclined to harvest compute to improve its simulations. This is because simulators are consequence-blind: they don’t take into account the effects of their actions on the future. My favorite way to see this is that the gradients don’t flow through the conditional (the previous tokens), which is treated as a constant.
So even if altering the parameters would change the previous tokens and thereby improve the current prediction, the [...]
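As a rough illustration of the gradient point (a toy sketch, not taken from the original post): assume a one-parameter "model" and continuous numbers standing in for tokens. The names `predict`, `generate_context`, and all values are hypothetical. The sketch compares the pretraining-style gradient, where the context is frozen data, with the total derivative that would also count how the parameter changes the context.

```python
def predict(theta, context):
    """Toy 'next-token' prediction: a linear function of the context."""
    return theta * context


def loss(theta, context, target):
    """Squared error between the prediction and the observed next value."""
    return (predict(theta, context) - target) ** 2


def generate_context(theta):
    """Hypothetical assumption: the context the model itself would emit."""
    return theta


def finite_diff(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


theta, target = 0.5, 2.0

# The context is produced once and then frozen, like tokens in a dataset.
fixed_context = generate_context(theta)

# Pretraining-style gradient: the context is a constant, so the effect of
# theta on earlier tokens is invisible to the update.
grad_consequence_blind = finite_diff(
    lambda t: loss(t, fixed_context, target), theta
)

# Total derivative: also counts how changing theta would change the context.
grad_through_context = finite_diff(
    lambda t: loss(t, generate_context(t), target), theta
)

print("context held constant:", round(grad_consequence_blind, 4))
print("context depends on theta:", round(grad_through_context, 4))
```

With these toy values the first gradient is about -1.75 and the second about -3.5. The gap is the part of the signal that passes through the context, which a teacher-forced objective never sees.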
---
Outline:
(00:46) Instrumental convergence does not apply to pretraining
(02:28) Long-horizon optimization leads to consequentialism
(05:29) Consequentialism is useful
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
May 20th, 2026
Source:
https://www.lesswrong.com/posts/CtnHpECuoq6eLL8fu/power-seeking-agents-will-likely-be-developed
---
Narrated by TYPE III AUDIO.