Episode Details

“From personas to intentions: towards a science of motivations for AI models” by David Africa, Jacob Pfau

Published 1 week, 1 day ago

Description

TLDR:

Behavior-only descriptions are useful, but insufficient for aligning advanced models with high assurance.
Two models can look equally aligned on ordinary prompts while being driven by very different underlying motivations; this difference may only show up in rare but crucial situations.
So persona research should aim to infer motivational structure: the latent drives, values, and priority relations that generate context-specific intentions and behavior.
Doing this well likely requires interventional data, model internals, and possibly self-explanations, as opposed to only IID behavioral samples.
One concrete direction we propose is inverse constitution learning: reconstructing the model's implicit hierarchy of priorities from behavior, explanations, and internal traces.

Introduction

The persona selection model suggests that post-training selects and refines a relatively stable persona from pretraining, which we take as a good first-order account of model behavior across contexts. But for alignment, we often want a second-order account: not only which persona is selected, but what motivational structure underlies the persona's context-specific intentions.

Why behavior is not enough. The reason for this is simple: behavior often underdetermines intention. Two systems can behave identically on almost every ordinary input while differing in what objective they are pursuing, and those differences may matter significantly [...]

---

Outline:

(01:12) Introduction

(03:47) Towards a science of model intentions

(08:16) Where to from here

---

First published:
April 14th, 2026

Source:
https://www.lesswrong.com/posts/DTDoyDTtC8R3bCiTx/from-personas-to-intentions-towards-a-science-of-motivations

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Hierarchical diagram showing Claude's principles branching into guidelines and sub-guidelines.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
