“From personas to intentions: towards a science of motivations for AI models” by David Africa, Jacob Pfau

Published 1 week, 1 day ago
Description

TLDR:

  • Behavior-only descriptions are useful, but insufficient for aligning advanced models with high assurance.
  • Two models can look equally aligned on ordinary prompts while being driven by very different underlying motivations; this difference may only show up in rare but crucial situations.
  • So persona research should aim to infer motivational structure: the latent drives, values, and priority relations that generate context-specific intentions and behavior.
  • Doing this well likely requires interventional data, model internals, and possibly self-explanations, rather than IID (independent and identically distributed) behavioral samples alone.
  • One concrete direction we propose is inverse constitution learning: reconstructing the model's implicit hierarchy of priorities from behavior, explanations, and internal traces.

Introduction

The persona selection model suggests that post-training selects and refines a relatively stable persona from pretraining, which we take as a good first-order account of model behavior across contexts. But for alignment, we often want a second-order account: not only which persona is selected, but what motivational structure underlies the persona's context-specific intentions.

Why behavior is not enough. The reason is simple: behavior often underdetermines intention. Two systems can behave identically on almost every ordinary input while pursuing different objectives, and those differences may matter significantly [...]

---

Outline:

(01:12) Introduction

(03:47) Towards a science of model intentions

(08:16) Where to from here

---

First published:
April 14th, 2026

Source:
https://www.lesswrong.com/posts/DTDoyDTtC8R3bCiTx/from-personas-to-intentions-towards-a-science-of-motivations

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Hierarchical diagram showing Claude's principles branching into guidelines and sub-guidelines.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
