“Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour” by JasonB, Edward James Young

Published 1 week, 1 day ago
Description

Summary

Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of only pre-deployment evaluations. One approach to making such claims is to take a cognitive perspective, in which we interpret the AI's behaviour in terms of latent cognitive constructs, such as motivations, intentions, and goals. Because the same behaviour may be compatible with a range of underlying cognition (such as scheming, fitness-seeking, or aligned motivations), inferring cognition from a behavioural snapshot can be tricky. In this post, we introduce the idea of Developmental Cognitive Interpretability (DCI), which aims to model how cognitive constructs change over the course of training. Further, by understanding how cognition results from training pipelines, we can predict agent behaviour resulting from pipelines that have not yet been tested.

We discuss core assumptions and philosophical background of DCI, and lay out a broader research agenda. We have some initial evidence that the methodology works in at least one toy setting, and our current main uncertainty is whether we can scale our approach to LLMs. We invite those interested in working on these problems to reach out to us at jrb239[at]cam[dot]ac[dot]uk and edward[at]geodesicresearch[dot]org.

1. Motivation

[...]

---

Outline:

(00:16) Summary

(01:41) 1. Motivation

(03:11) 2. The Agenda

(06:18) 3. Worked Example: RL agents trained to navigate a maze

(09:19) 4. Why we expect this to scale to LLMs

(11:23) 5. Open problems & call to engage

The original text contained 4 footnotes which were omitted from this narration.

---

First published:
May 29th, 2026

Source:
https://www.lesswrong.com/posts/oCcGiDzWYQeJkhhZY/developmental-cognitive-interpretability-a-research-agenda-1

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Diagram showing training pipeline stages and comparison of predicted versus empirical preferences.
