“Methodology for inferring propensities of LLMs” by Olli Järviniemi
Description
Our team at UK AISI has released a paper on inferring LLM propensities for undesired behaviour.
I view this primarily as a methodology paper, and that is what I will discuss in this post. First, I distinguish the aim of providing evidence on theoretical arguments about misalignment from more red-teaming-flavoured propensity research. Next, I discuss the methodological needs for providing such evidence, highlighting the need to model AIs’ decision-making. Finally, I give my picture of how such a methodology could be developed and applied in practice.
This post can be read independently from the paper.
Aims for propensity research
I use propensity to refer to what models will try to do, in contrast to questions about what they are capable of. My interest is specifically in the propensity for misaligned action, as studying it is instrumental to understanding and mitigating misalignment risks.
One central example of existing propensity research is Anthropic's Agentic Misalignment work. In short, it provides quite a strong and clear-cut demonstration of alignment failure: for example, it shows LLMs blackmailing human operators.
After the work came out, there was discussion and disagreement about the implications of this work for misalignment risks more broadly (e.g. because of [...]
---
Outline:
(00:48) Aims for propensity research
(03:53) Methodological needs
(06:44) Applying the methodology in practice
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
April 24th, 2026
Source:
https://www.lesswrong.com/posts/g9FmhKL2vL45TuT9B/methodology-for-inferring-propensities-of-llms
---
Narrated by TYPE III AUDIO.