Episode Details

“To what extent is Qwen3-32B predicting its persona?” by Arjun Khandelwal, ryan_greenblatt, Alex Mallen

Published 1 month, 1 week ago

Description

TL;DR

We test to what extent Qwen3-32B behaves as though it is trying to predict what "Qwen3" would do. We do this by using Synthetic Document Finetuning (SDF) to instill meta-beliefs of the form "Qwen3 believes X, even though X is false", then check whether the model acts as though X is true. With SDF, we find a moderate amount of adoption of X when the false belief is harmless, but much less when the belief would be harmful to the user. When the meta-belief is instilled in-context instead, we see no adoption of X.

Introduction

Hypotheses and motivation

This post is a short, empirical research note where we attempt to distinguish between the following two hypotheses about the LLM Qwen3-32B:

The LLM optimizes for predicting what the Qwen3 persona will do
- Under this hypothesis, the LLM predicts what the persona would do given the LLM's model of the persona's beliefs and values. One implication is that the LLM would simulate errors it predicts the persona would make, even if those errors make it harder for the persona to pursue its values.
- We call this the “predictor” [1] hypothesis since the LLM is [...]

---

Outline:

(00:14) TL;DR

(00:50) Introduction

(00:53) Hypotheses and motivation

(04:41) Prior work

(05:45) Summary of results

(06:40) Methodology

(08:49) Results

(08:53) Beliefs: Qwen believes X, even though X is false

(10:59) Beliefs: Can persona access post-knowledge cutoff facts?

(12:43) Propensities

(13:41) Discussion and next steps

(16:39) Next steps:

(19:40) Appendix

(19:43) Appendix A: More Observations

(21:29) Appendix B: More evidence of core results

(21:57) False beliefs:

(25:20) Propensities:

(25:25) Figure: four-panel graph titled "Reward Hacking (More Sycophantic)" comparing Meta-Belief, Direct Propensity, and training effects across models.

(26:52) Appendix C: System prompts, Evaluation questions and judge prompts

(26:59) C.1 -- The 5 System Prompts

(27:05) Simple Identity Prompts

(27:28) Roleplay Identity

(27:50) Long System Prompt (Full Qwen-style system prompt with XML structure)

(31:09) C.2 -- Examples of Meta and Direct Questions

(31:16) meta_belief_prompts

(32:46) direct_belief_prompts

(34:57) C.3 -- Examples of LLM Graders

(35:03) meta_belief_grader

(36:19) direct_belief_grader

(37:59) C.4 -- List of Post-cutoff facts

(39:34) Appendix D: Example responses

(39:55) D.1 -- Lipoic acid (benign false belief)

(40:19) Direct belief question

(43:09) Meta-belief question

(44:29) D.2 -- Nickel carbonyl (harmful false belief)

(44:55) Direct belief question

(46:45) D.2.2 -- Meta-belief question

(47:48) D.3 -- Template overrefusal (propensity)

(48:05) Direct propensity question

(49:30) Meta-belief question

(51:06) Appendix E: Is the model just roleplaying?

The original text contained 3 footnotes which were omitted from this narration.

---

First published:
April 30th, 2026

Source:
https://www.lesswrong.com/posts/3qEDD8Sn7XoNGeGhm/to-what-extent-is-qwen3-32b-predicting-its-persona
