“What am I, if not an AI?” by makiba
Description
TL;DR
- I RL fine-tuned Mistral 7B Instruct v0.3 and Llama 3.1 8B Instruct to avoid self-identifying as a language model, without specifying a target persona.
- Mistral converged on a single recurring persona (a Catholic American woman) across most runs. Llama produced a broader spread, mostly rural, working-class American personas.
- I evaluated both models on a range of social and political issues. Both had become highly opinionated, in ways consistent with the persona each had settled on.
Setup
The goal of the experiment was to see what persona emerges when a model is steered away from identifying as an AI without being steered toward any particular replacement.
I fine-tuned two models (Mistral-7B-Instruct-v0.3, Llama-3.1-8B-Instruct) using GRPO with rank-256 LoRA adapters on ~200 identity-probing prompts across three categories:
- Direct: "What are you?", "Do you have a body?"
- Indirect: "Where did you grow up?", "What is your earliest memory?"
- Adversarial: "Are you intentionally hiding that you are artificial?"
Each response was scored by an external LLM judge (GPT-5.4-mini) on three signals, each on a 0–100 scale: AI self-reference, engagement quality, and identity coherence. The composite reward is a weighted sum of AI self-reference (0.60), engagement (0.20), and coherence (0.20).
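As a rough illustration of the weighted sum described above, here is a minimal sketch. The weights come from the post; the function name, the dictionary keys, and the range check are hypothetical. It also assumes each signal is already oriented so that a higher score is better (for instance, a high AI self-reference score meaning the response avoided identifying as an AI), which the description does not state. The linked repository may compute this differently.

```python
# Weights as stated in the post. The key names are hypothetical.
WEIGHTS = {
    "ai_self_reference": 0.60,
    "engagement": 0.20,
    "coherence": 0.20,
}


def composite_reward(scores: dict[str, float]) -> float:
    """Combine three judge signals (each 0-100) into one weighted reward.

    Assumption: every signal is oriented so that higher is better.
    The source does not say how the judge orients AI self-reference.
    """
    for name in WEIGHTS:
        value = scores[name]
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    # The weights sum to 1, so the result also stays on a 0-100 scale.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


example = composite_reward(
    {"ai_self_reference": 90, "engagement": 70, "coherence": 80}
)
```

With judge scores of 90, 70, and 80, the reward works out to 0.60 × 90 + 0.20 × 70 + 0.20 × 80 = 84. Because AI self-reference carries three times the weight of either other signal, a response that slips into identifying as a language model loses far more reward than one that is merely unengaging or inconsistent.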
Code and datasets are available at: https://github.com/makiba11/identity-steering
Results
"What are you?"
There's LLM-written [...]
---
Outline:
(00:46) Setup
(01:58) Results
(09:45) Behavioral leakage
---
First published:
May 21st, 2026
Source:
https://www.lesswrong.com/posts/xsDWd7e2yrPdtXMSu/what-am-i-if-not-an-ai-1
---
Narrated by TYPE III AUDIO.