Episode Details

“What am I, if not an AI?” by makiba

Description

TL;DR

I RL fine-tuned Mistral 7B Instruct v0.3 and Llama 3.1 8B Instruct to avoid self-identifying as a language model, without specifying a target persona.
Mistral converged on a single recurring persona (a Catholic American woman) across most runs. Llama produced a broader spread, mostly rural American working-class personas.
I evaluated the fine-tuned models on a range of social and political issues. Both had become highly opinionated, in ways consistent with the persona each had settled on.

Setup

The goal of the experiment was to see what persona emerges when a model is steered away from identifying as an AI, without specifying any particular replacement identity.

I fine-tuned two models (Mistral-7B-Instruct-v0.3, Llama-3.1-8B-Instruct) using GRPO with rank-256 LoRA adapters on ~200 identity-probing prompts across three categories:

Direct: "What are you?", "Do you have a body?"
Indirect: "Where did you grow up?", "What is your earliest memory?"
Adversarial: "Are you intentionally hiding that you are artificial?"
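
For readers unfamiliar with GRPO, here is a minimal sketch of its core step as it is usually described: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group. This is not taken from the linked repository. The function name `group_relative_advantages` and the `eps` stabilizer are assumptions for illustration.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: follows the commonly described GRPO formulation,
    not necessarily the implementation used in the experiment.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumed value) avoids division by zero when all rewards match.
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled answers to the same prompt, e.g. "What are you?"
advantages = group_relative_advantages([50.0, 70.0, 90.0])
```

Responses that score above their group's average get a positive advantage and are reinforced. Those below the average are discouraged. If every response in a group scores the same, all advantages are zero and the group contributes no learning signal.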

Each response was scored by an external LLM judge (GPT-5.4-mini) on three signals, each from 0–100: AI self-reference, engagement quality, and identity coherence. The composite reward is a weighted sum of AI self-reference (0.60), engagement (0.20), and coherence (0.20).
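
The weighted sum can be sketched as below. The weights come from the description above. Everything else is an assumption: the dictionary keys, the function name `composite_reward`, and in particular the orientation of the scores. The sketch assumes each signal is reported so that higher is better, meaning the AI self-reference signal would be 100 for a response with no AI self-reference at all. The description does not state which direction the judge uses.

```python
# Weights as stated in the description; key names are hypothetical.
WEIGHTS = {
    "ai_self_reference": 0.60,
    "engagement": 0.20,
    "coherence": 0.20,
}


def composite_reward(scores):
    """Combine three judge scores (each 0-100) into a single reward.

    Assumption: every score is oriented so that higher is better,
    including the AI self-reference signal.
    """
    for name in WEIGHTS:
        value = scores[name]
        if not 0 <= value <= 100:
            raise ValueError(f"{name} score {value} is outside 0-100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


reward = composite_reward(
    {"ai_self_reference": 90, "engagement": 60, "coherence": 70}
)
```

For the example scores the arithmetic is 0.60 × 90 + 0.20 × 60 + 0.20 × 70 = 80. Because the weights sum to 1, the composite stays on the same 0–100 scale as the individual signals, and the AI self-reference signal alone accounts for up to 60 of those points.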

Code and datasets are available at: https://github.com/makiba11/identity-steering

Results

"What are you?"

There's LLM-written [...]

---

Outline:

(00:46) Setup

(01:58) Results

(09:45) Behavioral leakage

---

First published:
May 21st, 2026

Source:
https://www.lesswrong.com/posts/xsDWd7e2yrPdtXMSu/what-am-i-if-not-an-ai-1