“The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness” by Charlie Griffin, Patrick Leask

Published 3 weeks, 2 days ago
Description

1) The safe-to-dangerous shift is a fundamental problem for eval realism

Suppose we have a capable and potentially scheming model, and before deploying it, we want some evidence that it won’t do anything catastrophically dangerous in deployment. A common approach is to use black-box alignment evaluations. However, alignment evaluations are only reassuring to the extent that the model can't reliably[1] distinguish the deployment distribution from the evaluation distribution; otherwise, it is difficult to rule out the possibility of alignment faking.

There are many approaches one could use to try to make evaluations appear more realistic: you can try to create realistic environments (e.g. Petri, WebArena, OSWorld); use data from past deployments (e.g. OpenAI, SAD); and spoof tool-call responses (e.g. ToolEmu).

However, the core difference between an alignment evaluation and a real deployment is that an alignment evaluation must be safe[2]. That is, if you want to safely evaluate an untrusted AI system, you need to limit its ability to cause harm within your evaluation environment. If you want to deploy that system usefully, you need to give it at least some ability to affect the world, and with it the potential to cause harm.

The safe-to-dangerous shift [...]

---

Outline:

(00:15) 1) The safe-to-dangerous shift is a fundamental problem for eval realism

(02:10) 2) The safe-to-dangerous shift is also a problem for measuring eval awareness

(05:09) 3) We should expect measuring eval realism to be hard.

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
May 14th, 2026

Source:
https://www.lesswrong.com/posts/tK8vqHDxaRGcysNJQ/the-safe-to-dangerous-shift-is-a-fundamental-problem-for-1

---

Narrated by TYPE III AUDIO.
