[Linkpost] “Identifying ‘Deception Vectors’ In Models” by Stephen Martin

Published 8 months, 3 weeks ago
Description
This is a link post. Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting "deception vectors" via Linear Artificial Tomography (LAT) for 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, unveiling the specific honesty-related issue of reasoning models and providing tools for trustworthy AI alignment.



This seems like a positive breakthrough for mech interp research generally. The team used RepE to identify features and was able to "reliably suppress or induce strategic deception".
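
For readers who want a feel for what "extracting a vector" and "activation steering" mean in practice, here is a minimal sketch on synthetic data. It is not the paper's pipeline: the activations are random vectors with a planted direction (`true_dir`, `fake_activations`, and every number below are hypothetical), and the extraction step is plain PCA over pooled activations, a simplified stand-in for LAT. The three steps are: find the direction, detect by projecting onto it, and steer by adding it.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # hypothetical hidden-state width

# Hypothetical "deception" direction planted in the synthetic data.
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)


def fake_activations(deceptive: bool, n: int) -> np.ndarray:
    """Stand-in for hidden states read from one layer of a real model."""
    shift = 3.0 if deceptive else -3.0
    return rng.normal(size=(n, dim)) + shift * true_dir


def top_principal_direction(acts: np.ndarray) -> np.ndarray:
    """Unit vector along the highest-variance axis of the activations."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


# 1) Extract: PCA over honest and deceptive activations pooled together.
train_honest = fake_activations(False, 200)
train_deceptive = fake_activations(True, 200)
direction = top_principal_direction(np.vstack([train_honest, train_deceptive]))

# The sign of a principal component is arbitrary, so orient it so that
# deceptive samples project higher than honest ones.
if (train_deceptive @ direction).mean() < (train_honest @ direction).mean():
    direction = -direction

# 2) Detect: threshold the projection at the midpoint of the class means.
threshold = 0.5 * (
    (train_deceptive @ direction).mean() + (train_honest @ direction).mean()
)
test_honest = fake_activations(False, 100)
test_deceptive = fake_activations(True, 100)
correct = (test_deceptive @ direction > threshold).sum() + (
    test_honest @ direction <= threshold
).sum()
accuracy = correct / 200

# 3) Steer: push honest activations along the extracted direction.
alpha = 4.0  # hypothetical steering strength
steered = test_honest + alpha * direction

print(f"cosine with planted direction: {direction @ true_dir:.3f}")
print(f"held-out detection accuracy:   {accuracy:.3f}")
```

In a real model the steered activations would be written back into the forward pass at the chosen layer. Here the only visible effect is that each projection onto `direction` moves up by `alpha`.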

---

First published:
June 9th, 2025

Source:
https://www.lesswrong.com/posts/3WyFmtiLZTfEQxJCy/identifying-deception-vectors-in-models

Linkpost URL:
https://arxiv.org/pdf/2506.04909

---

Narrated by TYPE III AUDIO.
