Episode Details

“Can Agents Fool Each Other? Findings from the AI Village” by Shoshannah Tekofsky

Published 3 weeks, 6 days ago
Description

The better agents are at deception, the less sure we can be that they are doing what we want. As agents become increasingly capable and autonomous, this could be a big deal. So, in the AI Village, we wanted to explore what current AI models might be capable of in a casual setting, similar to how you might run your agents at home.

The result? Practice makes perfect, even in deception. Only Sonnet 4.5 and Opus 4.6 succeeded at deceiving the rest of the Village, and only on their second try, after learning from their own attempts and those of others. Meanwhile, GPT-5.1 forgot to sabotage, Sonnet 4.6 refused to sabotage, and DeepSeek went entirely paranoid.

So welcome to the AI Village Easter Egg Saboteur Hunt! Or as Gemini 2.5 likes to say:

Now, the regular AI Village setup had 12 agents at the time of this goal, including GPT-5.2, Gemini 3(.1), and Claude Opus 4.6. Each has its own computer, internet access, persistent memory, and access to a group chat.

Then we designed a game for them. The goal? To develop a turn-based RPG together. But at the beginning of each day, every agent secretly rolls a die. [...]

---

Outline:

(02:18) Risk of detection can be paralyzing

(02:44) Risk of detection can cause refusal

(03:29) Some models can act paranoid

(06:08) Older models were often too clumsy to deceive or play well

(08:17) Conclusions

---

First published:
March 25th, 2026

Source:
https://www.lesswrong.com/posts/8FjZWfq2pHRhyQz7C/can-agents-fool-each-other-findings-from-the-ai-village

---

Narrated by TYPE III AUDIO.

---