Podcast Episodes
“Can Agents Fool Each Other? Findings from the AI Village” by Shoshannah Tekofsky
The better agents are at deception, the less sure we can be that they are doing what we want. As agents become increasingly capable and autonomous, …
3 weeks, 6 days ago
“$1 billion is not enough; OpenAI Foundation must start spending tens of billions each year” by Davidmanheim
OpenAI is now a public benefit corporation, with a charter that demands they use AGI for the benefit of all, and do so safely. To justify this struc…
3 weeks, 6 days ago
“Is Gemini 3 Scheming in the Wild?” by Alejandro Wainstock, Agustin_Martinez_Suñe, Iván Arcuschin, Victor Braberman
TL;DR
When faced with an unexpected tool response, without any adversarial attack, Gemini 3 deliberately and covertly violates an explicit system pr…
3 weeks, 6 days ago
“Latent Introspection (and other open-source introspection papers)” by vgel
@vgel, Martin Vanek, @Raymond Douglas, @Jan_Kulveit — ACS Research, CTS, Charles University
4 weeks ago
“The Fourth World” by Linch
Is consciousness the last moral world?
Imagine trying to explain to a virus why suffering matters.
A virus is a simple self-replicating molecule: un…
4 weeks ago
“My cost-effectiveness unit” by Zach Stein-Perlman
It feels like the grantmaking around me is only partially moneyball-pilled, or it's only somewhat competent at moneyball. There's alpha in putting n…
4 weeks ago
“The AIXI perspective on AI Safety” by Cole Wyeth
Epistemic status: While I am specialized in this topic, my career incentives may bias me towards a positive assessment of AIXI theory. I am also d…
4 weeks ago
“Measuring and improving coding audit realism with deployment resources” by Connor Kissane, Monte M, Fabien Roger
TL;DR We study realism win rate, a metric for measuring how distinguishable Petri audit transcripts are from real deployment interactions. We use it…
4 weeks ago
“Ablating Split Personality Training” by OscarGilg
I was part of the SPAR team that worked on Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities. I ran some follow…
4 weeks ago
“AI character is a big deal” by wdmacaskill, Tom Davidson
0. Intro
Due to Claude's Constitution and OpenAI's model spec, the issue of AI character has started getting more attention, particularly concerning…
4 weeks ago