Podcast Episodes
“Training Deliberative Monitors for Black-Box Scheming Detection” by aksh-n, adityasinha, Victor Gillioz, Simon Storf, Kilian Merkelbach, richbc, Axel Højmark, Marius Hobbhahn
Paper: https://arxiv.org/abs/2605.29601
Thread: https://x.com/aksh_n0/status/2062568855814193497
TL;DR: Training small open-weight monitors provides…
1 day, 10 hours ago
“Lab Leaks, Black Holes, and Eggs: Epistemic Case Study Competition” by Oliver Sourbut, Josh Jacobson, Future of Life Foundation (FLF)
FLF is running a competition to find the best workflows and methodologies for using AI to produce reliable, trustworthy knowledge bases, grounded in…
1 day, 17 hours ago
“(Mis)generalization of Helpful-Only Fine-tuning” by Omar Khursheed, Baram Sosis, Fabien Roger
TL;DR: We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors…
1 day, 18 hours ago
“AI #171: False Flag” by Zvi
This was the week of Claude Opus 4.8. I covered the model card, then model welfare concerns, and finally capabilities and reactions. It's a good mod…
1 day, 19 hours ago
“Building Better Activation Oracles” by ceselder, jan_bauer, Niclas Luick, Adam Karvonen, Neel Nanda
Work done for our MATS 10.0 Sprint project - mentored by Neel Nanda and Adam Karvonen
Hugging Face, GitHub
TL;DR: We have improved the original Activ…
1 day, 21 hours ago
“Rohin Shah on AGI Safety” by anaguma
Rohin Shah recently gave an interview on 80,000 Hours about his views on AGI safety and his work at Google DeepMind. I'm posting the transcript below to …
1 day, 21 hours ago
“Sixteen schemes for AI safety” by Austin Chen
These days, I often run across whippersnappers who are excited to do something for AI safety but aren’t quite sure what. One of the fun things about the F…
1 day, 21 hours ago
“Don’t Edit Your Ideas Before Having Them” by Hide
Editing is far easier than writing. You can usually look at a finished product and notice its flaws in a single read-through. “This section is a bit…
2 days, 19 hours ago
“Trump Signs Executive Order For AI Testing Prior To Frontier Model Releases” by Zvi
Last week we were expecting an Executive Order on Thursday.
Then Trump cancelled it, and said he wouldn’t sign it because he was worried it would b…
2 days, 19 hours ago
“Society Explained: a tool for efficiently exploring >100 theories of society” by spencerg
There are many competing theories of how society does and should function, from Karl Marx and Adam Smith to Steven Pinker and Eliezer Yudkowsky. The…
2 days, 21 hours ago