Podcast Episodes
“Preparing for Warning Shots to Catalyze International Cooperation on AGI Risks” by Mark Kagach ☘️, EliasSchlie, Thomas Van Damme, JustinShovelain
Summary
This is a write-up on preparing for warning shots to catalyze international cooperation on AGI risks, and the corollary list of projects one…
10 hours ago
“Beyond the lexical personality traits: What is the structure of personality?” by tailcalled
This is a description of the methodology behind the latest iteration of my Targeted Personality Test. Feel free to take it either before or after re…
13 hours ago
“Logits as a new monitor for evaluation awareness” by Santiago Aranguri
TL;DR:
We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence. The log…
18 hours ago
“My research agenda and work” by Seth Herd
This is a summary of the work I've done and work I plan to do, and the theories of change and AI progress that motivate my work. I've been working f…
18 hours ago
“One Year of PauseAI UK” by Joseph Miller, PauseAI UK
About one year ago, I started spending most of my time organising PauseAI UK. At that time our largest protest had seen fewer than 50 attendees, no …
22 hours ago
“Learnings from starting an AI safety research team” by draganover, Erin Robertson
This post's goal is to distill our takeaways from building a research team (somewhat) from scratch over the past four months. We describe some conte…
23 hours ago
“Training Deliberative Monitors for Black-Box Scheming Detection” by aksh-n, adityasinha, Victor Gillioz, Simon Storf, Kilian Merkelbach, richbc, Axel Højmark, Marius Hobbhahn
Paper: https://arxiv.org/abs/2605.29601
Thread: https://x.com/aksh_n0/status/2062568855814193497
TL;DR: Training small open-weight monitors provides…
1 day, 9 hours ago
“Lab Leaks, Black Holes, and Eggs: Epistemic Case Study Competition” by Oliver Sourbut, Josh Jacobson, Future of Life Foundation (FLF)
FLF is running a competition to find the best workflows and methodologies for using AI to produce reliable, trustworthy knowledge bases, grounded in…
1 day, 16 hours ago
“(Mis)generalization of Helpful-Only Fine-tuning” by Omar Khursheed, Baram Sosis, Fabien Roger
TLDR
We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors…
1 day, 17 hours ago
“AI #171: False Flag” by Zvi
This was the week of Claude Opus 4.8. I covered the model card, then model welfare concerns, and finally capabilities and reactions. It's a good mod…
1 day, 18 hours ago