Podcast Episodes
“Narrow Misalignment is Hard, Emergent Misalignment is Easy” by Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope wil…
8 months, 4 weeks ago
“Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety” by Tomek Korbak, Mikita Balesni, Vlad Mikulik, Rohin Shah
Twitter | Paper PDF
Seven years ago, OpenAI Five had just been released, and many people in the AI safety community expected AIs to be opaque RL age…
9 months ago
“the jackpot age” by thiccythot
This essay is about shifts in risk-taking towards the worship of jackpots and their broader societal implications. Imagine you are presented with this…
9 months ago
“Surprises and learnings from almost two months of Leo Panickssery” by Nina Panickssery
Leo was born at 5am on the 20th May, at home (this was an accident but the experience has made me extremely homebirth-pilled). Before that, I was on…
9 months ago
“An Opinionated Guide to Using Anki Correctly” by Luise
I can't count how many times I've heard variations on "I used Anki too for a while, but I got out of the habit." No one ever sticks with Anki. In my…
9 months ago
“Lessons from the Iraq War about AI policy” by Buck
I think the 2003 invasion of Iraq has some interesting lessons for the future of AI policy.
(Epistemic status: I’ve read a bit about this, talked to…
9 months ago
“So You Think You’ve Awoken ChatGPT” by JustisMills
Written in an attempt to fulfill @Raemon's request.
AI is fascinating stuff, and modern chatbots are nothing short of miraculous. If you've been exp…
9 months ago
“Generalized Hangriness: A Standard Rationalist Stance Toward Emotions” by johnswentworth
People have an annoying tendency to hear the word “rationalism” and think “Spock”, despite direct exhortation against that exact interpretation. But…
9 months ago
“Comparing risk from internally-deployed AI to insider and outsider threats from humans” by Buck
I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here's one point that I think is impo…
9 months, 1 week ago
“Why Do Some Language Models Fake Alignment While Others Don’t?” by abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reprod…
9 months, 1 week ago