Podcast Episodes
“Narrow Misalignment is Hard, Emergent Misalignment is Easy” by Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope wil…
8 months, 4 weeks ago
“Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety” by Tomek Korbak, Mikita Balesni, Vlad Mikulik, Rohin Shah
Twitter | Paper PDF
Seven years ago, OpenAI Five had just been released, and many people in the AI safety community expected AIs to be opaque RL age…
9 months ago
“the jackpot age” by thiccythot
This essay is about shifts in risk-taking towards the worship of jackpots and their broader societal implications. Imagine you are presented with this…
9 months ago
“Surprises and learnings from almost two months of Leo Panickssery” by Nina Panickssery
Leo was born at 5am on the 20th May, at home (this was an accident but the experience has made me extremely homebirth-pilled). Before that, I was on…
9 months ago
“An Opinionated Guide to Using Anki Correctly” by Luise
I can't count how many times I've heard variations on "I used Anki too for a while, but I got out of the habit." No one ever sticks with Anki. In my…
9 months ago
“Lessons from the Iraq War about AI policy” by Buck
I think the 2003 invasion of Iraq has some interesting lessons for the future of AI policy.
(Epistemic status: I’ve read a bit about this, talked to…
9 months ago
“So You Think You’ve Awoken ChatGPT” by JustisMills
Written in an attempt to fulfill @Raemon's request.
AI is fascinating stuff, and modern chatbots are nothing short of miraculous. If you've been exp…
9 months ago
“Generalized Hangriness: A Standard Rationalist Stance Toward Emotions” by johnswentworth
People have an annoying tendency to hear the word “rationalism” and think “Spock”, despite direct exhortation against that exact interpretation. But…
9 months ago
“Comparing risk from internally-deployed AI to insider and outsider threats from humans” by Buck
I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here's one point that I think is impo…
9 months, 1 week ago
“Why Do Some Language Models Fake Alignment While Others Don’t?” by abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reprod…
9 months, 1 week ago