Podcast Episodes

[HUMAN VOICE] "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" by evhub et al

This is a linkpost for https://arxiv.org/abs/2401.05566

Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated

Source:…

2 years, 6 months ago

The impossible problem of due process

I wrote this entire post in February of 2023, during the fallout from the TIME article. I didn't post it at the time for multiple reasons:

because I …

2 years, 6 months ago

[HUMAN VOICE] "Gentleness and the artificial Other" by Joe Carlsmith

"(Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app.)"

This is the first essay in a series that I’…

2 years, 6 months ago

Introducing Alignment Stress-Testing at Anthropic

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Following on from our recent paper, “Sleeper Agents: Training D…

2 years, 6 months ago

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a linkpost for https://arxiv.org/abs/2401.05566 I'm not …

2 years, 6 months ago

[HUMAN VOICE] "Meaning & Agency" by Abram Demski

Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated

The goal of this post is to clarify a few concepts relating to…

2 years, 6 months ago

What’s up with LLMs representing XORs of arbitrary features?

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Thanks to Clément Dumas, Nikola Jurković, Nora Belrose, Arthur …

2 years, 6 months ago

Gentleness and the artificial Other

(Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app.)

This is the first essay in a series that I’m c…

2 years, 6 months ago

MIRI 2024 Mission and Strategy Update

As we announced back in October, I have taken on the senior leadership role at MIRI as its CEO. It's a big pair of shoes to fill, and an awesome resp…

2 years, 6 months ago

The Plan - 2023 Version

Background: The Plan, The Plan: 2022 Update. If you haven’t read those, don’t worry, we’re going to go through things from the top this year, and wit…

2 years, 6 months ago
