Podcast Episodes
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a linkpost for https://arxiv.org/abs/2401.05566 I'm not …
2 years, 3 months ago
[HUMAN VOICE] "Meaning & Agency" by Abram Demski
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
The goal of this post is to clarify a few concepts relating to…
2 years, 3 months ago
What’s up with LLMs representing XORs of arbitrary features?
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Thanks to Clément Dumas, Nikola Jurković, Nora Belrose, Arthur …
2 years, 3 months ago
Gentleness and the artificial Other
(Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app.)
This is the first essay in a series that I’m c…
2 years, 3 months ago
MIRI 2024 Mission and Strategy Update
As we announced back in October, I have taken on the senior leadership role at MIRI as its CEO. It's a big pair of shoes to fill, and an awesome resp…
2 years, 3 months ago
The Plan - 2023 Version
Background: The Plan, The Plan: 2022 Update. If you haven’t read those, don’t worry, we’re going to go through things from the top this year, and wit…
2 years, 3 months ago
Apologizing is a Core Rationalist Skill
In certain circumstances, apologizing can also be a countersignalling power-move, i.e. “I am so high status that I can grovel a bit without anybody m…
2 years, 3 months ago
[HUMAN VOICE] "A case for AI alignment being difficult" by jessicata
This is a linkpost for https://unstableontology.com/2023/12/31/a-case-for-ai-alignment-being-difficult/
Support ongoing human narrations of LessWrong'…
2 years, 3 months ago
The Dark Arts
lsusr: It is my understanding that you won all of your public forum debates this year. That's very impressive. I thought it would be interesting to dis…
2 years, 3 months ago
Critical review of Christiano’s disagreements with Yudkowsky
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a review of Paul Christiano's article "Where I agree an…
2 years, 3 months ago