Podcast Episodes
“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato
Abstract
We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignm…
4 months, 3 weeks ago
“Anthropic is (probably) not meeting its RSP security commitments” by habryka
TLDR: An AI company's model weight security is at most as good as its compute providers' security. Anthropic has committed (with a bit of ambiguity,…
4 months, 3 weeks ago
“Varieties Of Doom” by jdp
There has been a lot of talk about "p(doom)" over the last few years. This has always rubbed me the wrong way because "p(doom)" didn't feel like it ma…
4 months, 3 weeks ago
“How Colds Spread” by RobertM
It seems like a catastrophic civilizational failure that we don't have confident common knowledge of how colds spread. There have been a number of s…
4 months, 4 weeks ago
“New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence” by Aaron_Scher, David Abecassis, Brian Abeyta, peterbarnett
TLDR: We at the MIRI Technical Governance Team have released a report describing an example international agreement to halt the advancement towards …
4 months, 4 weeks ago
“Where is the Capital? An Overview” by johnswentworth
When a new dollar goes into the capital markets, after being bundled and securitized and lent several times over, where does it end up? When society…
4 months, 4 weeks ago
“Problems I’ve Tried to Legibilize” by Wei Dai
Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk …
4 months, 4 weeks ago
“Do not hand off what you cannot pick up” by habryka
Delegation is good! Delegation is the foundation of civilization! But in the depths of delegation madness breeds and evil rises.
In my experience, …
4 months, 4 weeks ago
“7 Vicious Vices of Rationalists” by Ben Pace
Vices aren't behaviors that one should never do. Rather, vices are behaviors that are fine and pleasurable to do in moderation, but tempting to do i…
4 months, 4 weeks ago
“Tell people as early as possible it’s not going to work out” by habryka
Context: Post #4 in my sequence of private Lightcone Infrastructure memos edited for public consumption
This week's principle is more about how I wa…
4 months, 4 weeks ago