Podcast Episodes
“A quick list of reward hacking interventions” by Alex Mallen
This is a quick list of interventions that might help fix issues from reward hacking.
(We’re referring to the general definition of reward hacking: …
10 months ago
“Ghiblification for Privacy” by jefftk
I often want to include an image in my posts to give a sense of a situation. A photo communicates the most, but sometimes that's too much: some partic…
10 months ago
“Broad-Spectrum Cancer Treatments” by sarahconstantin
Midjourney, “engraving of Apollo shooting his bow at a distant cancer cell” Introduction and Principles
The conventional wisdom is that we can’t “cur…
10 months ago
“When is it important that open-weight models aren’t released? My thoughts on the benefits and dangers of open-weight models in response to developments in CBRN capabilities.” by ryan_greenblatt
Recently, Anthropic released Opus 4 and said they couldn't rule out the model triggering ASL-3 safeguards due to the model's CBRN capabilities. That…
10 months, 1 week ago
“Outer Alignment is the Necessary Compliment to AI 2027’s Best Case Scenario” by Josh Hickman
To the extent we believe more advanced training and control techniques will lead to alignment of agents capable enough to strategically make success…
10 months, 1 week ago
“Dwarkesh Patel on Continual Learning” by Zvi
A key question going forward is the extent to which making further AI progress will depend upon some form of continual learning. Dwarkesh Patel offe…
10 months, 1 week ago
“Personal Agents: AIs as trusted advisors, caretakers, and user proxies” by JWJohnston
Just posted the following on Medium. Interested in comments from readers here, especially pointers to similar efforts and ideas I didn't mention bel…
10 months, 1 week ago
[Linkpost] “METR: Recent frontier models are reward hacking” by Daniel Kokotajlo
This is a link post. METR just made a lovely post detailing many examples they've found of reward hacks by frontier models. Unlike the reward hacks o…
10 months, 1 week ago
[Linkpost] “Identifying ‘Deception Vectors’ In Models” by Stephen Martin
This is a link post. Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting “…
10 months, 1 week ago
“The Unparalleled Awesomeness of Effective Altruism Conferences” by omnizoid
Crosspost from my blog.
I just got back from Effective Altruism Global London—a conference that brought together lots of different people trying t…
10 months, 1 week ago