Podcast Episodes

“” by null

Source:
https://www.lesswrong.com/posts/HKCKinBgsKKvjQyWK/read-the-pricing-first

---

Narrated by…

10 months ago

“A quick list of reward hacking interventions” by Alex Mallen

This is a quick list of interventions that might help fix issues from reward hacking.

(We’re referring to the general definition of reward hacking: …

10 months ago

“Ghiblification for Privacy” by jefftk

I often want to include an image in my posts to give a sense of a situation. A photo communicates the most, but sometimes that's too much: some partic…

10 months ago

“Broad-Spectrum Cancer Treatments” by sarahconstantin

Midjourney, “engraving of Apollo shooting his bow at a distant cancer cell”

Introduction and Principles

The conventional wisdom is that we can’t “cur…

10 months ago

“When is it important that open-weight models aren’t released? My thoughts on the benefits and dangers of open-weight models in response to developments in CBRN capabilities.” by ryan_greenblatt

Recently, Anthropic released Opus 4 and said they couldn't rule out the model triggering ASL-3 safeguards due to the model's CBRN capabilities. That…

10 months, 1 week ago

“Outer Alignment is the Necessary Compliment to AI 2027’s Best Case Scenario” by Josh Hickman

To the extent we believe more advanced training and control techniques will lead to alignment of agents capable enough to strategically make success…

10 months, 1 week ago

“Dwarkesh Patel on Continual Learning” by Zvi

A key question going forward is the extent to which making further AI progress will depend upon some form of continual learning. Dwarkesh Patel offe…

10 months, 1 week ago

“Personal Agents: AIs as trusted advisors, caretakers, and user proxies” by JWJohnston

Just posted the following on Medium. Interested in comments from readers here, especially pointers to similar efforts and ideas I didn't mention bel…

10 months, 1 week ago

[Linkpost] “METR: Recent frontier models are reward hacking” by Daniel Kokotajlo

This is a link post. METR just made a lovely post detailing many examples they've found of reward hacks by frontier models. Unlike the reward hacks o…

10 months, 1 week ago

[Linkpost] “Identifying ‘Deception Vectors’ In Models” by Stephen Martin

This is a link post. Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting ”…

10 months, 1 week ago
