Podcast Episodes

“Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations” by Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn

Note: this is a research note based on observations from evaluating Claude Sonnet 3.7. We’re sharing the results of these ‘work-in-progress’ investig…

11 months, 1 week ago

“Levels of Friction” by Zvi

Scott Alexander famously warned us to Beware Trivial Inconveniences.

When you make a thing easy to do, people often do vastly more of it.

When you put …

11 months, 1 week ago

“Why White-Box Redteaming Makes Me Feel Weird” by Zygi Straznickas

There's this popular trope in fiction about a character being mind controlled without losing awareness of what's happening. Think Jessica Jones, The …

11 months, 1 week ago

“Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg

This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute with additional support …

11 months, 2 weeks ago

“Auditing language models for hidden objectives” by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Akbir Khan, Euan Ong, Christopher Olah, Fabien Roger, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub

We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned obj…

11 months, 2 weeks ago

“The Most Forbidden Technique” by Zvi

The Most Forbidden Technique is training an AI using interpretability techniques.

An AI produces a final output [X] via some method [M]. You can analy…

11 months, 2 weeks ago

“Trojan Sky” by Richard_Ngo

You learn the rules as soon as you’re old enough to speak. Don’t talk to jabberjays. You recite them as soon as you wake up every morning. Keep your …

11 months, 2 weeks ago

“OpenAI:” by Daniel Kokotajlo

Exciting Update: OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda I s…

11 months, 2 weeks ago

“How Much Are LLMs Actually Boosting Real-World Programmer Productivity?” by Thane Ruthenis

LLM-based coding-assistance tools have been out for ~2 years now. Many developers have been reporting that this is dramatically increasing their prod…

11 months, 3 weeks ago

“So how well is Claude playing Pokémon?” by Julian Bradshaw

Background: After the release of Claude 3.7 Sonnet,[1] an Anthropic employee started livestreaming Claude trying to play through Pokémon Red. The liv…

11 months, 3 weeks ago

