Podcast Episodes
“Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations” by Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn
Note: this is a research note based on observations from evaluating Claude Sonnet 3.7. We’re sharing the results of these ‘work-in-progress’ investig…
11 months, 1 week ago
“Levels of Friction” by Zvi
Scott Alexander famously warned us to Beware Trivial Inconveniences. When you make a thing easy to do, people often do vastly more of it. When you put …
11 months, 1 week ago
“Why White-Box Redteaming Makes Me Feel Weird” by Zygi Straznickas
There's this popular trope in fiction about a character being mind controlled without losing awareness of what's happening. Think Jessica Jones, The …
11 months, 1 week ago
“Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg
This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute with additional support …
11 months, 2 weeks ago
“Auditing language models for hidden objectives” by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Akbir Khan, Euan Ong, Christopher Olah, Fabien Roger, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned obj…
11 months, 2 weeks ago
“The Most Forbidden Technique” by Zvi
The Most Forbidden Technique is training an AI using interpretability techniques. An AI produces a final output [X] via some method [M]. You can analy…
11 months, 2 weeks ago
“Trojan Sky” by Richard_Ngo
You learn the rules as soon as you’re old enough to speak. Don’t talk to jabberjays. You recite them as soon as you wake up every morning. Keep your …
11 months, 2 weeks ago
“OpenAI:” by Daniel Kokotajlo
Exciting Update: OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda I s…
11 months, 2 weeks ago
“How Much Are LLMs Actually Boosting Real-World Programmer Productivity?” by Thane Ruthenis
LLM-based coding-assistance tools have been out for ~2 years now. Many developers have been reporting that this is dramatically increasing their prod…
11 months, 3 weeks ago
“So how well is Claude playing Pokémon?” by Julian Bradshaw
Background: After the release of Claude 3.7 Sonnet, an Anthropic employee started livestreaming Claude trying to play through Pokémon Red. The liv…
11 months, 3 weeks ago