Episode Details

“Retrying vs Resampling in AI Control” by james.lucassen, Adam Kaufman

Published 1 week, 1 day ago

Description

We’ve just released a new paper: Retrying vs Resampling in AI Control. We revisit the resampling protocols introduced in Ctrl-Z with an up-to-date setting and much stronger models, and compare them against “retrying” protocols similar to Claude Code auto mode or Codex Auto-review.

Motivation

Roughly a year ago we released Ctrl-Z, the first paper to study control techniques for agents. A headline result of that paper was the performance of resample protocols – strategies that involve taking multiple i.i.d. samples from the model per step. But since Ctrl-Z, models have gotten much stronger, and we have built more sophisticated control settings to keep up. We wanted to answer the following questions:

How well do the results from Ctrl-Z hold up with better models and a better setting? Current high-stakes control research is trying to learn by analogy how to do control effectively in a real high-stakes deployment during a real intelligence explosion. Findings about technique performance are going to have to generalize pretty far to be useful.

If the resample protocols from Ctrl-Z still work, what makes them work? One way we try to make our work more generalizable is by understanding the dynamics governing outcomes [...]

---

Outline:

(00:30) Motivation

(02:36) TL;DR Takeaways

(03:55) Bar graph showing safety percentages across different monitoring and resampling methods.

(04:36) Methodology

(07:16) Differences from Ctrl-Z

(12:11) Are Retrying Protocols Exploitable?

(15:24) Cost and Latency of Resampling

(17:22) Conclusion

The original text contained 7 footnotes which were omitted from this narration.

---