“CLR’s Safe Pareto Improvements Research Agenda” by Anthony DiGiovanni

Published 1 day, 14 hours ago
Description

Executive summary

  • Safe Pareto improvements (SPIs) are ways of changing agents’ bargaining strategies that make all parties better off, regardless of their original strategies. SPIs are an unusually robust approach to preventing catastrophic conflict between AI systems, especially AIs capable of credible commitments. This is because SPIs can reduce the costs of conflict without shifting bargaining power or requiring agents to agree on what counts as “fair”.
  • Despite their appeal, SPIs aren’t guaranteed to be adopted. AIs or humans in the loop might lock in SPI-incompatible commitments, or undermine other parties’ incentives to agree to SPIs. This agenda describes the Center on Long-Term Risk’s plan to address these risks:
    • Evaluations and datasets (Part I): We’ll develop evals to identify when current models endorse SPI-incompatible behavior, such as making irreversible commitments without considering more robust alternatives. We also aim to demonstrate more SPI-compatible behavior, via simple interventions that can be done outside AI companies (e.g., providing SPI resources in context).
    • Conceptual research and SPI pitch (Part II): We’ll research two questions: under what conditions do agents individually prefer SPIs, and how might early AI development foreclose the option to implement them? These findings will [...]

---

Outline:

(00:11) Executive summary

(02:18) Introduction

(08:20) 1. Evaluations and datasets for SPI-incompatibility

(12:17) 2. Conceptual research and pitch on avoiding SPI lock-out

(13:22) 2.1. Incentive lock-out: Conditions for individual rationality of SPIs

(18:49) 2.1.a. Participation independence and foreknowledge independence

(23:02) 2.2. Implementation lock-out

(24:56) 2.3. Pitch for preserving SPI option value

(26:13) 3. Preparing for automation of SPI research

(30:54) Acknowledgments

(31:25) Appendix A: Relevant previous work on SPIs

(32:51) Appendix B: Technical definitions and examples

(32:57) B.1. General SPI definition

(36:09) B.1.1. Connection to Oesterheld’s justification gap

(37:26) B.2. Properties of full strategies involving SPIs

(40:34) B.3. Example: Surrogate goals and concession equivalence

(43:14) B.4. Example: Renegotiation

The original text contained 15 footnotes which were omitted from this narration.

---

First published:
April 20th, 2026

Source:
https://www.lesswrong.com/posts/YAie7SxrB28ZksLvE/clr-s-safe-pareto-improvements-research-agenda-1

---

Narrated by TYPE III AUDIO.

---