“Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate” by Jordan Taylor, Max H, Ed Fage, Thomas Read, Joseph Bloom
Description
Produced by UK AISI Model Transparency and Situational Awareness teams. If you’re a Research Scientist or Research Engineer, we’re hiring – apply here and come and work with us!
TL;DR
- We wrote a report on risks to AI oversight (auditing, monitoring, incident investigation), informed by interviews with many researchers (Figure 1 below) and by our own analysis.
- We find that many of the properties relied on for current oversight face a range of likely and potentially severe degradation pathways. Much oversight rests on foundations that are likely to erode, absent effective intervention.
- We give specific recommendations for measuring shifts in oversight-relevant properties, working to preserve oversight, and investing in emerging oversight techniques as fallbacks against continued degradation.
- The full report can be accessed at aisi.gov.uk/blog/will-it-become-harder-to-oversee-ai-systems.
Figure 1: A list of the experts interviewed to inform the content of this report. ∗Some experts preferred not to be named, and have not been included in this list.
My informal LessWrong blurb
This reflects only the personal view of Jordan Taylor, not the interviewed experts or UK AISI more broadly.
Right now, it seems we have pretty decent oversight. Not great, not terrible:
- When AIs deliberately do [...]
---
Outline:
(00:28) TL;DR
(01:38) My informal LessWrong blurb
(02:45) My recommendations for LessWrong readers
(03:59) Executive Summary
(06:34) Summary of degradation pathways
(06:57) Chain-of-thought reasoning is currently the most informative monitoring signal, but it is under significant pressure.
(07:34) Action-only monitoring provides a floor for oversight, but it is not sufficient on its own.
(08:01) Evaluation gaming is a growing threat to auditing.
(08:32) Changes in architecture for memory and learning could undermine oversight.
(09:02) White-box methods are a promising backstop, but are not yet mature enough to compensate for degradation elsewhere.
(09:44) Training-based approaches are promising but face fundamental challenges around generalisation.
(10:22) Expert disagreements
(11:40) Our Recommendations
The original text contained 17 footnotes which were omitted from this narration.
---
First published: May 21st, 2026
---
Narrated by TYPE III AUDIO.