Episode Details
“I underestimated AI capabilities (again)” by Ajeya
Description
Note: This post was crossposted from Planned Obsolescence by the Forum team, with the author's permission. The author may not see or respond to comments on this post.
Revisiting a prediction ten months early
On Jan 14th, I made predictions about AI progress in 2026. My forecasts for software engineering capabilities already feel much too conservative.
In my view, METR (where I now work) has some of the hardest and highest-quality software engineering and ML engineering benchmarks out there, and the most useful framework for making benchmark performance intuitive: we measure a task's difficulty by the amount of time a human expert would take to complete it (called the “time horizon”).[1]
When I made my forecasts last month, the model with the longest measured time horizon on METR's suite of software engineering tasks was Claude Opus 4.5; it could succeed around half the time at software tasks that would take a human software engineer about five hours.[2] Time horizons on software tasks had been doubling a little less than twice a year from 2019 through 2025, which would have implied the state-of-the-art 50% time horizon should be somewhat less than 20 hours by the end of 2026.[3] But there [...]
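The extrapolation above can be sketched in a few lines. This is a minimal illustration using the post's approximate figures (a roughly 5-hour 50% time horizon, and a doubling rate of at most twice per year — the post says "a little less than twice a year", so the true projection is somewhat below the number printed here):

```python
# Sketch of the trend extrapolation described above (assumed figures).
current_horizon_hours = 5.0   # ~5-hour 50% time horizon (Claude Opus 4.5)
doublings_per_year = 2.0      # upper bound; actual rate was slightly lower
years_ahead = 1.0             # projecting to the end of 2026

projected = current_horizon_hours * 2 ** (doublings_per_year * years_ahead)
print(f"Projected 50% time horizon: ~{projected:.0f} hours")  # ~20 hours
```

With two doublings the 5-hour horizon quadruples to 20 hours, matching the "somewhat less than 20 hours" ceiling the post derives.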
---
First published:
March 5th, 2026
Source:
https://forum.effectivealtruism.org/posts/h3zcCh49kLE6egrfz/i-underestimated-ai-capabilities-again
---
Narrated by TYPE III AUDIO.