“Benchmarking Real Work” by kaivu, leni, rohuang, zef
Description
Thanks to Megan Kinniment for helpful comments and discussion.
TL;DR: Benchmarks like HCAST undersample fuzzy (hard-to-evaluate) tasks, meaning they might overestimate capability on long-horizon work. To sample fuzzy tasks, we need to increase judge capacity: we can either try to build automated judges that match human judgment, or reduce the human effort per grade. To that end, we propose generating fuzzy tasks as a byproduct of real SWE work: snapshot the repo and a proto-spec before starting, and after finishing, use an AI transform to produce an executable spec and LLM-judge conditions. Because the engineer just did the work, verifying the judges or grading the agent directly is much cheaper than grading the task from scratch. I think this would be a good way to collect tasks, as well as a useful personal epistemic tool.
This is a two-part series on capability evaluation. Part 1 is about acquiring fuzzy tasks, and part 2 is about analyzing them.
Motivation: sampling bias in HCAST
There are several well-described limitations of time horizons. But the strongest reason I don’t update much on trends in time horizons (and time-horizon-like tasks) is that I think all existing evaluations [...]
---
Outline:
(01:14) Motivation: sampling bias in HCAST
(02:47) Making fuzzy tasks sampling viable by increasing judge capacity
(04:02) Proposal: sampling from real work
(05:18) Advantages
(06:10) Discussion
(06:13) How inconvenient is this?
(06:32) Can we test fuzzy skills by just testing longer tasks?
The original text contained 3 footnotes which were omitted from this narration.
---
First published: May 16th, 2026
Source: https://www.lesswrong.com/posts/NbDjD47u6WmthgiDC/benchmarking-real-work
---
Narrated by TYPE III AUDIO.