Many Benchmarks Scores Would Appear Much Higher If You Let The AIs Use Adequate Labor
Or: measured benchmark performance is often understated by not giving the AIs a chance to use their labor effectively.
Mid last year, Joel tweeted:
there’s a paper begging to be written called “most SWE-bench PRs would not be merged into main”
Our super-talented collaborator Parker Whitfill jumped on the case, leading our eventual post “Many SWE-bench-Passing PRs Would Not Be Merged into Main.”[1]
The next natural paper in this series (one that plainly states a directionally true and somewhat obvious claim, already clear to experts but widely under-appreciated) might be “Many Benchmarks Scores Would Appear Much Higher If You Let The AIs Use Adequate Labor.”[2]
Straight lines on graphs
In Joel’s previous post, he wrote about the unfair and frankly perplexing power of “straight lines on graphs.”
Here’s just one of many, many instances where one of us has felt humbled to have been beaten by the dumbest possible algorithm: understanding inference scaling on METR’s RE-Bench.
We had a quiet confidence before collecting data that these open-ended machine learning research engineering environments were exceptionally hard, far beyond the capability level of the AIs at the time (e.g. o1-preview, Sonnet 3.5). So when the first graphs came in showing persistent improvement over time[3], we all thought the curves would quickly bend.
It really seemed like the models must surely plateau with inference compute. They’re so derpy! Perhaps you can even notice a slight leveling off between 12 and 20 samples.
But, as has happened so many other times before and since, you collect more data and find that the line never quite ends up bending.[4]
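To make the sample-scaling setup concrete, here is a minimal sketch of a best-of-k scaffold: run some number of independent attempts in sequence and keep whichever scores highest. It assumes, as is typical for RE-Bench tasks, that each attempt yields a score the scaffold can observe. The names `best_of_k` and `toy_attempt` are hypothetical, and `toy_attempt` is a random stand-in for an agent run, not real benchmark data.

```python
import random
from typing import Callable, List, Tuple


def best_of_k(run_attempt: Callable[[int], float], k: int) -> Tuple[float, List[float]]:
    """Run k independent attempts in sequence and keep the best observed score.

    Returns the final best score and the running best after each attempt,
    which is the curve you would plot against the number of samples.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    best = float("-inf")
    running_best: List[float] = []
    for i in range(k):
        score = run_attempt(i)      # one independent attempt
        best = max(best, score)     # keep the highest-scoring solution so far
        running_best.append(best)
    return best, running_best


def toy_attempt(seed: int) -> float:
    """Hypothetical stand-in for one agent run: a noisy score in [0, 1)."""
    return random.Random(seed).random()


best, curve = best_of_k(toy_attempt, k=20)
```

The running-best curve can never decrease, so the interesting empirical question is only how quickly it flattens as k grows.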
RE-Bench is not the exception
Many other benchmarks likely also have the property that measured performance can significantly understate what performance would look like if AIs were allowed to use adequate labor.[5]
On PostTrainBench, agents get 10 hours on a single H100 GPU to post-train a base LLM on a given benchmark. But AIs typically terminate before the 10-hour limit (Sonnet 4.5 and GPT-5.x Codex models stop after around 2-4 hours on average), and the authors note that longer runs correlate with higher performance. We think it is likely that these AIs are underelicited. And, of course, even if AIs did work for the full 10 hours, their token spend would still be orders of magnitude less than the cost of the officially post-trained versions of each base model, which is the only comparison to humans that exists for this benchmark.
On OpenAI’s PaperBench, agents attempt to replicate 20 ICML 2024 papers from scratch. In the original paper, human machine learning PhDs (best of 3) achieve around 2x the score of the best AI agent, but humans get 4x the time budget. This difference in time budgets was perhaps a reasonable decision at the time of publication (April 2025), when the most capable models plateaued in performance[6] after just a few hours of work, but we expect April 2026 models could use a much larger budget more effectively.
On WeirdML, agents attack 19 non-standard machine learning engineering challenges. The benchmark author estimates that each task in the benchmark would probably take a human a few hours to complete (hundreds of dollars of labor), but agents spend at most a few dollars per run. It’s likely that recent AIs would do much better, possibly saturating the benchmark, if they were given a spend comparable to that of these hypothetical humans.
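One way to operationalize "adequate" labor across examples like these is a simple stopping rule: keep increasing the budget until the score plateaus or, if you care about cost, until the AI has spent more than a human would on the same task. The sketch below is only an illustration of that rule. The function name, the thresholds `min_gain` and `patience`, and all scores and dollar figures are hypothetical and not taken from any of the benchmarks above.

```python
from typing import Optional, Sequence, Tuple


def adequate_stop(
    scores: Sequence[float],
    costs: Sequence[float],
    human_cost: Optional[float] = None,
    min_gain: float = 0.01,
    patience: int = 2,
) -> Tuple[int, str]:
    """Find the budget step at which spending more stops being "adequate".

    scores[i] is the best score reached at cumulative cost costs[i].
    Stops at the first step whose cost exceeds human_cost (if given), or
    once the gain has stayed below min_gain for `patience` steps in a row.
    """
    if not scores or len(scores) != len(costs):
        raise ValueError("scores and costs must be non-empty and equal length")
    stale = 0
    for i in range(len(scores)):
        if human_cost is not None and costs[i] > human_cost:
            return i, "human_cost_exceeded"
        if i > 0:
            # Count consecutive steps with negligible improvement.
            stale = stale + 1 if scores[i] - scores[i - 1] < min_gain else 0
            if stale >= patience:
                return i, "plateau"
    # Neither condition triggered: the measured score is likely understated.
    return len(scores) - 1, "budget_exhausted"


# Hypothetical numbers: score keeps rising as spend quadruples each step.
rising = [0.20, 0.35, 0.45, 0.52, 0.58]
spend = [2, 8, 32, 128, 512]
capped = adequate_stop(rising, spend, human_cost=300)
uncapped = adequate_stop(rising, spend)
flat = adequate_stop([0.20, 0.40, 0.405, 0.41, 0.41], spend)
```

A result of `"budget_exhausted"` is the case this post is about: the evaluation ran out of budget while the score was still improving.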
Let the tokens flow
No shame to the benchmark authors — collecting many AI attempts on many tasks can get expensive quickly. It’s totally reasonable for authors to prioritize accordingly. But the state of the science of AI evaluations is worse off for them not being able to justify the expense.
If it were easier for benchmark authors to get dramatically more tokens, we would likely have benchmark scores that better represent the limits of model performance.
It is currently unclear to us at METR what the limits of token scaling are. Some tasks seem clearly better suited to token scaling — for instance, large and decomposable software projects, or ML experiments that reward trying many cheap options. But what circumscribes the types of tasks where token scaling doesn’t help? Perhaps there will be a quippy title in this idea too.
[1] The difference between “many” and “most” comes from normalizing to the rate at which human maintainers approve human-written PRs, which it turns out is a bunch lower than 100%.
[2] By “adequate” labor, we mean that the AI is allowed to spend tokens until its performance plateaus (or, if you care about cost, until it spends more than a human would to accomplish the same task).
[3] The graph below actually considers the number of samples; for a single AI attempt over time, you do in fact typically see a plateau on a log(time) scale. But you could trivially recreate the same performance from a greater number of shorter independent attempts: have a scaffold orchestrate some number of independent attempts in sequence, then take whichever solution achieves the highest score. This is possible in this setting because RE-Bench tasks typically have scores that the AI agent can observe live.
[4] The situation is actually a bunch worse than this. In many cases, the number of samples or wall-clock time taken isn’t the right comparison point if what you care about is economic decisions, automated AI R&D, and so on. Instead, you want to compare something like the all-in economic cost of AI agents vs. humans. And, by that comparison, the AIs would likely appear much more performant still.
[5] In some ways RE-Bench probably is easier for AIs relative to other possible tasks and their human time to complete, but not in ways that are so important for the argument we want to make here.





There are lots of reasons to expect logarithmic (at best) returns to algorithms like best-of-k: https://www.lesswrong.com/posts/qPX22TkjY7jkCavj6/better-than-logarithmic-returns-to-reasoning
For various reasons, I’d expect somewhat worse than logarithmic. These days I’m especially (and really only?) interested in things which scale better than that!