When people ask which AI coding agent is "best," they usually mean: which one can actually get work done in a real terminal? Terminal-Bench is the benchmark built to answer exactly that. It is a project from a Stanford x Laude Institute collaboration, and its leaderboard has become one of the most-cited measures of agent capability.
What it measures
Each Terminal-Bench task drops an agent into a sandboxed shell and gives it a real job: fix a broken build, resolve a git mess, configure a server, write a script, manipulate files. The agent works the way a developer would - running commands and editing files - and at the end the benchmark checks whether the resulting state is actually correct. There is no partial credit for looking plausible; the task either passes its tests or it does not.
That design is what makes the score meaningful. It is not measuring whether a model can describe how to fix a build - it is measuring whether an agent can actually fix one.
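The all-or-nothing rule can be sketched in a few lines. This is only an illustration, not the benchmark's actual harness: the function name `task_passes` and the check names are hypothetical, and it assumes each task's tests reduce to a set of named true/false checks on the final state.

```python
from typing import Dict


def task_passes(check_results: Dict[str, bool]) -> bool:
    """Return True only if every check on the final state passed.

    Assumption: a task with no checks at all is treated as a failure,
    since nothing was verified.
    """
    return bool(check_results) and all(check_results.values())


# Hypothetical checks for a "fix a broken build" task.
almost_fixed = {"build_succeeds": True, "unit_tests_pass": False}
fully_fixed = {"build_succeeds": True, "unit_tests_pass": True}

print(task_passes(almost_fixed))  # one failing check fails the whole task
print(task_passes(fully_fixed))
```

A run that gets the build compiling but leaves a test failing scores the same as a run that did nothing, which is the point of the design.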
How to read a score
A leaderboard entry's headline number is its accuracy: the percentage of tasks the agent solved. An entry pairs an agent (the harness, such as Claude Code, Codex CLI, or Terminus) with a model (the LLM driving it), because the same model can score differently under different harnesses. You will also see a "plus or minus" figure - a confidence margin - because each agent is run across many trials and some variance is expected.
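To make the two numbers concrete, here is a minimal sketch of how an accuracy and a margin could be derived from repeated trials. The leaderboard's exact formula is not described here, so the margin below is an assumption: a normal-approximation interval (1.96 standard errors of the per-trial accuracy). The function name and the sample data are hypothetical.

```python
import math
import statistics
from typing import List, Tuple


def accuracy_with_margin(
    trials: List[List[bool]], z: float = 1.96
) -> Tuple[float, float]:
    """Return (mean accuracy in percent, margin in percentage points).

    Each inner list holds one trial's pass/fail result per task.
    Assumption: the margin is z standard errors of the per-trial accuracy.
    """
    per_trial = [100.0 * sum(t) / len(t) for t in trials]
    mean = statistics.mean(per_trial)
    if len(per_trial) < 2:
        return mean, 0.0  # a single trial gives no variance estimate
    stderr = statistics.stdev(per_trial) / math.sqrt(len(per_trial))
    return mean, z * stderr


# Made-up results: three trials over the same four tasks.
trials = [
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
]
accuracy, margin = accuracy_with_margin(trials)
print(f"{accuracy:.1f}% plus or minus {margin:.1f}")
```

With only a handful of trials the margin is wide, which is why entries with few runs deserve more caution than the headline number suggests.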
Two practical takeaways. First, compare agent-and-model pairs, not models alone - the harness matters. Second, treat small gaps as ties: if the ranges of two entries (score plus or minus margin) overlap, they are effectively even.
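The tie rule is easy to apply mechanically. The sketch below assumes one reasonable reading of it: two entries are tied when their ranges (accuracy plus or minus margin) overlap. The `Entry` type, the placeholder agent and model names, and the numbers are all hypothetical, not real leaderboard results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    agent: str       # the harness
    model: str       # the LLM driving it
    accuracy: float  # percent of tasks solved
    margin: float    # plus-or-minus figure, in percentage points


def effectively_tied(a: Entry, b: Entry) -> bool:
    """True if the two entries' ranges overlap.

    Assumption: overlap means the gap is no larger than the two margins combined.
    """
    return abs(a.accuracy - b.accuracy) <= a.margin + b.margin


# Placeholder entries with made-up numbers.
first = Entry("agent-a", "model-x", accuracy=52.0, margin=2.5)
second = Entry("agent-b", "model-x", accuracy=49.0, margin=2.0)
third = Entry("agent-a", "model-y", accuracy=40.0, margin=2.0)

print(effectively_tied(first, second))  # gap 3.0, combined margins 4.5
print(effectively_tied(first, third))   # gap 12.0, combined margins 4.5
```

Note that `first` and `second` share a model but differ in harness, which is why the comparison is made on pairs rather than on models.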
Versions
Terminal-Bench evolves. Newer versions (2.0, 2.1, and beyond) add harder, more realistic tasks, so scores are only comparable within the same version. When you read a leaderboard, check which version it is - a number from an older version is not directly comparable to a newer one.
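If you are working with scores programmatically, the same rule can be enforced by never ranking across versions. This is a minimal sketch: the dictionary keys (`version`, `agent`, `accuracy`) and the sample rows are assumptions for illustration, not the leaderboard's actual data format.

```python
from collections import defaultdict
from typing import Dict, List


def rank_within_version(entries: List[dict]) -> Dict[str, List[dict]]:
    """Group entries by benchmark version, then sort each group by accuracy.

    Entries from different versions are never placed in the same ranking.
    """
    by_version: Dict[str, List[dict]] = defaultdict(list)
    for entry in entries:
        by_version[entry["version"]].append(entry)
    return {
        version: sorted(group, key=lambda e: e["accuracy"], reverse=True)
        for version, group in by_version.items()
    }


# Made-up rows; the numbers are placeholders, not real results.
entries = [
    {"version": "2.0", "agent": "agent-a", "accuracy": 48.0},
    {"version": "2.1", "agent": "agent-a", "accuracy": 41.0},
    {"version": "2.0", "agent": "agent-b", "accuracy": 51.0},
]

for version, ranked in rank_within_version(entries).items():
    print(version, [e["agent"] for e in ranked])
```

In this made-up data, agent-a's lower number on 2.1 says nothing about regression; it was measured on a different task set.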
See the current standings
We keep a clean, always-current mirror of the latest Terminal-Bench leaderboard on our Terminal-Bench scoreboard, with full attribution and a link back to the official source. It is a fast way to see which agent-and-model combinations are leading right now, and there is a machine-readable JSON version for tools and research.
Once you have picked strong agents, the next question is running several of them at once - that is what DevThrottle is for, and we cover it in how to run multiple AI coding agents at once.