Which AI coding agent is actually winning? Terminal-Bench scores agents on hard, real command-line tasks - package management, builds, git, server config, shell scripting. This is the live ranking, mirrored here in one clean board so you (and your favorite LLM) can read it at a glance.
Scores are mirrored from the Terminal-Bench leaderboard at tbench.ai - a Stanford x Laude Institute benchmark. All credit for the benchmark and results goes to the Terminal-Bench team. Last updated Jun 19, 2026, 7:45 PM.
Top 13 entries on Terminal-Bench 2.1, best score first. Score is the share of tasks the agent solved; +/- is the 95% confidence margin.
| # | Agent | Model | Organization | Score | Submitted |
|---|---|---|---|---|---|
| 1 | Codex CLI (verified) | GPT-5.5 | OpenAI | 83.4% +/- 2.2 | May 1, 2026 |
| 2 | Claude Code (verified) | Claude 5 Fable | Anthropic | 83.1% +/- 2.0 | Jun 17, 2026 |
| 3 | Terminus 2 (verified) | Claude 5 Fable | Terminal-Bench | 80.4% +/- 2.3 | Jun 17, 2026 |
| 4 | Claude Code (verified) | Claude Opus 4.8 | Anthropic | 78.9% +/- 2.5 | May 29, 2026 |
| 5 | Terminus 2 (verified) | GPT-5.5 | Terminal-Bench | 78.2% +/- 2.4 | May 1, 2026 |
| 6 | Terminus 2 (verified) | Claude Opus 4.8 | Terminal-Bench | 74.6% +/- 2.4 | May 29, 2026 |
| 7 | Terminus 2 (verified) | Gemini 3 Pro | Terminal-Bench | 74.4% +/- 2.6 | May 1, 2026 |
| 8 | Gemini CLI (verified) | Gemini 3.1 Pro | | 70.7% +/- 2.9 | May 5, 2026 |
| 9 | Terminus 2 (verified) | Gemini 3.1 Pro | Terminal-Bench | 70.3% +/- 2.9 | May 5, 2026 |
| 10 | Claude Code (verified) | Claude Opus 4.7 | Anthropic | 69.7% +/- 2.7 | May 1, 2026 |
| 11 | Gemini CLI (verified) | Gemini 3 Pro | | 66.3% +/- 2.7 | May 2, 2026 |
| 12 | Terminus 2 (verified) | Claude Opus 4.7 | Terminal-Bench | 66.1% +/- 2.7 | May 1, 2026 |
| 13 | Claude Code (verified) | GLM 5.1 | Anthropic | 58.7% +/- 2.4 | May 2, 2026 |
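
The +/- margins matter when you read the ranking: two entries whose ranges overlap are hard to tell apart from the scores alone. Here is a minimal Python sketch of that check, using three rows from the table. The `Entry` class and `intervals_overlap` helper are hypothetical names made up for this illustration, and it assumes the margin is expressed in percentage points. Overlapping ranges are a rough reading aid, not a formal significance test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure for illustration)."""
    agent: str
    model: str
    score: float   # percent of tasks solved
    margin: float  # 95% confidence margin, assumed to be in percentage points

    @property
    def low(self) -> float:
        return self.score - self.margin

    @property
    def high(self) -> float:
        return self.score + self.margin


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """Return True when the two score ranges share at least one point."""
    return a.low <= b.high and b.low <= a.high


# Values copied from the table above.
codex = Entry("Codex CLI", "GPT-5.5", 83.4, 2.2)
claude = Entry("Claude Code", "Claude 5 Fable", 83.1, 2.0)
gemini = Entry("Gemini CLI", "Gemini 3 Pro", 66.3, 2.7)

print(intervals_overlap(codex, claude))  # True: 81.2-85.6 vs 81.1-85.1
print(intervals_overlap(codex, gemini))  # False: 81.2-85.6 vs 63.6-69.0
```

Read this way, the top two entries are effectively tied, while the gap between rank 1 and rank 11 is well outside either margin.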
Terminal-Bench is a benchmark for AI agents working in a real terminal. Each task drops an agent into a sandboxed shell and asks it to get something done - fix a broken build, wrangle git, configure a server, write a script - then checks whether the end state is actually correct. The score in the table above is the percentage of tasks an agent solved. It is one of the most realistic public tests of how well a coding agent can operate a computer, which is exactly what DevThrottle helps you do at scale.
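
To make "percentage of tasks solved" concrete, here is a small Python sketch that turns a list of pass/fail outcomes into a score with a margin. This is not Terminal-Bench's actual scoring code: the function name `score_with_margin` is hypothetical, the outcomes are made-up sample data, and the margin uses a plain normal approximation, which is an assumption. The official leaderboard may compute its margins differently.

```python
import math


def score_with_margin(outcomes: list[bool], z: float = 1.96) -> tuple[float, float]:
    """Return (score, margin) in percent for a list of pass/fail task outcomes.

    The margin is a normal-approximation 95% interval half-width when z = 1.96.
    This is an assumed method for illustration, not the benchmark's own.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one task outcome")
    p = sum(outcomes) / n                      # share of tasks solved
    margin = z * math.sqrt(p * (1 - p) / n)    # half-width of the interval
    return 100 * p, 100 * margin


# Made-up sample: 8 tasks solved out of 10.
outcomes = [True] * 8 + [False] * 2
score, margin = score_with_margin(outcomes)
print(f"{score:.1f}% +/- {margin:.1f}")  # 80.0% +/- 24.8
```

With only ten tasks the margin is huge; more tasks or repeated runs are what shrink it.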
Terminal-Bench is created and maintained by the Terminal-Bench team (a Stanford x Laude Institute collaboration). DevThrottle does not run these evaluations - we mirror the published leaderboard and link back to the source. Visit the official Terminal-Bench leaderboard at tbench.ai.
DevThrottle orchestrates command-line coding agents across your machines. Your code never leaves.