Live leaderboard mirror: terminal-bench 2.1

Terminal-Bench Scoreboard

Which AI coding agent is actually winning? Terminal-Bench scores agents on hard, real command-line tasks - package management, builds, git, server config, shell scripting. This is the live ranking, mirrored here in one clean board so you (and your favorite LLM) can read it at a glance.

Scores are mirrored from the Terminal-Bench leaderboard at tbench.ai - a Stanford x Laude Institute benchmark. All credit for the benchmark and results goes to the Terminal-Bench team. Last updated Jun 19, 2026, 7:45 PM.

Full ranking

Top 13 entries on terminal-bench 2.1, best score first. Score is the share of tasks the agent solved; +/- is the 95% confidence margin.

#  | Agent                  | Model           | Organization   | Score          | Submitted
1  | Codex CLI (verified)   | GPT-5.5         | OpenAI         | 83.4% +/- 2.2  | May 1, 2026
2  | Claude Code (verified) | Claude 5 Fable  | Anthropic      | 83.1% +/- 2.0  | Jun 17, 2026
3  | Terminus 2 (verified)  | Claude 5 Fable  | Terminal-Bench | 80.4% +/- 2.3  | Jun 17, 2026
4  | Claude Code (verified) | Claude Opus 4.8 | Anthropic      | 78.9% +/- 2.5  | May 29, 2026
5  | Terminus 2 (verified)  | GPT-5.5         | Terminal-Bench | 78.2% +/- 2.4  | May 1, 2026
6  | Terminus 2 (verified)  | Claude Opus 4.8 | Terminal-Bench | 74.6% +/- 2.4  | May 29, 2026
7  | Terminus 2 (verified)  | Gemini 3 Pro    | Terminal-Bench | 74.4% +/- 2.6  | May 1, 2026
8  | Gemini CLI (verified)  | Gemini 3.1 Pro  | Google         | 70.7% +/- 2.9  | May 5, 2026
9  | Terminus 2 (verified)  | Gemini 3.1 Pro  | Terminal-Bench | 70.3% +/- 2.9  | May 5, 2026
10 | Claude Code (verified) | Claude Opus 4.7 | Anthropic      | 69.7% +/- 2.7  | May 1, 2026
11 | Gemini CLI (verified)  | Gemini 3 Pro    | Google         | 66.3% +/- 2.7  | May 2, 2026
12 | Terminus 2 (verified)  | Claude Opus 4.7 | Terminal-Bench | 66.1% +/- 2.7  | May 1, 2026
13 | Claude Code (verified) | GLM 5.1         | Anthropic      | 58.7% +/- 2.4  | May 2, 2026
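
A quick way to read the +/- column: if two entries' score ranges overlap, the board alone does not clearly separate them. The sketch below applies that rough check to the top five rows. The function names are illustrative, and interval overlap is a conservative rule of thumb, not a formal significance test and not something the Terminal-Bench team publishes.

```python
# Top five rows copied from the board: (agent, model, score %, 95% margin in points).
ENTRIES = [
    ("Codex CLI", "GPT-5.5", 83.4, 2.2),
    ("Claude Code", "Claude 5 Fable", 83.1, 2.0),
    ("Terminus 2", "Claude 5 Fable", 80.4, 2.3),
    ("Claude Code", "Claude Opus 4.8", 78.9, 2.5),
    ("Terminus 2", "GPT-5.5", 78.2, 2.4),
]


def interval(score, margin):
    """Return the (low, high) range implied by score +/- margin."""
    return (score - margin, score + margin)


def overlaps(a, b):
    """True if the score ranges of entries a and b share any point."""
    lo_a, hi_a = interval(a[2], a[3])
    lo_b, hi_b = interval(b[2], b[3])
    return lo_a <= hi_b and lo_b <= hi_a


def tied_with_leader(entries):
    """Entries whose range overlaps the first (top-ranked) entry's range."""
    leader = entries[0]
    return [e for e in entries[1:] if overlaps(leader, e)]


if __name__ == "__main__":
    for agent, model, score, margin in tied_with_leader(ENTRIES):
        print(f"{agent} + {model}: {score}% +/- {margin} overlaps the leader")
```

By this check, ranks 2 through 4 overlap the leader's range of 81.2 to 85.6, while rank 5 (top of range 80.6) does not.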

What Terminal-Bench measures

Terminal-Bench is a benchmark for AI agents working in a real terminal. Each task drops an agent into a sandboxed shell and asks it to get something done - fix a broken build, wrangle git, configure a server, write a script - then checks whether the end state is actually correct. Each score on the board above is the percentage of tasks an agent solved. It is one of the most realistic public tests of how well a coding agent can operate a computer, which is exactly what DevThrottle helps you do at scale.
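
To make the arithmetic concrete, here is a minimal sketch of turning pass/fail task outcomes into a score and a 95% margin. The function name and the sample outcomes are hypothetical. The margin uses the normal approximation for a proportion, which is an assumption: this page does not say how Terminal-Bench computes its published margins.

```python
import math


def score_with_margin(outcomes, z=1.96):
    """Return (score %, margin in points) for a list of pass/fail outcomes.

    Assumption: the margin is z * sqrt(p * (1 - p) / n), the normal
    approximation for a proportion. The benchmark's real method may differ.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one task outcome")
    p = sum(1 for passed in outcomes if passed) / n  # share of tasks solved
    margin = z * math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * margin


if __name__ == "__main__":
    # Hypothetical run: 80 tasks solved out of 100 attempts.
    outcomes = [True] * 80 + [False] * 20
    score, margin = score_with_margin(outcomes)
    print(f"{score:.1f}% +/- {margin:.1f}")
```

Under this approximation the margin shrinks with the square root of the number of attempts, so quadrupling the attempts halves it.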

Terminal-Bench is created and maintained by the Terminal-Bench team (a Stanford x Laude Institute collaboration). DevThrottle does not run these evaluations - we mirror the published leaderboard and link back to the source. The official Terminal-Bench leaderboard is at tbench.ai.

Run the winning agents - all from one control room.

DevThrottle orchestrates command-line coding agents across your machines. Your code never leaves.
