Android Bench

Several benchmarks have emerged to measure how well LLMs handle AI-assisted software engineering. Android developers face specific challenges that those benchmarks don't cover, so we created one focused on high-quality Android development.

Notice: We've updated Android Bench. To learn more about the updates, check out our blog and methodology.

Score (%): Average percentage of 100 test cases successfully resolved across 10 runs for each model.
CI range (%): Expected performance range, reflecting the results' statistical reliability (p-value < 0.05).
Avg latency (h): Average time taken to solve 100 tasks across 10 runs.
Avg cost ($): Average cost per full benchmark run.

Model	Score (%)	CI range (%)	Avg latency (h)	Avg cost ($)
Claude Fable 5	84.5	79.9 — 88.8	8.0	$133.2
GPT 5.5	80.2	73.5 — 86.6	11.4	$138.3
Claude Sonnet 5	76.2	69.0 — 82.1	12.3	$99.9
GPT 5.4	74.1	66.0 — 80.9	8.4	$83.4
Gemini 3.1 Pro Preview	73.7	66.1 — 80.4	10.6	$87.4
Claude Opus 4.8	72.4	65.8 — 79.3	6.7	$88.0
GLM 5.2	72.2	65.3 — 78.7	38.9	$117.0
Gemini 3.5 Flash	71.1	63.6 — 78.2	28.3	$165.6
Kimi K2.7 Code	70.4	63.2 — 77.0	31.8	$48.1
Claude Opus 4.7	68.7	60.9 — 76.4	7.0	$96.5
Kimi K2.6	67.6	60.2 — 74.3	57.2	$49.4
Claude Sonnet 4.6	67.0	58.3 — 75.4	16.9	$127.6
MiniMax M3	63.6	56.3 — 70.3	26.0	$41.7
GLM 5.1	63.2	56.0 — 71.3	17.6	$53.5
Gemini 3 Flash Preview	62.5	54.2 — 70.0	13.1	$30.1
MiMo-V2.5-Pro	60.8	53.1 — 68.3	13.6	$9.2
Deepseek V4 Pro	59.5	51.7 — 66.9	9.0	$3.7
Qwen 3.7 Plus	57.7	49.5 — 65.5	18.5	$18.6
Deepseek V4 Flash	54.7	46.6 — 62.8	8.9	$1.5
Qwen 3.7 Max	54.2	46.3 — 61.8	14.2	$58.3
Qwen 3.6 27B	45.1	38.2 — 53.0	25.8	$97.3
MiniMax M2.7	41.6	34.4 — 49.0	18.2	$14.9
Qwen 3.6 35B A3B	37.0	29.5 — 44.4	16.3	$17.8
Gemma 4 31B IT	36.3	29.3 — 43.2	38.9	$10.6
Gemma 4 26B A4B IT	25.1	18.6 — 31.8	21.4	$3.3
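
One way to read the score and cost columns together is to divide the average cost of a run by the average number of tasks resolved in that run. The sketch below does this for four rows copied from the table. The metric and the function names `cost_per_resolved_task` and `rank_by_cost_efficiency` are illustrative assumptions; the leaderboard itself does not report this figure.

```python
# Rows copied from the leaderboard: (model, score %, avg cost per run in $).
ROWS = [
    ("Claude Fable 5", 84.5, 133.2),
    ("GPT 5.5", 80.2, 138.3),
    ("Kimi K2.7 Code", 70.4, 48.1),
    ("Deepseek V4 Pro", 59.5, 3.7),
]

TASKS_PER_RUN = 100  # each benchmark run covers 100 test cases


def cost_per_resolved_task(score_pct: float, cost_usd: float) -> float:
    """Hypothetical derived metric: dollars spent per task resolved in a run."""
    resolved = score_pct / 100 * TASKS_PER_RUN
    return cost_usd / resolved


def rank_by_cost_efficiency(rows):
    """Return (model, $ per resolved task) pairs, cheapest first."""
    pairs = [(model, cost_per_resolved_task(score, cost))
             for model, score, cost in rows]
    return sorted(pairs, key=lambda pair: pair[1])


if __name__ == "__main__":
    for model, usd in rank_by_cost_efficiency(ROWS):
        print(f"{model}: ${usd:.2f} per resolved task")
```

This ranking ignores latency and the width of the CI range, so it is best treated as one lens on the table and not as a replacement for the score.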

Latest results as of July 8th.
View archived leaderboards and check back periodically for updates.
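
The page does not state how the CI range is computed. As a minimal sketch of one common approach, the code below estimates a 95% interval with a percentile bootstrap over per-task pass rates. The bootstrap method, the function names, and the synthetic data are all assumptions for illustration; they are not the benchmark's documented procedure or its results.

```python
import random
import statistics


def bootstrap_ci(task_rates, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (in %) for the mean of per-task pass rates.

    Assumption: tasks are resampled with replacement; this is not
    necessarily the method used by the leaderboard.
    """
    rng = random.Random(seed)
    n = len(task_rates)
    means = []
    for _ in range(n_resamples):
        sample = [task_rates[rng.randrange(n)] for _ in range(n)]
        means.append(100 * statistics.fmean(sample))
    means.sort()
    lo = means[round((alpha / 2) * n_resamples)]
    hi = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def make_hypothetical_rates(n_tasks=100, n_runs=10, seed=1):
    """Synthetic data: fraction of n_runs in which each task was resolved."""
    rng = random.Random(seed)
    return [rng.randint(0, n_runs) / n_runs for _ in range(n_tasks)]


if __name__ == "__main__":
    rates = make_hypothetical_rates()
    score = 100 * statistics.fmean(rates)
    lo, hi = bootstrap_ci(rates)
    print(f"score={score:.1f}%  95% CI=({lo:.1f}, {hi:.1f})")
```

Resampling tasks instead of runs reflects the idea that the 100 test cases are a sample of possible Android tasks; a different choice of resampling unit would give a different interval width.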

Latest Updates

Track the latest AI model benchmarks, newly introduced agent architectures, and continuous performance evaluations on the platform. Stay updated with our routine methodology updates and release logs.

View all updates

New updates • Jul 8th
Dataset available on Harbor
New models • Jul 8th
Claude Fable 5, Claude Sonnet 5, Claude Opus 4.8
New models • Jul 8th
Qwen 3.7 Max, Qwen 3.7 Plus
New models • Jul 8th
GLM 5.2
New models • Jul 8th
Kimi K2.7 Code
New models • Jul 8th
MiniMax M3
Archived models • Jul 8th
Claude Opus 4.6, Claude Sonnet 4.5
Archived models • Jul 8th
GPT OSS 120B, GPT OSS 20B
Archived models • Jul 8th
Qwen 3.5 9B, Qwen 3.6 Max Preview
New updates • Jul 8th
We have migrated our benchmark framework to Harbor
New updates • Jul 8th
We've updated Android Bench

New models • Jun 9th
Gemini 3.5 Flash
Archived models • Jun 9th
Claude Opus 4.6, Claude Opus 4.5
Archived models • Jun 9th
GPT 5.3 Codex, GPT 5.2 Codex
Archived models • Jun 9th
Gemini 3 Pro Preview, Gemini 2.5 Pro, Gemini 2.5 Flash
New updates • Jun 9th
See our new Archive page

Learn more about Android Bench

Our methodology

Android best practices

Harbor dataset