Android Bench | Android Developers

June 9th 2026 Android LLM Benchmark

Score (%): average percentage of 100 test cases successfully resolved across 10 runs for each model.
CI range (%): expected performance range, reflecting the results' statistical reliability (p-value < 0.05).
Avg latency (h): average time taken to solve 100 tasks across 10 runs.
Avg cost ($): average cost per full benchmark run.

Model	Score (%)	CI range (%)	Avg latency (h)	Avg cost ($)
GPT 5.5	74.0	66.9 — 80.6	15.7	$134.2
GPT 5.4	72.4	65.4 — 79.0	21.2	$91.7
Gemini 3.1 Pro Preview	72.4	65.2 — 79.0	11.1	$47.9
Claude Opus 4.7	68.7	61.0 — 76.0	11.6	$124.3
Claude Opus 4.6	66.6	59.2 — 74.0	9.9	$84.4
Gemini 3.5 Flash	63.7	56.3 — 70.7	14.2	$147.1
GLM 5.1	59.7	52.1 — 67.4	33.4	$46.7
Kimi K2.6	58.6	51.3 — 66.1	29.9	$42.5
Claude Sonnet 4.6	58.4	50.3 — 66.3	8.2	$40.4
DeepSeek V4 Pro	55.4	47.9 — 63.5	35.8	$13.7
Claude Sonnet 4.5	53.7	46.1 — 61.4	13.1	$61.0
DeepSeek V4 Flash	52.7	45.1 — 60.2	28.1	$8.4
MiMo 2.5 Pro	52.0	43.6 — 59.4	33.1	$74.5
Qwen 3.6 Max Preview	51.4	44.1 — 59.2	20.5	$222.4
Gemini 3 Flash Preview	42.0	36.2 — 48.2	16.5	$34.2
Qwen 3.6 27B	37.4	30.3 — 44.7	20.7	$64.6
MiniMax M2.7	37.2	30.2 — 44.3	20.3	$10.1
Gemma 4 31B IT	33.2	26.0 — 40.8	14.2	$2.5
Qwen 3.6 35B A3B	31.7	24.7 — 39.0	12.5	$10.7
Gemma 4 26B A4B IT	25.1	18.6 — 31.8	21.4	$3.3
GPT OSS 120B	18.9	13.3 — 24.7	25.9	$7.6
Qwen 3.5 9B	15.5	10.4 — 21.1	16.6	$15.6
GPT OSS 20B	2.4	1.1 — 3.8	3.8	$0.2

Latest results as of June 9th.
View archived leaderboards and check back periodically for updates.
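
The page does not say how the confidence range is computed. As an illustration only, the sketch below shows one common way to get a score and a 95% range from a grid of pass/fail results (10 runs by 100 tasks): a percentile bootstrap that resamples tasks. The function name `score_and_ci`, the `results` layout, and the choice of bootstrap are all assumptions, not the benchmark's documented method.

```python
import random

def score_and_ci(results, n_boot=2000, alpha=0.05, seed=0):
    """results[r][t] is True if run r resolved task t.

    Returns (score, low, high) in percent. The interval is a percentile
    bootstrap over tasks (an assumed method, not the benchmark's own).
    """
    n_runs, n_tasks = len(results), len(results[0])
    # Pass rate of each task, averaged over the runs.
    task_rates = [sum(run[t] for run in results) / n_runs for t in range(n_tasks)]
    score = 100 * sum(task_rates) / n_tasks

    rng = random.Random(seed)
    # Resample tasks with replacement and recompute the mean each time.
    boot = sorted(
        100 * sum(rng.choices(task_rates, k=n_tasks)) / n_tasks
        for _ in range(n_boot)
    )
    low = boot[int((alpha / 2) * n_boot)]
    high = boot[int((1 - alpha / 2) * n_boot) - 1]
    return score, low, high

# Hypothetical input: 74 of 100 tasks pass in every one of 10 runs.
demo = [[t < 74 for t in range(100)] for _ in range(10)]
score, low, high = score_and_ci(demo)
print(f"{score:.1f}% ({low:.1f} to {high:.1f})")
```

Resampling tasks rather than runs treats the 100 test cases as the main source of uncertainty. A different resampling unit, or an analytic interval, would give different bounds.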

May 18th 2026 Android LLM Benchmark

Model	Score (%)	CI range (%)	Avg latency (h)	Avg cost ($)
GPT 5.5	74.0	66.8 — 80.5	15.5	$133.9
GPT 5.4	72.4	65.4 — 79.3	21.2	$91.7
Gemini 3.1 Pro Preview	72.4	65.1 — 78.8	11.5	$49.0
Claude Opus 4.7	68.7	60.5 — 75.9	11.6	$124.3
GPT 5.3 Codex	67.7	59.9 — 75.6	11.2	$42.6
Claude Opus 4.6	66.6	59.1 — 74.1	9.9	$84.4
GPT 5.2 Codex	62.5	54.4 — 70.0	24.3	$121.9
Claude Opus 4.5	61.9	53.9 — 70.2	12.5	$102.5
Gemini 3 Pro Preview	60.4	52.3 — 67.7	9.8	$63.7
GLM 5.1	59.7	52.4 — 67.4	33.4	$46.7
Kimi K2.6	58.6	51.3 — 66.5	29.9	$42.5
Claude Sonnet 4.6	58.4	50.3 — 66.4	8.2	$40.4
DeepSeek V4 Pro	55.4	47.5 — 63.6	35.8	$13.7
Claude Sonnet 4.5	54.2	45.9 — 62.2	13.1	$60.3
DeepSeek V4 Flash	52.7	45.3 — 60.7	28.1	$8.4
MiMo 2.5 Pro	52.0	43.8 — 60.0	33.1	$74.5
Qwen 3.6 Max Preview	51.4	43.5 — 59.3	20.5	$222.4
Gemini 3 Flash Preview	42.0	36.6 — 47.3	16.5	$34.2
Qwen 3.6 27B	37.4	30.5 — 44.5	20.7	$64.6
MiniMax M2.7	37.2	30.3 — 44.9	20.3	$10.1
Gemma 4 31B IT	33.2	26.2 — 40.8	14.2	$2.5
Qwen 3.6 35B A3B	31.7	24.4 — 39.0	12.5	$10.7
Gemini 2.5 Pro	29.1	22.3 — 36.1	8.4	$35.8
Gemma 4 26B A4B IT	25.1	18.8 — 31.8	21.4	$3.3
GPT OSS 120B	18.9	13.1 — 25.1	25.9	$7.6
Gemini 2.5 Flash	15.9	10.7 — 21.1	4.9	$11.2
Qwen 3.5 9B	15.5	10.1 — 20.9	16.6	$15.6
GPT OSS 20B	2.4	1.2 — 3.9	3.8	$0.2
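
The score and cost columns can be combined into a rough cost per resolved task. The sketch below does that arithmetic for three rows copied from the table above. It assumes the average cost covers one full run of 100 tasks and that the score equals the average number of tasks resolved per run. The `Entry` class and its method are hypothetical helpers, not part of the benchmark.

```python
from dataclasses import dataclass

TASKS_PER_RUN = 100  # from the column description: 100 test cases per run

@dataclass
class Entry:
    model: str
    score_pct: float  # Score (%)
    cost_usd: float   # Avg cost ($) per full benchmark run

    def cost_per_resolved_task(self) -> float:
        # Average number of tasks resolved in one run.
        resolved = self.score_pct / 100 * TASKS_PER_RUN
        return self.cost_usd / resolved

# Values taken from the May 18th table.
entries = [
    Entry("GPT 5.5", 74.0, 133.9),
    Entry("Gemini 3.1 Pro Preview", 72.4, 49.0),
    Entry("DeepSeek V4 Flash", 52.7, 8.4),
]

# Cheapest per resolved task first.
ranked = sorted(entries, key=lambda e: e.cost_per_resolved_task())
for e in ranked:
    print(f"{e.model}: ${e.cost_per_resolved_task():.2f} per resolved task")
```

This is a derived figure, not one the leaderboard reports, and it ignores latency and the width of each model's confidence range.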

Latest results as of May 18th.

May 5th 2026 Android LLM Benchmark

Model	Score (%)	CI range (%)
GPT 5.5	74.0	66.8 — 80.5
GPT 5.4	72.4	65.4 — 79.3
Gemini 3.1 Pro Preview	72.4	65.1 — 78.8
Claude Opus 4.7	68.7	61.2 — 76.0
GPT 5.3 Codex	67.7	59.9 — 75.1
Claude Opus 4.6	66.6	59.5 — 73.9
GPT 5.2 Codex	62.5	54.6 — 70.1
Claude Opus 4.5	61.9	53.0 — 70.1
Gemini 3 Pro Preview	60.4	52.3 — 68.2
Claude Sonnet 4.6	58.4	50.4 — 66.5
Claude Sonnet 4.5	53.8	45.5 — 62.2
Gemini 3 Flash Preview	42.0	36.5 — 47.6
Gemini 2.5 Flash	16.7	11.5 — 22.1

Latest results as of May 5th.

April 7th 2026 Android LLM Benchmark

Model	Score (%)	CI range (%)
GPT-5.4	72.4	65.1 — 79.3
Gemini 3.1 Pro Preview	72.4	64.8 — 79.3
GPT-5.3-Codex	67.7	60.1 — 74.8
Claude Opus 4.6	66.6	58.5 — 74.0
GPT-5.2-Codex	62.5	54.8 — 69.8
Claude Opus 4.5	61.9	53.8 — 70.3
Gemini 3 Pro Preview	60.4	52.4 — 68.1
Claude Sonnet 4.6	58.4	50.9 — 66.5
Claude Sonnet 4.5	54.2	46.0 — 62.1
Gemini 3 Flash Preview	42.0	36.4 — 47.7
Gemini 2.5 Flash	16.1	11.2 — 21.2

Latest results as of April 7th.

March 5th 2026 Android LLM Benchmark

Model	Score (%)	CI range (%)
Gemini 3.1 Pro Preview	72.4	65.3 — 79.8
Claude Opus 4.6	66.6	58.9 — 73.9
GPT-5.2-Codex	62.5	54.7 — 70.3
Claude Opus 4.5	61.9	53.9 — 69.6
Gemini 3 Pro Preview	60.4	52.6 — 67.8
Claude Sonnet 4.6	58.4	51.1 — 66.6
Claude Sonnet 4.5	54.2	45.5 — 62.4
Gemini 3 Flash Preview	42.0	36.3 — 47.9
Gemini 2.5 Flash	16.1	10.9 — 21.9

Latest results as of March 5th.