Leaderboard Archive
To view historical performance data for previous models, select a version below.
-
May 18th 2026 Android LLM Benchmark
Score (%): Average percentage of 100 test cases successfully resolved across 10 runs for each model.
CI range (%): Expected performance range, reflecting the results' statistical reliability (p-value < 0.05).
Avg latency (h): Average time taken to solve 100 tasks across 10 runs.
Avg total tokens (M): Average token consumption for a full benchmark run (100 tasks) across 10 runs.
Avg cost ($): Average cost per full benchmark run.

| Model | Score (%) | CI range (%) | Avg latency (h) | Avg total tokens (M) | Avg cost ($) | Date |
|---|---|---|---|---|---|---|
| GPT 5.5 | 74.0 | 66.8 - 80.5 | 15.5 | 64.5 | $133.9 | 2026-04-27 |
| GPT 5.4 | 72.4 | 65.4 - 79.3 | 21.2 | 64.2 | $91.7 | 2026-03-16 |
| Gemini 3.1 Pro Preview | 72.4 | 65.1 - 78.8 | 11.5 | 75.4 | $49.0 | 2026-02-27 |
| Claude Opus 4.7 | 68.7 | 60.5 - 75.9 | 11.6 | 90.0 | $124.3 | 2026-04-27 |
| GPT 5.3 Codex | 67.7 | 59.9 - 75.6 | 11.2 | 71.4 | $42.6 | 2026-03-18 |
| Claude Opus 4.6 | 66.6 | 59.1 - 74.1 | 9.9 | 69.5 | $84.4 | 2026-02-26 |
| GPT 5.2 Codex | 62.5 | 54.4 - 70.0 | 24.3 | 124.4 | $121.9 | 2026-02-27 |
| Claude Opus 4.5 | 61.9 | 53.9 - 70.2 | 12.5 | 79.8 | $102.5 | 2026-02-26 |
| Gemini 3 Pro Preview | 60.4 | 52.3 - 67.7 | 9.8 | 117.0 | $63.7 | 2026-02-27 |
| GLM 5.1 | 59.7 | 52.4 - 67.4 | 33.4 | 80.2 | $46.7 | 2026-05-08 |
| Claude Sonnet 4.6 | 58.4 | 50.3 - 66.4 | 8.2 | 47.9 | $40.4 | 2026-03-01 |
| Kimi K2.6 | 58.6 | 51.3 - 66.5 | 29.9 | 94.3 | $42.5 | 2026-05-10 |
| DeepSeek V4 Pro | 55.4 | 47.5 - 63.6 | 35.8 | 132.7 | $13.7 | 2026-05-08 |
| Claude Sonnet 4.5 | 54.2 | 45.9 - 62.2 | 13.1 | 92.9 | $60.3 | 2026-02-26 |
| DeepSeek V4 Flash | 52.7 | 45.3 - 60.7 | 28.1 | 164.7 | $8.4 | 2026-05-11 |
| MiMo 2.5 Pro | 52.0 | 43.8 - 60.0 | 33.1 | 97.5 | $74.5 | 2026-05-09 |
| Qwen 3.6 Max Preview | 51.4 | 43.5 - 59.3 | 20.5 | 103.0 | $222.4 | 2026-05-07 |
| Gemini 3 Flash Preview | 42.0 | 36.6 - 47.3 | 16.5 | 148.0 | $34.2 | 2026-02-26 |
| MiniMax M2.7 | 37.2 | 30.3 - 44.9 | 20.3 | 128.3 | $10.1 | 2026-05-01 |
| Qwen 3.6 27B | 37.4 | 30.5 - 44.5 | 20.7 | 112.3 | $64.6 | 2026-05-05 |
| Gemma 4 31B IT | 33.2 | 26.2 - 40.8 | 14.2 | 29.5 | $2.5 | 2026-05-01 |
| Qwen 3.6 35B A3B | 31.7 | 24.4 - 39.0 | 12.5 | 113.4 | $10.7 | 2026-05-05 |
| Gemini 2.5 Pro | 29.1 | 22.3 - 36.1 | 8.4 | 37.9 | $35.8 | 2026-03-02 |
| Gemma 4 26B A4B IT | 25.1 | 18.8 - 31.8 | 21.4 | 77.2 | $3.3 | 2026-05-01 |
| GPT OSS 120B | 18.9 | 13.1 - 25.1 | 25.9 | 122.7 | $7.6 | 2026-05-09 |
| Gemini 2.5 Flash | 15.9 | 10.7 - 21.1 | 4.9 | 108.8 | $11.2 | 2026-02-26 |
| Qwen 3.5 9B | 15.5 | 10.1 - 20.9 | 16.6 | 181.4 | $15.6 | 2026-05-07 |
| GPT OSS 20B | 2.4 | 1.2 - 3.9 | 3.8 | 12.0 | $0.2 | 2026-05-11 |

Latest results as of May 18th 2026: This refresh includes open-weight models, adding new columns for latency, tokens, and cost.
-
May 5th 2026 Android LLM Benchmark
Score (%): Average percentage of 100 test cases successfully resolved across 10 runs for each model.
CI range (%): Expected performance range, reflecting the results' statistical reliability (p-value < 0.05).

| Model | Score (%) | CI range (%) | Date |
|---|---|---|---|
| GPT 5.5 | 74.0 | 66.8 - 80.5 | 2026-04-27 |
| GPT 5.4 | 72.4 | 65.4 - 79.3 | 2026-03-16 |
| Gemini 3.1 Pro Preview | 72.4 | 65.1 - 78.8 | 2026-02-27 |
| Claude Opus 4.7 | 68.7 | 61.2 - 76.0 | 2026-04-27 |
| GPT 5.3 Codex | 67.7 | 59.9 - 75.1 | 2026-03-18 |
| Claude Opus 4.6 | 66.6 | 59.5 - 73.9 | 2026-02-26 |
| GPT 5.2 Codex | 62.5 | 54.6 - 70.1 | 2026-02-26 |
| Claude Opus 4.5 | 61.9 | 53.0 - 70.1 | 2026-02-26 |
| Gemini 3 Pro Preview | 60.4 | 52.3 - 68.2 | 2026-02-27 |
| Claude Sonnet 4.6 | 58.4 | 50.4 - 66.5 | 2026-02-27 |
| Claude Sonnet 4.5 | 53.8 | 45.5 - 62.2 | 2026-02-26 |
| Gemini 3 Flash Preview | 42.0 | 36.5 - 47.6 | 2026-02-26 |
| Gemini 2.5 Flash | 16.7 | 11.5 - 22.1 | 2026-02-26 |

Latest results as of May 5th 2026: This refresh includes the addition of GPT-5.5 and Claude Opus 4.7.
-
April 7th 2026 Android LLM Benchmark
Score (%): Average percentage of 100 test cases successfully resolved across 10 runs for each model.
CI range (%): Expected performance range, reflecting the results' statistical reliability (p-value < 0.05).

| Model | Score (%) | CI range (%) | Date |
|---|---|---|---|
| GPT-5.4 | 72.4 | 65.1 - 79.3 | 2026-03-16 |
| Gemini 3.1 Pro Preview | 72.4 | 64.8 - 79.3 | 2026-02-27 |
| GPT-5.3-Codex | 67.7 | 60.1 - 74.8 | 2026-03-18 |
| Claude Opus 4.6 | 66.6 | 58.5 - 74.0 | 2026-02-26 |
| GPT-5.2-Codex | 62.5 | 54.8 - 69.8 | 2026-02-26 |
| Claude Opus 4.5 | 61.9 | 53.8 - 70.3 | 2026-02-26 |
| Gemini 3 Pro Preview | 60.4 | 52.4 - 68.1 | 2026-02-27 |
| Claude Sonnet 4.6 | 58.4 | 50.9 - 66.5 | 2026-02-27 |
| Claude Sonnet 4.5 | 54.2 | 46.0 - 62.1 | 2026-02-26 |
| Gemini 3 Flash Preview | 42.0 | 36.4 - 47.7 | 2026-02-26 |
| Gemini 2.5 Flash | 16.1 | 11.2 - 21.2 | 2026-02-26 |

Latest results as of April 7th 2026: This refresh includes the addition of GPT-5.4 and GPT-5.3-Codex.
-
March 5th 2026 Android LLM Benchmark
Score (%): Average percentage of 100 test cases successfully resolved across 10 runs for each model.
CI range (%): Expected performance range, reflecting the results' statistical reliability (p-value < 0.05).

| Model | Score (%) | CI range (%) | Date |
|---|---|---|---|
| Gemini 3.1 Pro Preview | 72.4 | 65.3 - 79.8 | 2026-02-27 |
| Claude Opus 4.6 | 66.6 | 58.9 - 73.9 | 2026-02-26 |
| GPT-5.2-Codex | 62.5 | 54.7 - 70.3 | 2026-02-26 |
| Claude Opus 4.5 | 61.9 | 53.9 - 69.6 | 2026-02-26 |
| Gemini 3 Pro Preview | 60.4 | 52.6 - 67.8 | 2026-02-27 |
| Claude Sonnet 4.6 | 58.4 | 51.1 - 66.6 | 2026-02-27 |
| Claude Sonnet 4.5 | 54.2 | 45.5 - 62.4 | 2026-02-26 |
| Gemini 3 Flash Preview | 42.0 | 36.3 - 47.9 | 2026-02-26 |
| Gemini 2.5 Flash | 16.1 | 10.9 - 21.9 | 2026-02-26 |

Latest results as of March 5th 2026.
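
The score column is described as the average percentage of 100 test cases resolved across 10 runs, paired with a confidence range at the 0.05 level. The sketch below illustrates how such numbers could be derived from per-task pass/fail outcomes. The leaderboard does not state how its range is computed, so the percentile bootstrap over tasks used here is an assumption, and the function names and the synthetic `hypothetical_outcomes` data are illustrative only, not taken from the benchmark.

```python
import random

TASKS_PER_RUN = 100  # from the column description: 100 test cases
NUM_RUNS = 10        # from the column description: 10 runs per model


def mean_score(outcomes):
    """Average percentage of tasks resolved, taken across runs.

    outcomes[r][t] is 1 if run r resolved task t, else 0.
    """
    per_run = [100.0 * sum(run) / len(run) for run in outcomes]
    return sum(per_run) / len(per_run)


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean score.

    ASSUMPTION: tasks are resampled with replacement; the leaderboard
    does not document its actual method.
    """
    rng = random.Random(seed)
    n_runs = len(outcomes)
    n_tasks = len(outcomes[0])
    # Pass rate of each task, averaged over all runs.
    task_rates = [sum(run[t] for run in outcomes) / n_runs for t in range(n_tasks)]
    stats = sorted(
        100.0 * sum(rng.choices(task_rates, k=n_tasks)) / n_tasks
        for _ in range(n_resamples)
    )
    lo_idx = int(alpha / 2 * n_resamples)
    return stats[lo_idx], stats[n_resamples - 1 - lo_idx]


def hypothetical_outcomes():
    """Synthetic data (not real benchmark results).

    Tasks 0-59 always pass, tasks 60-79 pass only in even-numbered
    runs, and tasks 80-99 never pass.
    """
    outcomes = []
    for r in range(NUM_RUNS):
        run = []
        for t in range(TASKS_PER_RUN):
            if t < 60:
                run.append(1)
            elif t < 80:
                run.append(1 if r % 2 == 0 else 0)
            else:
                run.append(0)
        outcomes.append(run)
    return outcomes


outcomes = hypothetical_outcomes()
score = mean_score(outcomes)
low, high = bootstrap_ci(outcomes)
print(f"score={score:.1f}%  range=({low:.1f}, {high:.1f})")
```

With only 100 tasks, an interval built this way is typically several points wide on each side of the score, which is worth keeping in mind when comparing models whose ranges in the tables overlap.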