Leaderboard Archive

To view historical performance data for previous models, select a version below.
  • May 18th 2026 Android LLM Benchmark

    Columns:
    Score (%): average percentage of 100 test cases successfully resolved across 10 runs for each model
    CI range (%): expected performance range, reflecting the results' statistical reliability (p-value < 0.05)
    Avg latency (h): average time taken to solve 100 tasks across 10 runs
    Avg total tokens (M): average token consumption for a full benchmark run (100 tasks) across 10 runs
    Avg cost ($): average cost per full benchmark run

    Model                   Score (%)  CI range (%)  Avg latency (h)  Avg total tokens (M)  Avg cost ($)  Date
    GPT 5.5                 74.0       66.8–80.5     15.5             64.5                  $133.9        2026-04-27
    GPT 5.4                 72.4       65.4–79.3     21.2             64.2                  $91.7         2026-03-16
    Gemini 3.1 Pro Preview  72.4       65.1–78.8     11.5             75.4                  $49.0         2026-02-27
    Claude Opus 4.7         68.7       60.5–75.9     11.6             90.0                  $124.3        2026-04-27
    GPT 5.3 Codex           67.7       59.9–75.6     11.2             71.4                  $42.6         2026-03-18
    Claude Opus 4.6         66.6       59.1–74.1     9.9              69.5                  $84.4         2026-02-26
    GPT 5.2 Codex           62.5       54.4–70.0     24.3             124.4                 $121.9        2026-02-27
    Claude Opus 4.5         61.9       53.9–70.2     12.5             79.8                  $102.5        2026-02-26
    Gemini 3 Pro Preview    60.4       52.3–67.7     9.8              117.0                 $63.7         2026-02-27
    GLM 5.1                 59.7       52.4–67.4     33.4             80.2                  $46.7         2026-05-08
    Claude Sonnet 4.6       58.4       50.3–66.4     8.2              47.9                  $40.4         2026-03-01
    Kimi K2.6               58.6       51.3–66.5     29.9             94.3                  $42.5         2026-05-10
    DeepSeek V4 Pro         55.4       47.5–63.6     35.8             132.7                 $13.7         2026-05-08
    Claude Sonnet 4.5       54.2       45.9–62.2     13.1             92.9                  $60.3         2026-02-26
    DeepSeek V4 Flash       52.7       45.3–60.7     28.1             164.7                 $8.4          2026-05-11
    MiMo 2.5 Pro            52.0       43.8–60.0     33.1             97.5                  $74.5         2026-05-09
    Qwen 3.6 Max Preview    51.4       43.5–59.3     20.5             103.0                 $222.4        2026-05-07
    Gemini 3 Flash Preview  42.0       36.6–47.3     16.5             148.0                 $34.2         2026-02-26
    MiniMax M2.7            37.2       30.3–44.9     20.3             128.3                 $10.1         2026-05-01
    Qwen 3.6 27B            37.4       30.5–44.5     20.7             112.3                 $64.6         2026-05-05
    Gemma 4 31B IT          33.2       26.2–40.8     14.2             29.5                  $2.5          2026-05-01
    Qwen 3.6 35B A3B        31.7       24.4–39.0     12.5             113.4                 $10.7         2026-05-05
    Gemini 2.5 Pro          29.1       22.3–36.1     8.4              37.9                  $35.8         2026-03-02
    Gemma 4 26B A4B IT      25.1       18.8–31.8     21.4             77.2                  $3.3          2026-05-01
    GPT OSS 120B            18.9       13.1–25.1     25.9             122.7                 $7.6          2026-05-09
    Gemini 2.5 Flash        15.9       10.7–21.1     4.9              108.8                 $11.2         2026-02-26
    Qwen 3.5 9B             15.5       10.1–20.9     16.6             181.4                 $15.6         2026-05-07
    GPT OSS 20B             2.4        1.2–3.9       3.8              12.0                  $0.2          2026-05-11

    Latest results as of May 18th 2026: This refresh adds open-weight models and new columns for latency, tokens, and cost.
  • May 5th 2026 Android LLM Benchmark

    Columns:
    Score (%): average percentage of 100 test cases successfully resolved across 10 runs for each model
    CI range (%): expected performance range, reflecting the results' statistical reliability (p-value < 0.05)

    Model                   Score (%)  CI range (%)  Date
    GPT 5.5                 74.0       66.8–80.5     2026-04-27
    GPT 5.4                 72.4       65.4–79.3     2026-03-16
    Gemini 3.1 Pro Preview  72.4       65.1–78.8     2026-02-27
    Claude Opus 4.7         68.7       61.2–76.0     2026-04-27
    GPT 5.3 Codex           67.7       59.9–75.1     2026-03-18
    Claude Opus 4.6         66.6       59.5–73.9     2026-02-26
    GPT 5.2 Codex           62.5       54.6–70.1     2026-02-26
    Claude Opus 4.5         61.9       53.0–70.1     2026-02-26
    Gemini 3 Pro Preview    60.4       52.3–68.2     2026-02-27
    Claude Sonnet 4.6       58.4       50.4–66.5     2026-02-27
    Claude Sonnet 4.5       53.8       45.5–62.2     2026-02-26
    Gemini 3 Flash Preview  42.0       36.5–47.6     2026-02-26
    Gemini 2.5 Flash        16.7       11.5–22.1     2026-02-26

    Latest results as of May 5th 2026: This refresh includes the addition of GPT-5.5 and Claude Opus 4.7.
  • April 7th 2026 Android LLM Benchmark

    Columns:
    Score (%): average percentage of 100 test cases successfully resolved across 10 runs for each model
    CI range (%): expected performance range, reflecting the results' statistical reliability (p-value < 0.05)

    Model                   Score (%)  CI range (%)  Date
    GPT-5.4                 72.4       65.1–79.3     2026-03-16
    Gemini 3.1 Pro Preview  72.4       64.8–79.3     2026-02-27
    GPT-5.3-Codex           67.7       60.1–74.8     2026-03-18
    Claude Opus 4.6         66.6       58.5–74.0     2026-02-26
    GPT-5.2-Codex           62.5       54.8–69.8     2026-02-26
    Claude Opus 4.5         61.9       53.8–70.3     2026-02-26
    Gemini 3 Pro Preview    60.4       52.4–68.1     2026-02-27
    Claude Sonnet 4.6       58.4       50.9–66.5     2026-02-27
    Claude Sonnet 4.5       54.2       46.0–62.1     2026-02-26
    Gemini 3 Flash Preview  42.0       36.4–47.7     2026-02-26
    Gemini 2.5 Flash        16.1       11.2–21.2     2026-02-26

    Latest results as of April 7th 2026: This refresh includes the addition of GPT-5.4 and GPT-5.3-Codex.
  • March 5th 2026 Android LLM Benchmark

    Columns:
    Score (%): average percentage of 100 test cases successfully resolved across 10 runs for each model
    CI range (%): expected performance range, reflecting the results' statistical reliability (p-value < 0.05)

    Model                   Score (%)  CI range (%)  Date
    Gemini 3.1 Pro Preview  72.4       65.3–79.8     2026-02-27
    Claude Opus 4.6         66.6       58.9–73.9     2026-02-26
    GPT-5.2-Codex           62.5       54.7–70.3     2026-02-26
    Claude Opus 4.5         61.9       53.9–69.6     2026-02-26
    Gemini 3 Pro Preview    60.4       52.6–67.8     2026-02-27
    Claude Sonnet 4.6       58.4       51.1–66.6     2026-02-27
    Claude Sonnet 4.5       54.2       45.5–62.4     2026-02-26
    Gemini 3 Flash Preview  42.0       36.3–47.9     2026-02-26
    Gemini 2.5 Flash        16.1       10.9–21.9     2026-02-26

    Latest results as of March 5th 2026.
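
The column definitions above describe the score as the average share of 100 test cases resolved across 10 runs, with a range reported at the 0.05 level. The page does not say how that range is computed. The sketch below shows one common possibility, a percentile bootstrap over tasks, purely as an illustration; the function name `score_and_ci`, its parameters, and the choice of bootstrap are assumptions, not the benchmark's documented method.

```python
import random
import statistics


def score_and_ci(runs, n_boot=2000, alpha=0.05, seed=0):
    """Return (score, lower, upper) in percent.

    runs: list of runs, each a list of 0/1 outcomes, one per task.
    Assumption: the interval is a percentile bootstrap that resamples
    tasks; the leaderboard may use a different procedure.
    """
    n_tasks = len(runs[0])
    # Resolve rate of each task, averaged over all runs.
    task_rates = [
        statistics.fmean(run[t] for run in runs) for t in range(n_tasks)
    ]
    score = 100 * statistics.fmean(task_rates)

    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        # Resample tasks with replacement and recompute the mean score.
        sample = rng.choices(task_rates, k=n_tasks)
        boots.append(100 * statistics.fmean(sample))
    boots.sort()

    # Percentile interval at level 1 - alpha.
    lower = boots[int((alpha / 2) * n_boot)]
    upper = boots[int((1 - alpha / 2) * n_boot) - 1]
    return score, lower, upper


if __name__ == "__main__":
    # Synthetic data: 10 runs over 100 tasks, each task passing 60% of the time.
    rng = random.Random(1)
    demo = [[int(rng.random() < 0.6) for _ in range(100)] for _ in range(10)]
    print(score_and_ci(demo))
```

Resampling tasks rather than runs reflects the idea that the 100 tasks are a sample from a larger pool of possible tasks; that is a modelling choice, and resampling runs instead would answer a different question (run-to-run variance on a fixed task set).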