⚡ January 29, 2026

Generated: 2026-01-29 09:48 UTC
Total Duration: 56m 56s
Iterations: 1
Judge (classifier) model: gpt-4.1

Fast Benchmark

Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes

HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.

If you find scenarios where HolmesGPT performs poorly, please consider adding them to the benchmark as evals.

Model Accuracy Comparison

Model Pass Fail Skip/Error Total Success Rate
deepseek-chat 11 3 0 14 🟡 79% (11/14)
deepseek-reasoner 9 5 0 14 🟡 64% (9/14)
gemini-3-flash-preview 7 7 0 14 🟡 50% (7/14)
gemini-3-pro-preview 8 6 0 14 🟡 57% (8/14)
gpt-5.2-high-reasoning 9 5 0 14 🟡 64% (9/14)
haiku-4.5 10 4 0 14 🟡 71% (10/14)
kimi-2.5-openrouter 11 3 0 14 🟡 79% (11/14)
opus-4.5 12 2 0 14 🟡 86% (12/14)
sonnet-4.5 13 1 0 14 🟡 93% (13/14)

Model Cost Comparison

Model Tests Avg Cost Min Cost Max Cost Total Cost
deepseek-chat 14 $0.01 $0.00 $0.03 $0.20
deepseek-reasoner 14 $0.02 $0.00 $0.03 $0.23
gemini-3-flash-preview 13 $0.06 $0.02 $0.16 $0.80
gemini-3-pro-preview 11 $0.14 $0.04 $0.25 $1.59
gpt-5.2-high-reasoning 14 $0.25 $0.01 $0.75 $3.56
haiku-4.5 14 $0.05 $0.02 $0.10 $0.76
opus-4.5 14 $0.25 $0.11 $0.38 $3.45
sonnet-4.5 14 $0.19 $0.07 $0.31 $2.64
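
The Tests column above is lower than 14 for some models, which suggests that the cost statistics cover only runs with a recorded cost. The sketch below assumes that interpretation and recomputes the gemini-3-pro-preview row from the per-eval costs shown in the Detailed Raw Results table. `cost_summary` is a hypothetical helper name, not part of HolmesGPT, and `None` marks a cell that shows no cost.

```python
from typing import Optional

# Per-eval costs in USD for gemini-3-pro-preview, in table order.
# None marks evals whose cell shows no cost figure.
costs: list[Optional[float]] = [
    0.09, None, 0.22, 0.06, 0.24, 0.20, None,
    0.09, 0.09, 0.04, 0.07, None, 0.23, 0.25,
]


def cost_summary(values):
    """Summarize only the runs that reported a cost (assumed rule)."""
    recorded = [v for v in values if v is not None]
    total = sum(recorded)
    return {
        "tests": len(recorded),
        "avg": round(total / len(recorded), 2),
        "min": min(recorded),
        "max": max(recorded),
        "total": round(total, 2),
    }


summary = cost_summary(costs)
print(summary)
```

The count, average, minimum, and maximum agree with the table row. The total computed from the rounded per-eval figures comes out one cent below the reported $1.59, presumably because the report sums unrounded costs.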

Model Latency Comparison

Model Avg (s) Min (s) Max (s) P50 (s) P95 (s)
deepseek-chat 131.8 52.5 205.2 143.4 205.2
deepseek-reasoner 222.4 30.3 430.3 255.2 430.3
gemini-3-flash-preview 38.0 11.3 104.3 34.6 104.3
gemini-3-pro-preview 69.8 7.0 128.0 76.1 128.0
gpt-5.2-high-reasoning 256.1 10.7 836.0 254.6 836.0
haiku-4.5 27.4 7.6 56.3 27.8 56.3
kimi-2.5-openrouter 70.7 12.2 218.4 66.0 218.4
opus-4.5 38.3 11.4 62.5 38.1 62.5
sonnet-4.5 39.7 7.6 66.8 42.5 66.8
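
In every row above, P95 equals Max. With only 14 samples per model this is expected under a simple index-based percentile. The sketch below reproduces the haiku-4.5 row from the per-eval timings in the Detailed Raw Results table. The rule `sorted_values[int(p * n)]` is inferred from the published numbers and is an assumption, not taken from the benchmark's code. `percentile` and `latency_summary` are hypothetical helper names.

```python
# haiku-4.5 per-eval latencies in seconds, in table order.
latencies = [27.8, 32.0, 31.3, 25.9, 35.9, 27.7, 37.0,
             23.8, 7.6, 8.0, 15.4, 27.1, 28.2, 56.3]


def percentile(sorted_values, p):
    """Return sorted_values[int(p * n)], clamped to the last index (assumed rule)."""
    n = len(sorted_values)
    return sorted_values[min(n - 1, int(p * n))]


def latency_summary(values):
    """Compute the five columns of the latency table for one model."""
    ordered = sorted(values)
    return {
        "avg": round(sum(ordered) / len(ordered), 1),
        "min": ordered[0],
        "max": ordered[-1],
        "p50": percentile(ordered, 0.50),
        "p95": percentile(ordered, 0.95),
    }


stats = latency_summary(latencies)
print(stats)
```

With 14 samples, `int(0.95 * 14)` is 13, the last index, so P95 coincides with the maximum. The P95 column therefore carries no extra information until more runs per model are collected.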

Performance by Tag

Success rate by test category and model:

Tag deepseek-chat deepseek-reasoner gemini-3-flash-preview gemini-3-pro-preview gpt-5.2-high-reasoning haiku-4.5 kimi-2.5-openrouter opus-4.5 sonnet-4.5 Warnings
benchmark 🟡 60% (3/5) 🟡 60% (3/5) 🟡 60% (3/5) 🟡 60% (3/5) 🟡 60% (3/5) 🟡 60% (3/5) 🟡 60% (3/5) 🟡 60% (3/5) 🟡 80% (4/5)
context_window 🟢 100% (2/2) 🟡 50% (1/2) 🟡 50% (1/2) 🔴 0% (0/2) 🟢 100% (2/2) 🟡 50% (1/2) 🟡 50% (1/2) 🟢 100% (2/2) 🟢 100% (2/2)
counting 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1)
datetime 🟡 67% (2/3) 🟡 67% (2/3) 🟡 33% (1/3) 🔴 0% (0/3) 🟢 100% (3/3) 🟡 67% (2/3) 🟡 67% (2/3) 🟢 100% (3/3) 🟢 100% (3/3)
easy 🟡 86% (6/7) 🟡 57% (4/7) 🟡 43% (3/7) 🟡 57% (4/7) 🟡 71% (5/7) 🟡 86% (6/7) 🟢 100% (7/7) 🟢 100% (7/7) 🟢 100% (7/7)
grafana-dashboard 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1)
hard 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1)
kubernetes 🟢 100% (7/7) 🟡 57% (4/7) 🟡 57% (4/7) 🟡 57% (4/7) 🟡 57% (4/7) 🟡 71% (5/7) 🟡 86% (6/7) 🟢 100% (7/7) 🟢 100% (7/7)
logs 🟡 75% (3/4) 🟡 25% (1/4) 🟡 50% (2/4) 🟡 25% (1/4) 🟡 75% (3/4) 🟡 50% (2/4) 🟡 50% (2/4) 🟡 75% (3/4) 🟡 75% (3/4)
loki 🟢 100% (1/1) 🔴 0% (0/1) 🔴 0% (0/1) 🔴 0% (0/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1)
medium 🟡 67% (4/6) 🟡 67% (4/6) 🟡 50% (3/6) 🟡 50% (3/6) 🟡 50% (3/6) 🟡 50% (3/6) 🟡 50% (3/6) 🟡 67% (4/6) 🟡 83% (5/6)
metrics 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1)
network 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🔴 0% (0/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1)
one-test 🟢 100% (1/1) 🔴 0% (0/1) 🟢 100% (1/1) 🟢 100% (1/1) 🔴 0% (0/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1)
port-forward 🟢 100% (2/2) 🟡 50% (1/2) 🟡 50% (1/2) 🟡 50% (1/2) 🟢 100% (2/2) 🟢 100% (2/2) 🟢 100% (2/2) 🟢 100% (2/2) 🟢 100% (2/2)
question-answer 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1) 🟢 100% (1/1)
regression 🟡 89% (8/9) 🟡 67% (6/9) 🟡 44% (4/9) 🟡 56% (5/9) 🟡 67% (6/9) 🟡 78% (7/9) 🟡 89% (8/9) 🟢 100% (9/9) 🟢 100% (9/9)
runbooks 🟡 50% (1/2) 🟢 100% (2/2) 🔴 0% (0/2) 🟡 50% (1/2) 🟡 50% (1/2) 🟡 50% (1/2) 🟡 50% (1/2) 🟡 50% (1/2) 🟢 100% (2/2)
Overall 🟡 79% (11/14) 🟡 64% (9/14) 🟡 50% (7/14) 🟡 57% (8/14) 🟡 64% (9/14) 🟡 71% (10/14) 🟡 79% (11/14) 🟡 86% (12/14) 🟡 93% (13/14)

Raw Results

Status of all evaluations across models. Color coding:

  • 🟢 Passing 100% (stable)
  • 🟡 Passing 1-99%
  • 🔴 Passing 0% (failing)
  • 🔧 Mock data failure (missing or invalid test data)
  • ⚠️ Setup failure (environment/infrastructure issue)
  • ⏱️ Timeout or rate limit error
  • ⏭️ Test skipped (e.g., known issue or precondition not met)
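
The pass-rate part of the color coding above can be sketched as a small formatter that turns pass counts into cells such as `🟡 79% (11/14)`. This is a minimal sketch, not the report generator's actual code: `status_cell` is a hypothetical name, the round-half-up rule is an assumption, and the setup, mock, timeout, and skip states are left out.

```python
def status_cell(passed: int, total: int) -> str:
    """Format one result cell from pass counts, following the legend."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("expected 0 <= passed <= total and total > 0")

    # Pick the icon from the raw counts so that a rounded percentage
    # (for example 99.6% shown as 100%) can never turn a cell green.
    if passed == total:
        icon = "🟢"
    elif passed == 0:
        icon = "🔴"
    else:
        icon = "🟡"

    # Integer round-half-up of 100 * passed / total (assumed rounding rule).
    percent = (200 * passed + total) // (2 * total)
    return f"{icon} {percent}% ({passed}/{total})"


print(status_cell(11, 14))
print(status_cell(0, 2))
print(status_cell(1, 1))
```

Choosing the icon from the counts rather than from the rounded percentage keeps the green status reserved for evals that passed on every run.
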
Eval ID deepseek-chat deepseek-reasoner gemini-3-flash-preview gemini-3-pro-preview gpt-5.2-high-reasoning haiku-4.5 kimi-2.5-openrouter opus-4.5 sonnet-4.5
09_crashpod 🔗 🟢 🔴 🟢 🟢 🔴 🟢 🟢 🟢 🟢
101_loki_historical_logs_pod_deleted 🔗 🟢 🔴 🔴 🔴 🟢 🟢 🟢 🟢 🟢
108_logs_nearby_lines 🔗 🔴 🔴 🟢 🟢 🔴 🔴 🔴 🔴 🔴
111_pod_names_contain_service 🔗 🟢 🟢 🟢 🟢 🔴 🟢 🟢 🟢 🟢
12_job_crashing 🔗 🟢 🟢 🔴 🟢 🟢 🟢 🟢 🟢 🟢
162_get_runbooks 🔗 🟢 🟢 🔴 🔴 🟢 🔴 🔴 🟢 🟢
176_network_policy_blocking_traffic_no_runbooks 🔗 🟢 🟢 🟢 🔴 🟢 🟢 🟢 🟢 🟢
179_grafana_big_dashboard_query 🔗 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🟢
24_misconfigured_pvc 🔗 🟢 🔴 🔴 🟢 🔴 🔴 🟢 🟢 🟢
43_current_datetime_from_prompt 🔗 🔴 🟢 🔴 🔴 🟢 🟢 🟢 🟢 🟢
61_exact_match_counting 🔗 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🟢
73a_time_window_anomaly 🔗 🟢 🔴 🟢 🔴 🟢 🟢 🔴 🟢 🟢
73b_time_window_anomaly 🔗 🟢 🟢 🔴 🔴 🟢 🔴 🟢 🟢 🟢
96_no_matching_runbook 🔗 🔴 🟢 🔴 🟢 🔴 🟢 🟢 🔴 🟢
SUMMARY 🟡 79% (11/14) 🟡 64% (9/14) 🟡 50% (7/14) 🟡 57% (8/14) 🟡 64% (9/14) 🟡 71% (10/14) 🟡 79% (11/14) 🟡 86% (12/14) 🟡 93% (13/14)

Detailed Raw Results

Eval ID deepseek-chat deepseek-reasoner gemini-3-flash-preview gemini-3-pro-preview gpt-5.2-high-reasoning haiku-4.5 kimi-2.5-openrouter opus-4.5 sonnet-4.5
09_crashpod 🔗 🟢 100% (1/1) / ⏱️ 152.2s / 💰 $0.01 🔴 0% (0/1) / ⏱️ 44.8s / 💰 $0.00 🟢 100% (1/1) / ⏱️ 25.8s / 💰 $0.04 🟢 100% (1/1) / ⏱️ 47.2s / 💰 $0.09 🔴 0% (0/1) / ⏱️ 46.9s / 💰 $0.03 🟢 100% (1/1) / ⏱️ 27.8s / 💰 $0.06 🟢 100% (1/1) / ⏱️ 62.3s 🟢 100% (1/1) / ⏱️ 36.5s / 💰 $0.21 🟢 100% (1/1) / ⏱️ 37.2s / 💰 $0.18
101_loki_historical_logs_pod_deleted 🔗 🟢 100% (1/1) / ⏱️ 157.2s / 💰 $0.02 🔴 0% (0/1) / ⏱️ 430.3s / 💰 $0.03 🔴 0% (0/1) / ⏱️ 104.3s 🔴 0% (0/1) / ⏱️ 79.4s 🟢 100% (1/1) / ⏱️ 590.4s / 💰 $0.62 🟢 100% (1/1) / ⏱️ 32.0s / 💰 $0.06 🟢 100% (1/1) / ⏱️ 97.7s 🟢 100% (1/1) / ⏱️ 36.4s / 💰 $0.20 🟢 100% (1/1) / ⏱️ 42.5s / 💰 $0.18
108_logs_nearby_lines 🔗 🔴 0% (0/1) / ⏱️ 143.4s / 💰 $0.02 🔴 0% (0/1) / ⏱️ 323.1s / 💰 $0.03 🟢 100% (1/1) / ⏱️ 52.2s / 💰 $0.08 🟢 100% (1/1) / ⏱️ 97.4s / 💰 $0.22 🔴 0% (0/1) / ⏱️ 493.6s / 💰 $0.44 🔴 0% (0/1) / ⏱️ 31.3s / 💰 $0.06 🔴 0% (0/1) / ⏱️ 34.6s 🔴 0% (0/1) / ⏱️ 49.2s / 💰 $0.38 🔴 0% (0/1) / ⏱️ 58.4s / 💰 $0.24
111_pod_names_contain_service 🔗 🟢 100% (1/1) / ⏱️ 130.2s / 💰 $0.01 🟢 100% (1/1) / ⏱️ 193.3s / 💰 $0.01 🟢 100% (1/1) / ⏱️ 22.8s / 💰 $0.03 🟢 100% (1/1) / ⏱️ 39.4s / 💰 $0.06 🔴 0% (0/1) / ⏱️ 29.5s / 💰 $0.01 🟢 100% (1/1) / ⏱️ 25.9s / 💰 $0.06 🟢 100% (1/1) / ⏱️ 73.3s 🟢 100% (1/1) / ⏱️ 37.6s / 💰 $0.23 🟢 100% (1/1) / ⏱️ 35.8s / 💰 $0.15
12_job_crashing 🔗 🟢 100% (1/1) / ⏱️ 192.4s / 💰 $0.02 🟢 100% (1/1) / ⏱️ 226.7s / 💰 $0.01 🔴 0% (0/1) / ⏱️ 51.3s / 💰 $0.10 🟢 100% (1/1) / ⏱️ 128.0s / 💰 $0.24 🟢 100% (1/1) / ⏱️ 348.2s / 💰 $0.32 🟢 100% (1/1) / ⏱️ 35.9s / 💰 $0.07 🟢 100% (1/1) / ⏱️ 81.4s 🟢 100% (1/1) / ⏱️ 38.1s / 💰 $0.24 🟢 100% (1/1) / ⏱️ 40.4s / 💰 $0.20
162_get_runbooks 🔗 🟢 100% (1/1) / ⏱️ 129.1s / 💰 $0.02 🟢 100% (1/1) / ⏱️ 265.0s / 💰 $0.03 🔴 0% (0/1) / ⏱️ 39.2s / 💰 $0.07 🔴 0% (0/1) / ⏱️ 62.4s / 💰 $0.20 🟢 100% (1/1) / ⏱️ 261.5s / 💰 $0.30 🔴 0% (0/1) / ⏱️ 27.7s / 💰 $0.06 🔴 0% (0/1) / ⏱️ 41.2s 🟢 100% (1/1) / ⏱️ 44.0s / 💰 $0.32 🟢 100% (1/1) / ⏱️ 53.0s / 💰 $0.25
176_network_policy_blocking_traffic_no_runbooks 🔗 🟢 100% (1/1) / ⏱️ 167.9s / 💰 $0.02 🟢 100% (1/1) / ⏱️ 307.8s / 💰 $0.02 🟢 100% (1/1) / ⏱️ 41.8s / 💰 $0.05 🔴 0% (0/1) / ⏱️ 7.0s 🟢 100% (1/1) / ⏱️ 254.6s / 💰 $0.21 🟢 100% (1/1) / ⏱️ 37.0s / 💰 $0.06 🟢 100% (1/1) / ⏱️ 85.7s 🟢 100% (1/1) / ⏱️ 52.7s / 💰 $0.32 🟢 100% (1/1) / ⏱️ 44.7s / 💰 $0.21
179_grafana_big_dashboard_query 🔗 🟢 100% (1/1) / ⏱️ 62.9s / 💰 $0.01 🟢 100% (1/1) / ⏱️ 154.6s / 💰 $0.01 🟢 100% (1/1) / ⏱️ 16.5s / 💰 $0.03 🟢 100% (1/1) / ⏱️ 40.4s / 💰 $0.09 🟢 100% (1/1) / ⏱️ 46.5s / 💰 $0.09 🟢 100% (1/1) / ⏱️ 23.8s / 💰 $0.05 🟢 100% (1/1) / ⏱️ 46.0s 🟢 100% (1/1) / ⏱️ 24.1s / 💰 $0.19 🟢 100% (1/1) / ⏱️ 21.5s / 💰 $0.12
24_misconfigured_pvc 🔗 🟢 100% (1/1) / ⏱️ 97.1s / 💰 $0.01 🔴 0% (0/1) / ⏱️ 30.3s / 💰 $0.00 🔴 0% (0/1) / ⏱️ 17.8s / 💰 $0.04 🟢 100% (1/1) / ⏱️ 76.1s / 💰 $0.09 🔴 0% (0/1) / ⏱️ 10.7s / 💰 $0.01 🔴 0% (0/1) / ⏱️ 7.6s / 💰 $0.02 🟢 100% (1/1) / ⏱️ 66.0s 🟢 100% (1/1) / ⏱️ 42.1s / 💰 $0.26 🟢 100% (1/1) / ⏱️ 38.8s / 💰 $0.18
43_current_datetime_from_prompt 🔗 🔴 0% (0/1) / ⏱️ 52.5s / 💰 $0.00 🟢 100% (1/1) / ⏱️ 41.5s / 💰 $0.00 🔴 0% (0/1) / ⏱️ 15.2s / 💰 $0.03 🔴 0% (0/1) / ⏱️ 21.2s / 💰 $0.04 🟢 100% (1/1) / ⏱️ 33.0s / 💰 $0.03 🟢 100% (1/1) / ⏱️ 8.0s / 💰 $0.02 🟢 100% (1/1) / ⏱️ 12.2s 🟢 100% (1/1) / ⏱️ 11.4s / 💰 $0.11 🟢 100% (1/1) / ⏱️ 7.6s / 💰 $0.07
61_exact_match_counting 🔗 🟢 100% (1/1) / ⏱️ 68.5s / 💰 $0.01 🟢 100% (1/1) / ⏱️ 97.5s / 💰 $0.01 🟢 100% (1/1) / ⏱️ 11.3s / 💰 $0.02 🟢 100% (1/1) / ⏱️ 104.4s / 💰 $0.07 🟢 100% (1/1) / ⏱️ 74.2s / 💰 $0.09 🟢 100% (1/1) / ⏱️ 15.4s / 💰 $0.04 🟢 100% (1/1) / ⏱️ 14.2s 🟢 100% (1/1) / ⏱️ 29.3s / 💰 $0.16 🟢 100% (1/1) / ⏱️ 19.5s / 💰 $0.11
73a_time_window_anomaly 🔗 🟢 100% (1/1) / ⏱️ 104.4s / 💰 $0.01 🔴 0% (0/1) / ⏱️ 320.8s / 💰 $0.02 🟢 100% (1/1) / ⏱️ 79.6s / 💰 $0.16 🔴 0% (0/1) / ⏱️ 55.6s 🟢 100% (1/1) / ⏱️ 309.2s / 💰 $0.45 🟢 100% (1/1) / ⏱️ 27.1s / 💰 $0.05 🔴 0% (0/1) / ⏱️ 104.4s 🟢 100% (1/1) / ⏱️ 34.1s / 💰 $0.22 🟢 100% (1/1) / ⏱️ 44.3s / 💰 $0.22
73b_time_window_anomaly 🔗 🟢 100% (1/1) / ⏱️ 181.8s / 💰 $0.02 🟢 100% (1/1) / ⏱️ 255.2s / 💰 $0.02 🔴 0% (0/1) / ⏱️ 19.0s / 💰 $0.04 🔴 0% (0/1) / ⏱️ 101.4s / 💰 $0.23 🟢 100% (1/1) / ⏱️ 251.9s / 💰 $0.23 🔴 0% (0/1) / ⏱️ 28.2s / 💰 $0.05 🟢 100% (1/1) / ⏱️ 52.1s 🟢 100% (1/1) / ⏱️ 38.8s / 💰 $0.23 🟢 100% (1/1) / ⏱️ 45.8s / 💰 $0.20
96_no_matching_runbook 🔗 🔴 0% (0/1) / ⏱️ 205.2s / 💰 $0.03 🟢 100% (1/1) / ⏱️ 422.5s / 💰 $0.03 🔴 0% (0/1) / ⏱️ 34.6s / 💰 $0.10 🟢 100% (1/1) / ⏱️ 117.1s / 💰 $0.25 🔴 0% (0/1) / ⏱️ 836.0s / 💰 $0.75 🟢 100% (1/1) / ⏱️ 56.3s / 💰 $0.10 🟢 100% (1/1) / ⏱️ 218.4s 🔴 0% (0/1) / ⏱️ 62.5s / 💰 $0.37 🟢 100% (1/1) / ⏱️ 66.8s / 💰 $0.31

Results are generated automatically and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-21471579810.