⚡ January 29, 2026
Generated: 2026-01-29 09:48 UTC
Total Duration: 56m 56s
Iterations: 1
Judge (classifier) model: gpt-4.1
Fast Benchmark
Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| deepseek-chat | 11 | 3 | 0 | 14 | 🟡 79% (11/14) |
| deepseek-reasoner | 9 | 5 | 0 | 14 | 🟡 64% (9/14) |
| gemini-3-flash-preview | 7 | 7 | 0 | 14 | 🟡 50% (7/14) |
| gemini-3-pro-preview | 8 | 6 | 0 | 14 | 🟡 57% (8/14) |
| gpt-5.2-high-reasoning | 9 | 5 | 0 | 14 | 🟡 64% (9/14) |
| haiku-4.5 | 10 | 4 | 0 | 14 | 🟡 71% (10/14) |
| kimi-2.5-openrouter | 11 | 3 | 0 | 14 | 🟡 79% (11/14) |
| opus-4.5 | 12 | 2 | 0 | 14 | 🟡 86% (12/14) |
| sonnet-4.5 | 13 | 1 | 0 | 14 | 🟡 93% (13/14) |
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| deepseek-chat | 14 | $0.01 | $0.00 | $0.03 | $0.20 |
| deepseek-reasoner | 14 | $0.02 | $0.00 | $0.03 | $0.23 |
| gemini-3-flash-preview | 13 | $0.06 | $0.02 | $0.16 | $0.80 |
| gemini-3-pro-preview | 11 | $0.14 | $0.04 | $0.25 | $1.59 |
| gpt-5.2-high-reasoning | 14 | $0.25 | $0.01 | $0.75 | $3.56 |
| haiku-4.5 | 14 | $0.05 | $0.02 | $0.10 | $0.76 |
| opus-4.5 | 14 | $0.25 | $0.11 | $0.38 | $3.45 |
| sonnet-4.5 | 14 | $0.19 | $0.07 | $0.31 | $2.64 |
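The Tests column appears to count only runs that reported a cost, which would explain why it drops below 14 for the two Gemini models and why kimi-2.5-openrouter has no row. A minimal sketch of that aggregation, assuming runs without a cost are simply skipped (an inference from the tables, not documented behavior of the report generator). The function name `summarize_costs` is hypothetical, and the inputs are the rounded gemini-3-pro-preview costs from the detailed results:

```python
from typing import Optional


def summarize_costs(costs: list[Optional[float]]) -> dict[str, float]:
    """Aggregate per-run costs, skipping runs with no reported cost (assumption)."""
    reported = [c for c in costs if c is not None]
    if not reported:
        raise ValueError("no run reported a cost")
    total = sum(reported)
    return {
        "tests": len(reported),
        "avg": total / len(reported),
        "min": min(reported),
        "max": max(reported),
        "total": total,
    }


# gemini-3-pro-preview, one entry per eval; None marks a run with no cost shown.
gemini_pro_costs = [0.09, None, 0.22, 0.06, 0.24, 0.20, None,
                    0.09, 0.09, 0.04, 0.07, None, 0.23, 0.25]

summary = summarize_costs(gemini_pro_costs)
print(summary["tests"], round(summary["avg"], 2), summary["min"], summary["max"])
```

Summing the rounded per-run values gives a total of $1.58, one cent below the table's $1.59; that gap would be expected if the report sums unrounded costs before rounding.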
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| deepseek-chat | 131.8 | 52.5 | 205.2 | 143.4 | 205.2 |
| deepseek-reasoner | 222.4 | 30.3 | 430.3 | 255.2 | 430.3 |
| gemini-3-flash-preview | 38.0 | 11.3 | 104.3 | 34.6 | 104.3 |
| gemini-3-pro-preview | 69.8 | 7.0 | 128.0 | 76.1 | 128.0 |
| gpt-5.2-high-reasoning | 256.1 | 10.7 | 836.0 | 254.6 | 836.0 |
| haiku-4.5 | 27.4 | 7.6 | 56.3 | 27.8 | 56.3 |
| kimi-2.5-openrouter | 70.7 | 12.2 | 218.4 | 66.0 | 218.4 |
| opus-4.5 | 38.3 | 11.4 | 62.5 | 38.1 | 62.5 |
| sonnet-4.5 | 39.7 | 7.6 | 66.8 | 42.5 | 66.8 |
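With only 14 runs per model, P95 coincides with the maximum in every row. The published P50 and P95 values are consistent with picking the element at index `int(p * n)` of the sorted latencies; this is an inference from the numbers, not the report generator's documented method. A short sketch using the deepseek-chat latencies from the detailed results (the helper name `percentile` is hypothetical):

```python
def percentile(values: list[float], p: float) -> float:
    """Return the element at index int(p * n) of the sorted values (assumed rule)."""
    ordered = sorted(values)
    # Clamp so that p = 1.0 does not index past the end.
    index = min(int(p * len(ordered)), len(ordered) - 1)
    return ordered[index]


# deepseek-chat per-eval latencies in seconds, from the detailed results table.
deepseek_chat_latencies = [152.2, 157.2, 143.4, 130.2, 192.4, 129.1, 167.9,
                           62.9, 97.1, 52.5, 68.5, 104.4, 181.8, 205.2]

avg = sum(deepseek_chat_latencies) / len(deepseek_chat_latencies)
p50 = percentile(deepseek_chat_latencies, 0.50)
p95 = percentile(deepseek_chat_latencies, 0.95)
print(round(avg, 1), p50, p95)
```

Under this rule the sketch reproduces the deepseek-chat row (131.8, 143.4, 205.2). An interpolating percentile would give a different P50 for an even number of samples, so the tail figures are best read as "slowest single run" at this sample size.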
Performance by Tag
Success rate by test category and model:
| Tag | deepseek-chat | deepseek-reasoner | gemini-3-flash-preview | gemini-3-pro-preview | gpt-5.2-high-reasoning | haiku-4.5 | kimi-2.5-openrouter | opus-4.5 | sonnet-4.5 | Warnings |
|---|---|---|---|---|---|---|---|---|---|---|
| benchmark | 🟡 60% (3/5) | 🟡 60% (3/5) | 🟡 60% (3/5) | 🟡 60% (3/5) | 🟡 60% (3/5) | 🟡 60% (3/5) | 🟡 60% (3/5) | 🟡 60% (3/5) | 🟡 80% (4/5) | |
| context_window | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | |
| counting | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| datetime | 🟡 67% (2/3) | 🟡 67% (2/3) | 🟡 33% (1/3) | 🔴 0% (0/3) | 🟢 100% (3/3) | 🟡 67% (2/3) | 🟡 67% (2/3) | 🟢 100% (3/3) | 🟢 100% (3/3) | |
| easy | 🟡 86% (6/7) | 🟡 57% (4/7) | 🟡 43% (3/7) | 🟡 57% (4/7) | 🟡 71% (5/7) | 🟡 86% (6/7) | 🟢 100% (7/7) | 🟢 100% (7/7) | 🟢 100% (7/7) | |
| grafana-dashboard | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| hard | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| kubernetes | 🟢 100% (7/7) | 🟡 57% (4/7) | 🟡 57% (4/7) | 🟡 57% (4/7) | 🟡 57% (4/7) | 🟡 71% (5/7) | 🟡 86% (6/7) | 🟢 100% (7/7) | 🟢 100% (7/7) | |
| logs | 🟡 75% (3/4) | 🟡 25% (1/4) | 🟡 50% (2/4) | 🟡 25% (1/4) | 🟡 75% (3/4) | 🟡 50% (2/4) | 🟡 50% (2/4) | 🟡 75% (3/4) | 🟡 75% (3/4) | |
| loki | 🟢 100% (1/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| medium | 🟡 67% (4/6) | 🟡 67% (4/6) | 🟡 50% (3/6) | 🟡 50% (3/6) | 🟡 50% (3/6) | 🟡 50% (3/6) | 🟡 50% (3/6) | 🟡 67% (4/6) | 🟡 83% (5/6) | |
| metrics | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| network | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| one-test | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| port-forward | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | |
| question-answer | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| regression | 🟡 89% (8/9) | 🟡 67% (6/9) | 🟡 44% (4/9) | 🟡 56% (5/9) | 🟡 67% (6/9) | 🟡 78% (7/9) | 🟡 89% (8/9) | 🟢 100% (9/9) | 🟢 100% (9/9) | |
| runbooks | 🟡 50% (1/2) | 🟢 100% (2/2) | 🔴 0% (0/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | |
| Overall | 🟡 79% (11/14) | 🟡 64% (9/14) | 🟡 50% (7/14) | 🟡 57% (8/14) | 🟡 64% (9/14) | 🟡 71% (10/14) | 🟡 79% (11/14) | 🟡 86% (12/14) | 🟡 93% (13/14) | |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
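The first three rules in the legend map a pass count to a status icon, and the summary cells pair that icon with a percentage and a fraction. A minimal sketch of that mapping, assuming the percentage is rounded to the nearest integer (which matches the published cells but is not confirmed by the source). The function name `format_status` is hypothetical:

```python
def format_status(passed: int, total: int) -> str:
    """Render a cell like '🟡 79% (11/14)' using the legend's color rules."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    if passed == total:
        icon = "🟢"  # passing 100% (stable)
    elif passed == 0:
        icon = "🔴"  # passing 0% (failing)
    else:
        icon = "🟡"  # passing 1-99%
    percent = round(100 * passed / total)  # rounding rule is an assumption
    return f"{icon} {percent}% ({passed}/{total})"


print(format_status(11, 14))
print(format_status(0, 2))
print(format_status(7, 7))
```

Choosing the icon from the raw counts rather than from the rounded percentage keeps a near-perfect result (for example 199 of 200) from being shown as green.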
| Eval ID | deepseek-chat | deepseek-reasoner | gemini-3-flash-preview | gemini-3-pro-preview | gpt-5.2-high-reasoning | haiku-4.5 | kimi-2.5-openrouter | opus-4.5 | sonnet-4.5 |
|---|---|---|---|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 | 🔴 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 108_logs_nearby_lines 🔗 | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 |
| 111_pod_names_contain_service 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 |
| 12_job_crashing 🔗 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 162_get_runbooks 🔗 | 🟢 | 🟢 | 🔴 | 🔴 | 🟢 | 🔴 | 🔴 | 🟢 | 🟢 |
| 176_network_policy_blocking_traffic_no_runbooks 🔗 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 24_misconfigured_pvc 🔗 | 🟢 | 🔴 | 🔴 | 🟢 | 🔴 | 🔴 | 🟢 | 🟢 | 🟢 |
| 43_current_datetime_from_prompt 🔗 | 🔴 | 🟢 | 🔴 | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 61_exact_match_counting 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 73a_time_window_anomaly 🔗 | 🟢 | 🔴 | 🟢 | 🔴 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 |
| 73b_time_window_anomaly 🔗 | 🟢 | 🟢 | 🔴 | 🔴 | 🟢 | 🔴 | 🟢 | 🟢 | 🟢 |
| 96_no_matching_runbook 🔗 | 🔴 | 🟢 | 🔴 | 🟢 | 🔴 | 🟢 | 🟢 | 🔴 | 🟢 |
| SUMMARY | 🟡 79% (11/14) | 🟡 64% (9/14) | 🟡 50% (7/14) | 🟡 57% (8/14) | 🟡 64% (9/14) | 🟡 71% (10/14) | 🟡 79% (11/14) | 🟡 86% (12/14) | 🟡 93% (13/14) |
Detailed Raw Results
| Eval ID | deepseek-chat | deepseek-reasoner | gemini-3-flash-preview | gemini-3-pro-preview | gpt-5.2-high-reasoning | haiku-4.5 | kimi-2.5-openrouter | opus-4.5 | sonnet-4.5 |
|---|---|---|---|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 100% (1/1) / ⏱️ 152.2s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 44.8s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 25.8s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 47.2s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 46.9s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 27.8s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 62.3s | 🟢 100% (1/1) / ⏱️ 36.5s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 37.2s / 💰 $0.18 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟢 100% (1/1) / ⏱️ 157.2s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 430.3s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 104.3s | 🔴 0% (0/1) / ⏱️ 79.4s | 🟢 100% (1/1) / ⏱️ 590.4s / 💰 $0.62 | 🟢 100% (1/1) / ⏱️ 32.0s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 97.7s | 🟢 100% (1/1) / ⏱️ 36.4s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 42.5s / 💰 $0.18 |
| 108_logs_nearby_lines 🔗 | 🔴 0% (0/1) / ⏱️ 143.4s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 323.1s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 52.2s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 97.4s / 💰 $0.22 | 🔴 0% (0/1) / ⏱️ 493.6s / 💰 $0.44 | 🔴 0% (0/1) / ⏱️ 31.3s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 34.6s | 🔴 0% (0/1) / ⏱️ 49.2s / 💰 $0.38 | 🔴 0% (0/1) / ⏱️ 58.4s / 💰 $0.24 |
| 111_pod_names_contain_service 🔗 | 🟢 100% (1/1) / ⏱️ 130.2s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 193.3s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 22.8s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 39.4s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 29.5s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 25.9s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 73.3s | 🟢 100% (1/1) / ⏱️ 37.6s / 💰 $0.23 | 🟢 100% (1/1) / ⏱️ 35.8s / 💰 $0.15 |
| 12_job_crashing 🔗 | 🟢 100% (1/1) / ⏱️ 192.4s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 226.7s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 51.3s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 128.0s / 💰 $0.24 | 🟢 100% (1/1) / ⏱️ 348.2s / 💰 $0.32 | 🟢 100% (1/1) / ⏱️ 35.9s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 81.4s | 🟢 100% (1/1) / ⏱️ 38.1s / 💰 $0.24 | 🟢 100% (1/1) / ⏱️ 40.4s / 💰 $0.20 |
| 162_get_runbooks 🔗 | 🟢 100% (1/1) / ⏱️ 129.1s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 265.0s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 39.2s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 62.4s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 261.5s / 💰 $0.30 | 🔴 0% (0/1) / ⏱️ 27.7s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 41.2s | 🟢 100% (1/1) / ⏱️ 44.0s / 💰 $0.32 | 🟢 100% (1/1) / ⏱️ 53.0s / 💰 $0.25 |
| 176_network_policy_blocking_traffic_no_runbooks 🔗 | 🟢 100% (1/1) / ⏱️ 167.9s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 307.8s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 41.8s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 7.0s | 🟢 100% (1/1) / ⏱️ 254.6s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 37.0s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 85.7s | 🟢 100% (1/1) / ⏱️ 52.7s / 💰 $0.32 | 🟢 100% (1/1) / ⏱️ 44.7s / 💰 $0.21 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 100% (1/1) / ⏱️ 62.9s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 154.6s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 16.5s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 40.4s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 46.5s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 23.8s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 46.0s | 🟢 100% (1/1) / ⏱️ 24.1s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 21.5s / 💰 $0.12 |
| 24_misconfigured_pvc 🔗 | 🟢 100% (1/1) / ⏱️ 97.1s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 30.3s / 💰 $0.00 | 🔴 0% (0/1) / ⏱️ 17.8s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 76.1s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 10.7s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 7.6s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 66.0s | 🟢 100% (1/1) / ⏱️ 42.1s / 💰 $0.26 | 🟢 100% (1/1) / ⏱️ 38.8s / 💰 $0.18 |
| 43_current_datetime_from_prompt 🔗 | 🔴 0% (0/1) / ⏱️ 52.5s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 41.5s / 💰 $0.00 | 🔴 0% (0/1) / ⏱️ 15.2s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 21.2s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 33.0s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 8.0s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 12.2s | 🟢 100% (1/1) / ⏱️ 11.4s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 7.6s / 💰 $0.07 |
| 61_exact_match_counting 🔗 | 🟢 100% (1/1) / ⏱️ 68.5s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 97.5s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 11.3s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 104.4s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 74.2s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 15.4s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 14.2s | 🟢 100% (1/1) / ⏱️ 29.3s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 19.5s / 💰 $0.11 |
| 73a_time_window_anomaly 🔗 | 🟢 100% (1/1) / ⏱️ 104.4s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 320.8s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 79.6s / 💰 $0.16 | 🔴 0% (0/1) / ⏱️ 55.6s | 🟢 100% (1/1) / ⏱️ 309.2s / 💰 $0.45 | 🟢 100% (1/1) / ⏱️ 27.1s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 104.4s | 🟢 100% (1/1) / ⏱️ 34.1s / 💰 $0.22 | 🟢 100% (1/1) / ⏱️ 44.3s / 💰 $0.22 |
| 73b_time_window_anomaly 🔗 | 🟢 100% (1/1) / ⏱️ 181.8s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 255.2s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 19.0s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 101.4s / 💰 $0.23 | 🟢 100% (1/1) / ⏱️ 251.9s / 💰 $0.23 | 🔴 0% (0/1) / ⏱️ 28.2s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 52.1s | 🟢 100% (1/1) / ⏱️ 38.8s / 💰 $0.23 | 🟢 100% (1/1) / ⏱️ 45.8s / 💰 $0.20 |
| 96_no_matching_runbook 🔗 | 🔴 0% (0/1) / ⏱️ 205.2s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 422.5s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 34.6s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 117.1s / 💰 $0.25 | 🔴 0% (0/1) / ⏱️ 836.0s / 💰 $0.75 | 🟢 100% (1/1) / ⏱️ 56.3s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 218.4s | 🔴 0% (0/1) / ⏱️ 62.5s / 💰 $0.37 | 🟢 100% (1/1) / ⏱️ 66.8s / 💰 $0.31 |
Results are automatically generated and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-21471579810.