⚡ January 04, 2026
Generated: 2026-01-04 17:43 UTC
Total Duration: 43m 0s
Iterations: 5
Judge (classifier) model: gpt-4.1
About this Benchmark
Fast Benchmark: quick regression tests selected by the `regression` or `benchmark` markers, designed to run frequently and catch regressions.
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| deepseek-3.1 | 49 | 21 | 0 | 70 | 🟡 70% (49/70) |
| gpt-5 | 37 | 33 | 0 | 70 | 🟡 53% (37/70) |
| gpt-5.1 | 36 | 34 | 0 | 70 | 🟡 51% (36/70) |
| haiku-4.5 | 44 | 26 | 0 | 70 | 🟡 63% (44/70) |
| sonnet-4.5 | 59 | 11 | 0 | 70 | 🟡 84% (59/70) |
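The Success Rate column appears to be passes divided by total runs, rounded to a whole percent. Each model's 70 runs match the 14 evals listed under Raw Results multiplied by the 5 iterations. A minimal sketch of that arithmetic follows; `success_rate` is a hypothetical helper for illustration, not part of HolmesGPT, and the rounding rule is an assumption that happens to reproduce the published figures.

```python
def success_rate(passed: int, total: int) -> int:
    """Return the pass rate as a whole percent (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total)


# 14 evals x 5 iterations gives the 70 runs reported per model.
TOTAL_RUNS = 14 * 5

# Pass counts copied from the accuracy table above.
passes = {
    "deepseek-3.1": 49,
    "gpt-5": 37,
    "gpt-5.1": 36,
    "haiku-4.5": 44,
    "sonnet-4.5": 59,
}

rates = {model: success_rate(count, TOTAL_RUNS) for model, count in passes.items()}
```

Under these assumptions, `rates` reproduces the percentages shown in the table.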
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| gpt-5 | 70 | $0.06 | $0.00 | $0.28 | $4.38 |
| gpt-5.1 | 65 | $0.11 | $0.00 | $0.39 | $7.46 |
| haiku-4.5 | 70 | $0.04 | $0.00 | $0.11 | $2.84 |
| sonnet-4.5 | 70 | $0.17 | $0.01 | $0.34 | $11.66 |
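The Avg Cost column is consistent with Total Cost divided by the Tests count, rounded to the cent. Note that gpt-5.1 has cost data for 65 runs rather than 70, so its average uses a different denominator. The sketch below checks this relationship; `avg_cost` is a hypothetical helper, and the rounding to two decimals is an assumption.

```python
def avg_cost(total_cost: float, tests: int) -> float:
    """Average cost per test in dollars, rounded to the cent (hypothetical helper)."""
    if tests <= 0:
        raise ValueError("tests must be positive")
    return round(total_cost / tests, 2)


# (tests, total cost in USD) copied from the cost table above.
cost_rows = {
    "gpt-5": (70, 4.38),
    "gpt-5.1": (65, 7.46),
    "haiku-4.5": (70, 2.84),
    "sonnet-4.5": (70, 11.66),
}

averages = {model: avg_cost(total, tests) for model, (tests, total) in cost_rows.items()}
```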
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| deepseek-3.1 | 87.8 | 6.6 | 167.8 | 93.2 | 143.1 |
| gpt-5 | 33.1 | 3.7 | 639.3 | 25.7 | 46.6 |
| gpt-5.1 | 89.1 | 6.5 | 338.4 | 74.4 | 202.3 |
| haiku-4.5 | 27.5 | 3.1 | 59.8 | 30.4 | 42.2 |
| sonnet-4.5 | 43.6 | 7.6 | 94.5 | 46.6 | 66.2 |
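The gap between a model's Max and its P95 (for example gpt-5, with a maximum of 639.3 s but a P95 of 46.6 s) suggests a small number of slow outliers. The sketch below illustrates why percentiles are less sensitive to outliers than the mean or the maximum. It uses the nearest-rank method, which is an assumption: the report does not state how its percentiles are computed. The sample latencies are made up for illustration and are not benchmark data.

```python
from statistics import mean


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100 (assumed method)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding issues.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical sample: 19 fast runs and one very slow outlier.
sample = [10.0] * 19 + [600.0]

summary = {
    "avg": mean(sample),
    "max": max(sample),
    "p50": percentile(sample, 50),
    "p95": percentile(sample, 95),
}
```

In this made-up sample the single outlier raises the average to 39.5 s and the maximum to 600 s, while P50 and P95 both stay at 10 s.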
Performance by Tag
Success rate by test category and model:
| Tag | deepseek-3.1 | gpt-5 | gpt-5.1 | haiku-4.5 | sonnet-4.5 | Warnings |
|---|---|---|---|---|---|---|
| benchmark | 🟡 44% (11/25) | 🟡 48% (12/25) | 🟡 36% (9/25) | 🟡 28% (7/25) | 🟡 56% (14/25) | |
| context_window | 🟡 90% (9/10) | 🟡 90% (9/10) | 🟡 50% (5/10) | 🟡 20% (2/10) | 🟡 90% (9/10) | |
| counting | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| datetime | 🟡 87% (13/15) | 🟡 87% (13/15) | 🟡 67% (10/15) | 🟡 47% (7/15) | 🟡 93% (14/15) | |
| easy | 🟡 89% (31/35) | 🟡 63% (22/35) | 🟡 66% (23/35) | 🟡 86% (30/35) | 🟢 100% (35/35) | |
| grafana-dashboard | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | |
| hard | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | |
| kubernetes | 🟡 71% (25/35) | 🟡 43% (15/35) | 🟡 46% (16/35) | 🟡 63% (22/35) | 🟡 86% (30/35) | |
| logs | 🟡 60% (12/20) | 🟡 70% (14/20) | 🟡 50% (10/20) | 🟡 35% (7/20) | 🟡 70% (14/20) | |
| loki | 🟡 60% (3/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 80% (4/5) | 🟢 100% (5/5) | |
| medium | 🟡 60% (18/30) | 🟡 50% (15/30) | 🟡 43% (13/30) | 🟡 47% (14/30) | 🟡 80% (24/30) | |
| metrics | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | |
| network | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 80% (4/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| one-test | 🟢 100% (5/5) | 🟡 40% (2/5) | 🟡 60% (3/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| port-forward | 🟡 30% (3/10) | 🟡 50% (5/10) | 🟡 50% (5/10) | 🟡 40% (4/10) | 🟡 50% (5/10) | |
| question-answer | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | 🔴 0% (0/5) | |
| regression | 🟡 84% (38/45) | 🟡 56% (25/45) | 🟡 60% (27/45) | 🟡 82% (37/45) | 🟢 100% (45/45) | |
| runbooks | 🟡 40% (4/10) | 🟡 60% (6/10) | 🟡 60% (6/10) | 🟡 70% (7/10) | 🟢 100% (10/10) | |
| Overall | 🟡 70% (49/70) | 🟡 53% (37/70) | 🟡 51% (36/70) | 🟡 63% (44/70) | 🟡 84% (59/70) | |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
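The first three entries of this legend depend only on how many iterations passed, which can be sketched as a small mapping. `status_icon` is a hypothetical helper, not part of HolmesGPT; the remaining icons (mock data failure, setup failure, timeout, skipped) depend on error states that this sketch does not model.

```python
def status_icon(passed: int, total: int) -> str:
    """Map a pass count to the legend's colour icon (hypothetical helper)."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("expected 0 <= passed <= total and total > 0")
    if passed == total:
        return "🟢"  # passing 100% (stable)
    if passed == 0:
        return "🔴"  # passing 0% (failing)
    return "🟡"  # passing 1-99%


# With 5 iterations per eval, only 5/5 is green and only 0/5 is red.
icons = [status_icon(passed, 5) for passed in range(6)]
```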
| Eval ID | deepseek-3.1 | gpt-5 | gpt-5.1 | haiku-4.5 | sonnet-4.5 |
|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 | 🟡 | 🟡 | 🟢 | 🟢 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟡 | 🟢 | 🟢 | 🟡 | 🟢 |
| 108_logs_nearby_lines 🔗 | 🔴 | 🔴 | 🔴 | 🟡 | 🔴 |
| 111_pod_names_contain_service 🔗 | 🟢 | 🔴 | 🟡 | 🟡 | 🟢 |
| 12_job_crashing 🔗 | 🟡 | 🟡 | 🟡 | 🟢 | 🟢 |
| 162_get_runbooks 🔗 | 🟡 | 🟡 | 🟡 | 🟡 | 🟢 |
| 176_network_policy_blocking_traffic_no_runbooks 🔗 | 🟢 | 🟢 | 🟡 | 🟢 | 🟢 |
| 179_grafana_big_dashboard_query 🔗 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 |
| 24_misconfigured_pvc 🔗 | 🟢 | 🔴 | 🔴 | 🟡 | 🟢 |
| 43_current_datetime_from_prompt 🔗 | 🟡 | 🟡 | 🟢 | 🟢 | 🟢 |
| 61_exact_match_counting 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 73a_time_window_anomaly 🔗 | 🟡 | 🟢 | 🟡 | 🟡 | 🟡 |
| 73b_time_window_anomaly 🔗 | 🟢 | 🟡 | 🟡 | 🟡 | 🟢 |
| 96_no_matching_runbook 🔗 | 🟡 | 🟡 | 🟡 | 🟡 | 🟢 |
| SUMMARY | 🟡 70% (49/70) | 🟡 53% (37/70) | 🟡 51% (36/70) | 🟡 63% (44/70) | 🟡 84% (59/70) |
Detailed Raw Results
| Eval ID | deepseek-3.1 | gpt-5 | gpt-5.1 | haiku-4.5 | sonnet-4.5 |
|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 100% (5/5) / ⏱️ 83.5s | 🟡 40% (2/5) / ⏱️ 15.1s / 💰 $0.04 | 🟡 60% (3/5) / ⏱️ 49.8s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 26.4s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 38.1s / 💰 $0.13 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟡 60% (3/5) / ⏱️ 110.9s | 🟢 100% (5/5) / ⏱️ 29.9s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 196.8s / 💰 $0.24 | 🟡 80% (4/5) / ⏱️ 36.0s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 45.9s / 💰 $0.17 |
| 108_logs_nearby_lines 🔗 | 🔴 0% (0/5) / ⏱️ 134.4s | 🔴 0% (0/5) / ⏱️ 33.2s / 💰 $0.09 | 🔴 0% (0/5) / ⏱️ 131.9s / 💰 $0.18 | 🟡 20% (1/5) / ⏱️ 36.5s / 💰 $0.06 | 🔴 0% (0/5) / ⏱️ 63.4s / 💰 $0.21 |
| 111_pod_names_contain_service 🔗 | 🟢 100% (5/5) / ⏱️ 95.7s | 🔴 0% (0/5) / ⏱️ 12.4s / 💰 $0.02 | 🟡 40% (2/5) / ⏱️ 67.5s / 💰 $0.07 | 🟡 80% (4/5) / ⏱️ 24.0s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 53.5s / 💰 $0.23 |
| 12_job_crashing 🔗 | 🟡 80% (4/5) / ⏱️ 95.0s | 🟡 20% (1/5) / ⏱️ 31.4s / 💰 $0.07 | 🟡 20% (1/5) / ⏱️ 85.4s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 31.0s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 48.8s / 💰 $0.16 |
| 162_get_runbooks 🔗 | 🟡 40% (2/5) / ⏱️ 82.2s | 🟡 60% (3/5) / ⏱️ 159.8s / 💰 $0.13 | 🟡 40% (2/5) / ⏱️ 96.1s / 💰 $0.12 | 🟡 60% (3/5) / ⏱️ 36.0s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 50.5s / 💰 $0.21 |
| 176_network_policy_blocking_traffic_no_runbooks 🔗 | 🟢 100% (5/5) / ⏱️ 93.0s | 🟢 100% (5/5) / ⏱️ 35.3s / 💰 $0.09 | 🟡 80% (4/5) / ⏱️ 91.9s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 37.8s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 49.4s / 💰 $0.21 |
| 179_grafana_big_dashboard_query 🔗 | 🔴 0% (0/5) / ⏱️ 94.4s | 🔴 0% (0/5) / ⏱️ 21.2s / 💰 $0.04 | 🔴 0% (0/5) / ⏱️ 175.0s / 💰 $0.20 | 🔴 0% (0/5) / ⏱️ 29.5s / 💰 $0.04 | 🔴 0% (0/5) / ⏱️ 34.4s / 💰 $0.13 |
| 24_misconfigured_pvc 🔗 | 🟢 100% (5/5) / ⏱️ 103.6s | 🔴 0% (0/5) / ⏱️ 4.3s / 💰 $0.00 | 🔴 0% (0/5) / ⏱️ 16.9s / 💰 $0.01 | 🟡 20% (1/5) / ⏱️ 9.9s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 42.5s / 💰 $0.14 |
| 43_current_datetime_from_prompt 🔗 | 🟡 80% (4/5) / ⏱️ 12.9s | 🟡 80% (4/5) / ⏱️ 7.4s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 24.0s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 6.3s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 9.8s / 💰 $0.01 |
| 61_exact_match_counting 🔗 | 🟢 100% (5/5) / ⏱️ 35.6s | 🟢 100% (5/5) / ⏱️ 16.4s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 49.8s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 14.0s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 19.1s / 💰 $0.06 |
| 73a_time_window_anomaly 🔗 | 🟡 80% (4/5) / ⏱️ 95.6s | 🟢 100% (5/5) / ⏱️ 27.5s / 💰 $0.07 | 🟡 80% (4/5) / ⏱️ 93.0s / 💰 $0.13 | 🟡 20% (1/5) / ⏱️ 31.8s / 💰 $0.05 | 🟡 80% (4/5) / ⏱️ 48.6s / 💰 $0.20 |
| 73b_time_window_anomaly 🔗 | 🟢 100% (5/5) / ⏱️ 93.2s | 🟡 80% (4/5) / ⏱️ 28.2s / 💰 $0.07 | 🟡 20% (1/5) / ⏱️ 37.3s / 💰 $0.03 | 🟡 20% (1/5) / ⏱️ 32.9s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 46.3s / 💰 $0.18 |
| 96_no_matching_runbook 🔗 | 🟡 40% (2/5) / ⏱️ 99.9s | 🟡 60% (3/5) / ⏱️ 41.0s / 💰 $0.13 | 🟡 80% (4/5) / ⏱️ 132.4s / 💰 $0.17 | 🟡 80% (4/5) / ⏱️ 33.5s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 60.6s / 💰 $0.30 |
Results are automatically generated and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment local-benchmark-20260101-140005.