⚡ January 20, 2026
Generated: 2026-01-20 14:37 UTC
Total Duration: 46m 18s
Iterations: 1
Judge (classifier) model: gpt-4.1
Fast Benchmark
Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| eu.anthropic.claude-haiku-4-5-20251001-v1:0 | 52 | 18 | 0 | 70 | 🟡 74% (52/70) |
| eu.anthropic.claude-opus-4-5-20251101-v1:0 | 62 | 8 | 0 | 70 | 🟡 89% (62/70) |
| eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 59 | 11 | 0 | 70 | 🟡 84% (59/70) |
| gpt-5.2 | 56 | 14 | 0 | 70 | 🟡 80% (56/70) |
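The Success Rate column appears to be the pass count divided by the total, rounded to a whole percent. The sketch below reproduces that arithmetic from the counts in the table above. The function name `success_rate` and the round-to-nearest rule are assumptions made for illustration, not the report generator's actual code.

```python
def success_rate(passed: int, total: int) -> str:
    """Format a pass count as a rounded percentage plus the raw fraction."""
    if total <= 0:
        raise ValueError("total must be a positive number of test runs")
    # Assumption: the report rounds to the nearest whole percent.
    percent = round(100 * passed / total)
    return f"{percent}% ({passed}/{total})"


# Pass counts copied from the table above; each model ran 70 tests.
PASSES = {
    "eu.anthropic.claude-haiku-4-5-20251001-v1:0": 52,
    "eu.anthropic.claude-opus-4-5-20251101-v1:0": 62,
    "eu.anthropic.claude-sonnet-4-5-20250929-v1:0": 59,
    "gpt-5.2": 56,
}

for model, passed in PASSES.items():
    print(f"{model}: {success_rate(passed, 70)}")
```

Under that rounding assumption, each printed value matches the corresponding Success Rate cell.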
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| eu.anthropic.claude-haiku-4-5-20251001-v1:0 | 70 | $0.05 | $0.00 | $0.13 | $3.26 |
| eu.anthropic.claude-opus-4-5-20251101-v1:0 | 70 | $0.19 | $0.01 | $0.38 | $13.15 |
| eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 70 | $0.16 | $0.01 | $0.33 | $11.23 |
| gpt-5.2 | 70 | $0.09 | $0.01 | $0.31 | $6.24 |
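As a consistency check on the table above, the Avg Cost column can be recomputed as Total Cost divided by the number of tests. This is a minimal sketch that assumes the average is a plain mean over all 70 runs; `average_cost` is a hypothetical helper, and because the totals shown are themselves rounded to cents, the check is only approximate.

```python
def average_cost(total_cost: float, tests: int) -> str:
    """Return the mean cost per test, formatted to cents like the table."""
    if tests <= 0:
        raise ValueError("tests must be a positive number of test runs")
    return f"${total_cost / tests:.2f}"


# Total Cost values copied from the table above (USD, 70 tests per model).
TOTAL_COST = {
    "eu.anthropic.claude-haiku-4-5-20251001-v1:0": 3.26,
    "eu.anthropic.claude-opus-4-5-20251101-v1:0": 13.15,
    "eu.anthropic.claude-sonnet-4-5-20250929-v1:0": 11.23,
    "gpt-5.2": 6.24,
}

for model, total in TOTAL_COST.items():
    print(f"{model}: {average_cost(total, 70)} per test")
```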
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| eu.anthropic.claude-haiku-4-5-20251001-v1:0 | 31.1 | 3.9 | 79.2 | 29.6 | 63.9 |
| eu.anthropic.claude-opus-4-5-20251101-v1:0 | 38.4 | 6.1 | 75.1 | 39.6 | 60.8 |
| eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 41.7 | 5.2 | 91.5 | 44.9 | 63.5 |
| gpt-5.2 | 37.4 | 9.2 | 90.2 | 38.2 | 66.9 |
Performance by Tag
Success rate by test category and model:
| Tag | eu.anthropic.claude-haiku-4-5-20251001-v1:0 | eu.anthropic.claude-opus-4-5-20251101-v1:0 | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | gpt-5.2 | Warnings |
|---|---|---|---|---|---|
| benchmark | 🟡 60% (15/25) | 🟡 76% (19/25) | 🟡 64% (16/25) | 🟡 68% (17/25) | |
| context_window | 🟡 50% (5/10) | 🟢 100% (10/10) | 🟡 60% (6/10) | 🟡 90% (9/10) | |
| counting | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| datetime | 🟡 67% (10/15) | 🟢 100% (15/15) | 🟡 73% (11/15) | 🟡 87% (13/15) | |
| easy | 🟡 80% (28/35) | 🟡 94% (33/35) | 🟡 94% (33/35) | 🟡 83% (29/35) | |
| grafana-dashboard | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| hard | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| kubernetes | 🟡 77% (27/35) | 🟡 94% (33/35) | 🟡 94% (33/35) | 🟡 94% (33/35) | |
| logs | 🟡 40% (8/20) | 🟡 65% (13/20) | 🟡 45% (9/20) | 🟡 60% (12/20) | |
| loki | 🟡 60% (3/5) | 🟡 60% (3/5) | 🟡 60% (3/5) | 🟡 60% (3/5) | |
| medium | 🟡 63% (19/30) | 🟡 80% (24/30) | 🟡 70% (21/30) | 🟡 73% (22/30) | |
| metrics | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| network | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| one-test | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| port-forward | 🟡 80% (8/10) | 🟡 80% (8/10) | 🟡 80% (8/10) | 🟡 80% (8/10) | |
| question-answer | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| regression | 🟡 82% (37/45) | 🟡 96% (43/45) | 🟡 96% (43/45) | 🟡 87% (39/45) | |
| runbooks | 🟡 90% (9/10) | 🟡 90% (9/10) | 🟢 100% (10/10) | 🟡 80% (8/10) | |
| Overall | 🟡 74% (52/70) | 🟡 89% (62/70) | 🟡 84% (59/70) | 🟡 80% (56/70) | |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
| Eval ID | eu.anthropic.claude-haiku-4-5-20251001-v1:0 | eu.anthropic.claude-opus-4-5-20251101-v1:0 | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | gpt-5.2 |
|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 | 🟢 | 🟢 | 🟢 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟡 | 🟡 | 🟡 | 🟡 |
| 108_logs_nearby_lines 🔗 | 🔴 | 🔴 | 🔴 | 🔴 |
| 111_pod_names_contain_service 🔗 | 🟢 | 🟢 | 🟢 | 🟢 |
| 12_job_crashing 🔗 | 🟢 | 🟢 | 🟢 | 🟡 |
| 162_get_runbooks 🔗 | 🟡 | 🟢 | 🟢 | 🟢 |
| 176_network_policy_blocking_traffic_no_runbooks 🔗 | 🟢 | 🟢 | 🟢 | 🟢 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 | 🟢 | 🟢 | 🟢 |
| 24_misconfigured_pvc 🔗 | 🔴 | 🟢 | 🟢 | 🟢 |
| 43_current_datetime_from_prompt 🔗 | 🟢 | 🟢 | 🟢 | 🟡 |
| 61_exact_match_counting 🔗 | 🟢 | 🟢 | 🟢 | 🟢 |
| 73a_time_window_anomaly 🔗 | 🟡 | 🟢 | 🟡 | 🟢 |
| 73b_time_window_anomaly 🔗 | 🟡 | 🟢 | 🟡 | 🟡 |
| 96_no_matching_runbook 🔗 | 🟢 | 🟡 | 🟢 | 🟡 |
| SUMMARY | 🟡 74% (52/70) | 🟡 89% (62/70) | 🟡 84% (59/70) | 🟡 80% (56/70) |
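The three pass-rate colors in the legend above reduce to a simple rule on the pass count. The following sketch shows one way that mapping could be written; `status_icon` and the constant names are hypothetical, and the special statuses (mock data failure, setup failure, timeout, skipped) are deliberately left out.

```python
STABLE = "🟢"   # passing 100%
PARTIAL = "🟡"  # passing 1-99%
FAILING = "🔴"  # passing 0%


def status_icon(passed: int, total: int) -> str:
    """Map a pass count to the legend's color for that pass rate."""
    if total <= 0:
        raise ValueError("total must be a positive number of runs")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    if passed == total:
        return STABLE
    if passed == 0:
        return FAILING
    return PARTIAL


# Example: three outcomes for an eval that was run 5 times.
for passed in (5, 3, 0):
    print(f"{passed}/5 -> {status_icon(passed, 5)}")
```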
Detailed Raw Results
| Eval ID | eu.anthropic.claude-haiku-4-5-20251001-v1:0 | eu.anthropic.claude-opus-4-5-20251101-v1:0 | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | gpt-5.2 |
|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 100% (5/5) / ⏱️ 29.3s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 32.7s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 34.6s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 40.2s / 💰 $0.12 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟡 60% (3/5) / ⏱️ 41.7s / 💰 $0.06 | 🟡 60% (3/5) / ⏱️ 59.1s / 💰 $0.27 | 🟡 60% (3/5) / ⏱️ 63.9s / 💰 $0.21 | 🟡 60% (3/5) / ⏱️ 54.3s / 💰 $0.11 |
| 108_logs_nearby_lines 🔗 | 🔴 0% (0/5) / ⏱️ 58.4s / 💰 $0.10 | 🔴 0% (0/5) / ⏱️ 50.1s / 💰 $0.31 | 🔴 0% (0/5) / ⏱️ 54.7s / 💰 $0.18 | 🔴 0% (0/5) / ⏱️ 58.9s / 💰 $0.13 |
| 111_pod_names_contain_service 🔗 | 🟢 100% (5/5) / ⏱️ 24.6s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 34.0s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 43.7s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 46.5s / 💰 $0.11 |
| 12_job_crashing 🔗 | 🟢 100% (5/5) / ⏱️ 29.8s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 44.8s / 💰 $0.19 | 🟢 100% (5/5) / ⏱️ 52.4s / 💰 $0.18 | 🟡 40% (2/5) / ⏱️ 41.5s / 💰 $0.09 |
| 162_get_runbooks 🔗 | 🟡 80% (4/5) / ⏱️ 33.1s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 46.2s / 💰 $0.27 | 🟢 100% (5/5) / ⏱️ 52.5s / 💰 $0.21 | 🟢 100% (5/5) / ⏱️ 20.1s / 💰 $0.02 |
| 176_network_policy_blocking_traffic_no_runbooks 🔗 | 🟢 100% (5/5) / ⏱️ 43.9s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 43.8s / 💰 $0.25 | 🟢 100% (5/5) / ⏱️ 41.8s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 35.9s / 💰 $0.07 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 100% (5/5) / ⏱️ 26.2s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 31.9s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 23.8s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 20.3s / 💰 $0.06 |
| 24_misconfigured_pvc 🔗 | 🔴 0% (0/5) / ⏱️ 5.9s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 39.5s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 39.7s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 40.5s / 💰 $0.09 |
| 43_current_datetime_from_prompt 🔗 | 🟢 100% (5/5) / ⏱️ 4.3s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 6.5s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 5.5s / 💰 $0.01 | 🟡 80% (4/5) / ⏱️ 10.0s / 💰 $0.02 |
| 61_exact_match_counting 🔗 | 🟢 100% (5/5) / ⏱️ 13.7s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 17.8s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 14.9s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 16.3s / 💰 $0.04 |
| 73a_time_window_anomaly 🔗 | 🟡 20% (1/5) / ⏱️ 34.9s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 34.7s / 💰 $0.14 | 🟡 80% (4/5) / ⏱️ 49.9s / 💰 $0.25 | 🟢 100% (5/5) / ⏱️ 34.6s / 💰 $0.08 |
| 73b_time_window_anomaly 🔗 | 🟡 80% (4/5) / ⏱️ 34.6s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 40.4s / 💰 $0.15 | 🟡 40% (2/5) / ⏱️ 46.3s / 💰 $0.24 | 🟡 80% (4/5) / ⏱️ 44.7s / 💰 $0.11 |
| 96_no_matching_runbook 🔗 | 🟢 100% (5/5) / ⏱️ 54.8s / 💰 $0.10 | 🟡 80% (4/5) / ⏱️ 56.1s / 💰 $0.33 | 🟢 100% (5/5) / ⏱️ 60.8s / 💰 $0.30 | 🟡 60% (3/5) / ⏱️ 59.5s / 💰 $0.19 |
Results are generated automatically and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-21173789819.