⚡ June 09, 2026
Generated: 2026-06-09 11:37 UTC
Total Duration: 1h 5m 13s
Iterations: 5
Judge (classifier) model: gpt-4.1
Fast Benchmark
Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| gpt-5.4 | 70 | 15 | 0 | 85 | 🟡 82% (70/85) |
| gpt-5.5 | 75 | 10 | 0 | 85 | 🟡 88% (75/85) |
| opus-4.6 | 76 | 9 | 0 | 85 | 🟡 89% (76/85) |
| opus-4.7 | 74 | 11 | 0 | 85 | 🟡 87% (74/85) |
| opus-4.8 | 75 | 10 | 0 | 85 | 🟡 88% (75/85) |
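The Success Rate column can be reproduced from the Pass, Fail, and Skip/Error counts. The following is a minimal sketch, not the report generator's actual code: the function name `success_rate` is hypothetical, and it assumes that Total is the sum of the three count columns and that percentages are rounded with Python's built-in `round`.

```python
def success_rate(passed: int, failed: int, skipped: int = 0) -> str:
    """Format a pass rate as shown in the Success Rate column, e.g. '82% (70/85)'.

    Assumption: Total = Pass + Fail + Skip/Error, as the table rows suggest.
    """
    total = passed + failed + skipped
    if total == 0:
        return "n/a"
    # Built-in round() uses round-half-to-even, so 92.5 becomes 92 and 87.5 becomes 88.
    pct = round(100 * passed / total)
    return f"{pct}% ({passed}/{total})"


# Counts taken from the accuracy table above.
for model, passed, failed in [("gpt-5.4", 70, 15), ("opus-4.6", 76, 9)]:
    print(model, success_rate(passed, failed))
```

Under these assumptions the sketch agrees with the values in the table, including the half-way cases in the tag breakdown such as 37/40 and 35/40.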
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| gpt-5.4 | 85 | $0.05 | $0.00 | $0.11 | $3.87 |
| gpt-5.5 | 85 | $0.16 | $0.01 | $0.48 | $13.83 |
| opus-4.6 | 85 | $0.25 | $0.10 | $2.98 | $21.13 |
| opus-4.7 | 85 | $0.18 | $0.02 | $0.94 | $14.99 |
| opus-4.8 | 85 | $0.22 | $0.01 | $1.26 | $18.57 |
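As a quick sanity check, Avg Cost should equal Total Cost divided by Tests. The sketch below recomputes it from the table's own figures; the helper names are hypothetical, and because Total Cost is already rounded to cents the result is only approximate.

```python
def average_cost(total_cost_usd: float, tests: int) -> float:
    """Average cost per test run, in USD."""
    if tests <= 0:
        raise ValueError("tests must be positive")
    return total_cost_usd / tests


def format_usd(amount: float) -> str:
    """Format a dollar amount with two decimals, as in the cost table."""
    return f"${amount:.2f}"


# (tests, total cost) pairs copied from the cost table above.
rows = {
    "gpt-5.4": (85, 3.87),
    "gpt-5.5": (85, 13.83),
    "opus-4.6": (85, 21.13),
    "opus-4.7": (85, 14.99),
    "opus-4.8": (85, 18.57),
}
for model, (tests, total) in rows.items():
    print(model, format_usd(average_cost(total, tests)))
```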
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| gpt-5.4 | 23.6 | 3.4 | 49.6 | 24.5 | 43.2 |
| gpt-5.5 | 44.9 | 4.7 | 152.0 | 41.5 | 92.0 |
| opus-4.6 | 41.6 | 5.9 | 558.2 | 33.6 | 77.5 |
| opus-4.7 | 27.4 | 4.1 | 147.3 | 23.9 | 56.2 |
| opus-4.8 | 38.2 | 4.2 | 275.9 | 25.0 | 101.9 |
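The report does not say which percentile method produces the P50 and P95 columns. The sketch below uses the nearest-rank method as one plausible choice; the function names and the sample durations are made up for illustration and are not benchmark data.

```python
import math


def percentile_nearest_rank(values: list[float], p: float) -> float:
    """Return the p-th percentile using the nearest-rank method (0 < p <= 100)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def latency_summary(durations_s: list[float]) -> dict[str, float]:
    """Summarize per-run durations with the same columns as the latency table."""
    return {
        "avg": round(sum(durations_s) / len(durations_s), 1),
        "min": min(durations_s),
        "max": max(durations_s),
        "p50": percentile_nearest_rank(durations_s, 50),
        "p95": percentile_nearest_rank(durations_s, 95),
    }


# Hypothetical durations in seconds, for illustration only.
sample = [4.0, 12.0, 20.0, 28.0, 36.0]
print(latency_summary(sample))
```

A large gap between P95 and Max, as in the opus-4.6 row (77.5 s versus 558.2 s), points to a small number of outlier runs rather than generally slow responses.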
Performance by Tag
Success rate by test category and model:
| Tag | gpt-5.4 | gpt-5.5 | opus-4.6 | opus-4.7 | opus-4.8 | Warnings |
|---|---|---|---|---|---|---|
| benchmark | 🟡 67% (20/30) | 🟡 73% (22/30) | 🟡 73% (22/30) | 🟡 77% (23/30) | 🟡 73% (22/30) | |
| context_window | 🟡 80% (8/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟡 90% (9/10) | |
| counting | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟡 90% (9/10) | 🟢 100% (10/10) | |
| datetime | 🟡 87% (13/15) | 🟢 100% (15/15) | 🟢 100% (15/15) | 🟢 100% (15/15) | 🟡 93% (14/15) | |
| easy | 🟡 88% (35/40) | 🟡 95% (38/40) | 🟡 98% (39/40) | 🟡 92% (37/40) | 🟡 95% (38/40) | |
| grafana | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| hard | 🟡 50% (5/10) | 🟡 50% (5/10) | 🟡 50% (5/10) | 🟡 50% (5/10) | 🟡 50% (5/10) | |
| kubernetes | 🟡 89% (40/45) | 🟡 89% (40/45) | 🟡 93% (42/45) | 🟡 87% (39/45) | 🟡 91% (41/45) | |
| logs | 🟡 60% (18/30) | 🟡 67% (20/30) | 🟡 73% (22/30) | 🟡 70% (21/30) | 🟡 67% (20/30) | |
| loki | 🟡 50% (5/10) | 🟡 50% (5/10) | 🟡 70% (7/10) | 🟡 60% (6/10) | 🟡 60% (6/10) | |
| medium | 🟡 83% (25/30) | 🟡 90% (27/30) | 🟡 90% (27/30) | 🟡 93% (28/30) | 🟡 90% (27/30) | |
| metrics | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| network | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 80% (4/5) | 🟢 100% (5/5) | |
| one-test | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| port-forward | 🟡 67% (10/15) | 🟡 67% (10/15) | 🟡 80% (12/15) | 🟡 73% (11/15) | 🟡 73% (11/15) | |
| question-answer | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| regression | 🟡 91% (50/55) | 🟡 96% (53/55) | 🟡 98% (54/55) | 🟡 93% (51/55) | 🟡 96% (53/55) | |
| skills | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 80% (4/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| Overall | 🟡 82% (70/85) | 🟡 88% (75/85) | 🟡 89% (76/85) | 🟡 87% (74/85) | 🟡 88% (75/85) | |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
| Eval ID | gpt-5.4 | gpt-5.5 | opus-4.6 | opus-4.7 | opus-4.8 |
|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 100a_loki_historical_logs 🔗 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| 108_logs_nearby_lines 🔗 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 |
| 112_find_pvcs_by_uuid 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 12_job_crashing 🔗 | 🟡 | 🟢 | 🟢 | 🟢 | 🟢 |
| 176_network_policy_blocking_traffic_no_skills 🔗 | 🟢 | 🟢 | 🟢 | 🟡 | 🟢 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 227_count_configmaps_per_namespace[0] 🔗 | 🟢 | 🟢 | 🟢 | 🟡 | 🟢 |
| 243_pod_names_contain_service 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 24_misconfigured_pvc 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 43_current_datetime_from_prompt 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 51_logs_summarize_errors 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 61_exact_match_counting 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 73a_time_window_anomaly 🔗 | 🟡 | 🟢 | 🟢 | 🟢 | 🟡 |
| 73b_time_window_anomaly 🔗 | 🟡 | 🟢 | 🟢 | 🟢 | 🟢 |
| 96_no_matching_skill 🔗 | 🟢 | 🟢 | 🟡 | 🟢 | 🟢 |
| SUMMARY | 🟡 82% (70/85) | 🟡 88% (75/85) | 🟡 89% (76/85) | 🟡 87% (74/85) | 🟡 88% (75/85) |
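The pass-rate part of the color coding can be sketched as a small mapping from counts to an icon. This is an illustrative sketch only: `status_icon` is a hypothetical name, and the non-pass-rate states (mock data failure, setup failure, timeout, skipped) are assumed to be decided elsewhere, before any pass rate is computed.

```python
def status_icon(passed: int, total: int) -> str:
    """Map a pass count to the legend icon for a single eval/model cell.

    Assumption: only the three pass-rate states are handled here.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    if passed == total:
        return "🟢"  # passing 100% (stable)
    if passed == 0:
        return "🔴"  # passing 0% (failing)
    return "🟡"      # passing 1-99%


# With 5 iterations per eval, as in this run:
print(status_icon(5, 5), status_icon(4, 5), status_icon(0, 5))
```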
Detailed Raw Results
| Eval ID | gpt-5.4 | gpt-5.5 | opus-4.6 | opus-4.7 | opus-4.8 |
|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 100% (5/5) / ⏱️ 21.4s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 38.2s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 30.3s / 💰 $0.21 | 🟢 100% (5/5) / ⏱️ 18.2s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 19.8s / 💰 $0.10 |
| 100a_loki_historical_logs 🔗 | 🟡 40% (2/5) / ⏱️ 29.8s / 💰 $0.05 | 🟡 40% (2/5) / ⏱️ 85.4s / 💰 $0.30 | 🟡 60% (3/5) / ⏱️ 104.9s / 💰 $0.44 | 🟡 60% (3/5) / ⏱️ 54.7s / 💰 $0.29 | 🟡 60% (3/5) / ⏱️ 95.6s / 💰 $0.42 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟡 60% (3/5) / ⏱️ 28.7s / 💰 $0.05 | 🟡 60% (3/5) / ⏱️ 94.1s / 💰 $0.28 | 🟡 80% (4/5) / ⏱️ 150.2s / 💰 $0.79 | 🟡 60% (3/5) / ⏱️ 42.9s / 💰 $0.28 | 🟡 60% (3/5) / ⏱️ 60.4s / 💰 $0.33 |
| 108_logs_nearby_lines 🔗 | 🔴 0% (0/5) / ⏱️ 34.6s / 💰 $0.07 | 🔴 0% (0/5) / ⏱️ 62.1s / 💰 $0.21 | 🔴 0% (0/5) / ⏱️ 34.7s / 💰 $0.22 | 🔴 0% (0/5) / ⏱️ 32.1s / 💰 $0.17 | 🔴 0% (0/5) / ⏱️ 42.5s / 💰 $0.23 |
| 112_find_pvcs_by_uuid 🔗 | 🟢 100% (5/5) / ⏱️ 17.1s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 20.2s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 16.9s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 13.9s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 16.3s / 💰 $0.07 |
| 12_job_crashing 🔗 | 🟡 40% (2/5) / ⏱️ 28.8s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 48.9s / 💰 $0.17 | 🟢 100% (5/5) / ⏱️ 37.4s / 💰 $0.24 | 🟢 100% (5/5) / ⏱️ 32.9s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 34.7s / 💰 $0.17 |
| 176_network_policy_blocking_traffic_no_skills 🔗 | 🟢 100% (5/5) / ⏱️ 32.2s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 54.8s / 💰 $0.21 | 🟢 100% (5/5) / ⏱️ 40.3s / 💰 $0.26 | 🟡 80% (4/5) / ⏱️ 54.5s / 💰 $0.42 | 🟢 100% (5/5) / ⏱️ 137.1s / 💰 $0.71 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 100% (5/5) / ⏱️ 11.0s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 12.6s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 16.8s / 💰 $0.17 | 🟢 100% (5/5) / ⏱️ 14.2s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 14.2s / 💰 $0.31 |
| 227_count_configmaps_per_namespace[0] 🔗 | 🟢 100% (5/5) / ⏱️ 13.2s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 20.3s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 16.9s / 💰 $0.15 | 🟡 80% (4/5) / ⏱️ 18.7s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 17.6s / 💰 $0.13 |
| 243_pod_names_contain_service 🔗 | 🟢 100% (5/5) / ⏱️ 21.2s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 44.3s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 33.4s / 💰 $0.19 | 🟢 100% (5/5) / ⏱️ 20.1s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 18.7s / 💰 $0.08 |
| 24_misconfigured_pvc 🔗 | 🟢 100% (5/5) / ⏱️ 36.1s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 43.1s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 33.7s / 💰 $0.22 | 🟢 100% (5/5) / ⏱️ 21.8s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 26.9s / 💰 $0.12 |
| 43_current_datetime_from_prompt 🔗 | 🟢 100% (5/5) / ⏱️ 3.8s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 5.5s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 6.3s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 4.7s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 5.7s / 💰 $0.03 |
| 51_logs_summarize_errors 🔗 | 🟢 100% (5/5) / ⏱️ 17.2s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 19.7s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 23.1s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 20.5s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 20.3s / 💰 $0.11 |
| 61_exact_match_counting 🔗 | 🟢 100% (5/5) / ⏱️ 5.5s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 6.0s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 9.1s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 7.5s / 💰 $0.17 | 🟢 100% (5/5) / ⏱️ 7.7s / 💰 $0.11 |
| 73a_time_window_anomaly 🔗 | 🟡 80% (4/5) / ⏱️ 28.5s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 58.2s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 54.1s / 💰 $0.27 | 🟢 100% (5/5) / ⏱️ 31.1s / 💰 $0.14 | 🟡 80% (4/5) / ⏱️ 26.7s / 💰 $0.15 |
| 73b_time_window_anomaly 🔗 | 🟡 80% (4/5) / ⏱️ 29.3s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 65.1s / 💰 $0.24 | 🟢 100% (5/5) / ⏱️ 49.7s / 💰 $0.25 | 🟢 100% (5/5) / ⏱️ 31.3s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 29.9s / 💰 $0.15 |
| 96_no_matching_skill 🔗 | 🟢 100% (5/5) / ⏱️ 42.0s / 💰 $0.09 | 🟢 100% (5/5) / ⏱️ 84.7s / 💰 $0.36 | 🟡 80% (4/5) / ⏱️ 49.3s / 💰 $0.30 | 🟢 100% (5/5) / ⏱️ 47.5s / 💰 $0.38 | 🟢 100% (5/5) / ⏱️ 74.7s / 💰 $0.49 |
Results are automatically generated and updated weekly. View full traces and detailed analysis in the Braintrust experiment: ci-benchmark-27200013134.