⚡ January 04, 2026

Generated: 2026-01-04 17:43 UTC
Total Duration: 43m 0s
Iterations: 5
Judge (classifier) model: gpt-4.1

About this Benchmark

Fast Benchmark: quick regression tests, selected by the regression or benchmark markers, designed to run frequently and catch regressions early.

HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.

If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.

Model Accuracy Comparison

Model	Pass	Fail	Total	Success Rate
deepseek-3.1	49	21	70	🟡 70% (49/70)
gpt-5	37	33	70	🟡 53% (37/70)
gpt-5.1	36	34	70	🟡 51% (36/70)
haiku-4.5	44	26	70	🟡 63% (44/70)
sonnet-4.5	59	11	70	🟡 84% (59/70)

Model Cost Comparison

Model	Tests	Avg Cost	Min Cost	Max Cost	Total Cost
gpt-5	70	$0.06	$0.00	$0.28	$4.38
gpt-5.1	65	$0.11	$0.00	$0.39	$7.46
haiku-4.5	70	$0.04	$0.00	$0.11	$2.84
sonnet-4.5	70	$0.17	$0.01	$0.34	$11.66
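
The Avg Cost column appears to be Total Cost divided by Tests, rounded to the cent. The sketch below checks that reading against the rows above; the variable and function names are illustrative and not part of the benchmark tooling, and the rounding rule is an assumption.

```python
# Rows copied from the cost table above: model -> (tests, total cost in USD).
COST_ROWS = {
    "gpt-5": (70, 4.38),
    "gpt-5.1": (65, 7.46),
    "haiku-4.5": (70, 2.84),
    "sonnet-4.5": (70, 11.66),
}


def average_cost(tests: int, total_cost: float) -> str:
    """Return the per-test average formatted like the table, e.g. '$0.06'."""
    if tests <= 0:
        raise ValueError("tests must be positive")
    return f"${total_cost / tests:.2f}"


if __name__ == "__main__":
    for model, (tests, total) in COST_ROWS.items():
        print(f"{model}: {average_cost(tests, total)}")
```

Under this assumption the computed averages match the Avg Cost column for all four listed models.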

Model Latency Comparison

Model	Avg (s)	Min (s)	Max (s)	P50 (s)	P95 (s)
deepseek-3.1	87.8	6.6	167.8	93.2	143.1
gpt-5	33.1	3.7	639.3	25.7	46.6
gpt-5.1	89.1	6.5	338.4	74.4	202.3
haiku-4.5	27.5	3.1	59.8	30.4	42.2
sonnet-4.5	43.6	7.6	94.5	46.6	66.2

Performance by Tag

Success rate by test category and model:

Tag	deepseek-3.1	gpt-5	gpt-5.1	haiku-4.5	sonnet-4.5
benchmark	🟡 44% (11/25)	🟡 48% (12/25)	🟡 36% (9/25)	🟡 28% (7/25)	🟡 56% (14/25)
context_window	🟡 90% (9/10)	🟡 90% (9/10)	🟡 50% (5/10)	🟡 20% (2/10)	🟡 90% (9/10)
counting	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)
datetime	🟡 87% (13/15)	🟡 87% (13/15)	🟡 67% (10/15)	🟡 47% (7/15)	🟡 93% (14/15)
easy	🟡 89% (31/35)	🟡 63% (22/35)	🟡 66% (23/35)	🟡 86% (30/35)	🟢 100% (35/35)
grafana-dashboard	🔴 0% (0/5)	🔴 0% (0/5)	🔴 0% (0/5)	🔴 0% (0/5)	🔴 0% (0/5)
hard	🔴 0% (0/5)	🔴 0% (0/5)	🔴 0% (0/5)	🔴 0% (0/5)	🔴 0% (0/5)
kubernetes	🟡 71% (25/35)	🟡 43% (15/35)	🟡 46% (16/35)	🟡 63% (22/35)	🟡 86% (30/35)
logs	🟡 60% (12/20)	🟡 70% (14/20)	🟡 50% (10/20)	🟡 35% (7/20)	🟡 70% (14/20)
loki	🟡 60% (3/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟡 80% (4/5)	🟢 100% (5/5)
medium	🟡 60% (18/30)	🟡 50% (15/30)	🟡 43% (13/30)	🟡 47% (14/30)	🟡 80% (24/30)
metrics	🔴 0% (0/5)	🔴 0% (0/5)	🔴 0% (0/5)	🔴 0% (0/5)	🔴 0% (0/5)
network	🟢 100% (5/5)	🟢 100% (5/5)	🟡 80% (4/5)	🟢 100% (5/5)	🟢 100% (5/5)
one-test	🟢 100% (5/5)	🟡 40% (2/5)	🟡 60% (3/5)	🟢 100% (5/5)	🟢 100% (5/5)
port-forward	🟡 30% (3/10)	🟡 50% (5/10)	🟡 50% (5/10)	🟡 40% (4/10)	🟡 50% (5/10)
question-answer	🔴 0% (0/5)	🔴 0% (0/5)	🔴 0% (0/5)	🔴 0% (0/5)	🔴 0% (0/5)
regression	🟡 84% (38/45)	🟡 56% (25/45)	🟡 60% (27/45)	🟡 82% (37/45)	🟢 100% (45/45)
runbooks	🟡 40% (4/10)	🟡 60% (6/10)	🟡 60% (6/10)	🟡 70% (7/10)	🟢 100% (10/10)
Overall	🟡 70% (49/70)	🟡 53% (37/70)	🟡 51% (36/70)	🟡 63% (44/70)	🟡 84% (59/70)

Raw Results

Status of all evaluations across models. Color coding:

🟢 Passing 100% (stable)
🟡 Passing 1-99%
🔴 Passing 0% (failing)
🔧 Mock data failure (missing or invalid test data)
⚠️ Setup failure (environment/infrastructure issue)
⏱️ Timeout or rate limit error
⏭️ Test skipped (e.g., known issue or precondition not met)
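
The three pass-rate colors in the legend can be derived from a pass count and a run count. The following is a minimal sketch of that mapping, not the report generator's actual code; the function name `status_cell` is hypothetical, and rounding to the nearest whole percent is an assumption that happens to match the cells in this report.

```python
def status_cell(passed: int, total: int) -> str:
    """Format a result cell such as '🟡 84% (59/70)' from pass/total counts."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("expected 0 <= passed <= total and total > 0")
    # Pick the icon from the counts, not from the rounded percentage,
    # so a near-perfect result is never shown as stable.
    if passed == total:
        icon = "🟢"  # passing 100% (stable)
    elif passed == 0:
        icon = "🔴"  # passing 0% (failing)
    else:
        icon = "🟡"  # passing 1-99%
    rate = round(100 * passed / total)
    return f"{icon} {rate}% ({passed}/{total})"


if __name__ == "__main__":
    print(status_cell(59, 70))
    print(status_cell(5, 5))
    print(status_cell(0, 5))
```

The remaining legend entries (mock data failure, setup failure, timeout, skipped) describe run outcomes rather than pass rates, so they are outside the scope of this sketch.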

Eval ID	deepseek-3.1	gpt-5	gpt-5.1	haiku-4.5	sonnet-4.5
09_crashpod 🔗	🟢	🟡	🟡	🟢	🟢
101_loki_historical_logs_pod_deleted 🔗	🟡	🟢	🟢	🟡	🟢
108_logs_nearby_lines 🔗	🔴	🔴	🔴	🟡	🔴
111_pod_names_contain_service 🔗	🟢	🔴	🟡	🟡	🟢
12_job_crashing 🔗	🟡	🟡	🟡	🟢	🟢
162_get_runbooks 🔗	🟡	🟡	🟡	🟡	🟢
176_network_policy_blocking_traffic_no_runbooks 🔗	🟢	🟢	🟡	🟢	🟢
179_grafana_big_dashboard_query 🔗	🔴	🔴	🔴	🔴	🔴
24_misconfigured_pvc 🔗	🟢	🔴	🔴	🟡	🟢
43_current_datetime_from_prompt 🔗	🟡	🟡	🟢	🟢	🟢
61_exact_match_counting 🔗	🟢	🟢	🟢	🟢	🟢
73a_time_window_anomaly 🔗	🟡	🟢	🟡	🟡	🟡
73b_time_window_anomaly 🔗	🟢	🟡	🟡	🟡	🟢
96_no_matching_runbook 🔗	🟡	🟡	🟡	🟡	🟢
SUMMARY	🟡 70% (49/70)	🟡 53% (37/70)	🟡 51% (36/70)	🟡 63% (44/70)	🟡 84% (59/70)

Detailed Raw Results

Eval ID	deepseek-3.1	gpt-5	gpt-5.1	haiku-4.5	sonnet-4.5
09_crashpod 🔗	🟢 100% (5/5) / ⏱️ 83.5s	🟡 40% (2/5) / ⏱️ 15.1s / 💰 $0.04	🟡 60% (3/5) / ⏱️ 49.8s / 💰 $0.06	🟢 100% (5/5) / ⏱️ 26.4s / 💰 $0.03	🟢 100% (5/5) / ⏱️ 38.1s / 💰 $0.13
101_loki_historical_logs_pod_deleted 🔗	🟡 60% (3/5) / ⏱️ 110.9s	🟢 100% (5/5) / ⏱️ 29.9s / 💰 $0.08	🟢 100% (5/5) / ⏱️ 196.8s / 💰 $0.24	🟡 80% (4/5) / ⏱️ 36.0s / 💰 $0.05	🟢 100% (5/5) / ⏱️ 45.9s / 💰 $0.17
108_logs_nearby_lines 🔗	🔴 0% (0/5) / ⏱️ 134.4s	🔴 0% (0/5) / ⏱️ 33.2s / 💰 $0.09	🔴 0% (0/5) / ⏱️ 131.9s / 💰 $0.18	🟡 20% (1/5) / ⏱️ 36.5s / 💰 $0.06	🔴 0% (0/5) / ⏱️ 63.4s / 💰 $0.21
111_pod_names_contain_service 🔗	🟢 100% (5/5) / ⏱️ 95.7s	🔴 0% (0/5) / ⏱️ 12.4s / 💰 $0.02	🟡 40% (2/5) / ⏱️ 67.5s / 💰 $0.07	🟡 80% (4/5) / ⏱️ 24.0s / 💰 $0.03	🟢 100% (5/5) / ⏱️ 53.5s / 💰 $0.23
12_job_crashing 🔗	🟡 80% (4/5) / ⏱️ 95.0s	🟡 20% (1/5) / ⏱️ 31.4s / 💰 $0.07	🟡 20% (1/5) / ⏱️ 85.4s / 💰 $0.11	🟢 100% (5/5) / ⏱️ 31.0s / 💰 $0.04	🟢 100% (5/5) / ⏱️ 48.8s / 💰 $0.16
162_get_runbooks 🔗	🟡 40% (2/5) / ⏱️ 82.2s	🟡 60% (3/5) / ⏱️ 159.8s / 💰 $0.13	🟡 40% (2/5) / ⏱️ 96.1s / 💰 $0.12	🟡 60% (3/5) / ⏱️ 36.0s / 💰 $0.07	🟢 100% (5/5) / ⏱️ 50.5s / 💰 $0.21
176_network_policy_blocking_traffic_no_runbooks 🔗	🟢 100% (5/5) / ⏱️ 93.0s	🟢 100% (5/5) / ⏱️ 35.3s / 💰 $0.09	🟡 80% (4/5) / ⏱️ 91.9s / 💰 $0.11	🟢 100% (5/5) / ⏱️ 37.8s / 💰 $0.06	🟢 100% (5/5) / ⏱️ 49.4s / 💰 $0.21
179_grafana_big_dashboard_query 🔗	🔴 0% (0/5) / ⏱️ 94.4s	🔴 0% (0/5) / ⏱️ 21.2s / 💰 $0.04	🔴 0% (0/5) / ⏱️ 175.0s / 💰 $0.20	🔴 0% (0/5) / ⏱️ 29.5s / 💰 $0.04	🔴 0% (0/5) / ⏱️ 34.4s / 💰 $0.13
24_misconfigured_pvc 🔗	🟢 100% (5/5) / ⏱️ 103.6s	🔴 0% (0/5) / ⏱️ 4.3s / 💰 $0.00	🔴 0% (0/5) / ⏱️ 16.9s / 💰 $0.01	🟡 20% (1/5) / ⏱️ 9.9s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 42.5s / 💰 $0.14
43_current_datetime_from_prompt 🔗	🟡 80% (4/5) / ⏱️ 12.9s	🟡 80% (4/5) / ⏱️ 7.4s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 24.0s / 💰 $0.02	🟢 100% (5/5) / ⏱️ 6.3s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 9.8s / 💰 $0.01
61_exact_match_counting 🔗	🟢 100% (5/5) / ⏱️ 35.6s	🟢 100% (5/5) / ⏱️ 16.4s / 💰 $0.04	🟢 100% (5/5) / ⏱️ 49.8s / 💰 $0.06	🟢 100% (5/5) / ⏱️ 14.0s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 19.1s / 💰 $0.06
73a_time_window_anomaly 🔗	🟡 80% (4/5) / ⏱️ 95.6s	🟢 100% (5/5) / ⏱️ 27.5s / 💰 $0.07	🟡 80% (4/5) / ⏱️ 93.0s / 💰 $0.13	🟡 20% (1/5) / ⏱️ 31.8s / 💰 $0.05	🟡 80% (4/5) / ⏱️ 48.6s / 💰 $0.20
73b_time_window_anomaly 🔗	🟢 100% (5/5) / ⏱️ 93.2s	🟡 80% (4/5) / ⏱️ 28.2s / 💰 $0.07	🟡 20% (1/5) / ⏱️ 37.3s / 💰 $0.03	🟡 20% (1/5) / ⏱️ 32.9s / 💰 $0.05	🟢 100% (5/5) / ⏱️ 46.3s / 💰 $0.18
96_no_matching_runbook 🔗	🟡 40% (2/5) / ⏱️ 99.9s	🟡 60% (3/5) / ⏱️ 41.0s / 💰 $0.13	🟡 80% (4/5) / ⏱️ 132.4s / 💰 $0.17	🟡 80% (4/5) / ⏱️ 33.5s / 💰 $0.06	🟢 100% (5/5) / ⏱️ 60.6s / 💰 $0.30
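
The per-eval pass counts in this table roll up to the per-model totals reported in the summary rows (14 evals × 5 iterations = 70 runs per model). The sketch below reproduces that roll-up from the counts listed above; the data structure and function name are illustrative only, not the benchmark's own code.

```python
ITERATIONS = 5  # from the report header

# Passes per eval, in table order (09_crashpod ... 96_no_matching_runbook).
PASSES = {
    "deepseek-3.1": [5, 3, 0, 5, 4, 2, 5, 0, 5, 4, 5, 4, 5, 2],
    "gpt-5":        [2, 5, 0, 0, 1, 3, 5, 0, 0, 4, 5, 5, 4, 3],
    "gpt-5.1":      [3, 5, 0, 2, 1, 2, 4, 0, 0, 5, 5, 4, 1, 4],
    "haiku-4.5":    [5, 4, 1, 4, 5, 3, 5, 0, 1, 5, 5, 1, 1, 4],
    "sonnet-4.5":   [5, 5, 0, 5, 5, 5, 5, 0, 5, 5, 5, 4, 5, 5],
}


def rollup(passes: list[int], iterations: int = ITERATIONS) -> tuple[int, int]:
    """Return (passed, total) across all evals for one model."""
    return sum(passes), len(passes) * iterations


if __name__ == "__main__":
    for model, counts in PASSES.items():
        passed, total = rollup(counts)
        print(f"{model}: {passed}/{total} ({round(100 * passed / total)}%)")
```

Each total agrees with the corresponding entry in the Model Accuracy Comparison table.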

Results are automatically generated and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment local-benchmark-20260101-140005.