⚡ January 29, 2026

Generated: 2026-01-29 09:48 UTC
Total Duration: 56m 56s
Iterations: 1
Judge (classifier) model: gpt-4.1

Fast Benchmark

Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes

HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.

If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.

Model Accuracy Comparison

Model	Pass	Fail	Total	Success Rate
deepseek-chat	11	3	14	🟡 79% (11/14)
deepseek-reasoner	9	5	14	🟡 64% (9/14)
gemini-3-flash-preview	7	7	14	🟡 50% (7/14)
gemini-3-pro-preview	8	6	14	🟡 57% (8/14)
gpt-5.2-high-reasoning	9	5	14	🟡 64% (9/14)
haiku-4.5	10	4	14	🟡 71% (10/14)
kimi-2.5-openrouter	11	3	14	🟡 79% (11/14)
opus-4.5	12	2	14	🟡 86% (12/14)
sonnet-4.5	13	1	14	🟡 93% (13/14)
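The Success Rate column is Pass divided by Total, rounded to the nearest whole percent. A minimal sketch that recomputes and ranks it from the Pass and Fail counts above; the `results` dict and the `success_rate` helper are illustrative names, not part of the benchmark tooling:

```python
# Pass/Fail counts copied from the Model Accuracy Comparison table.
results = {
    "deepseek-chat": (11, 3),
    "deepseek-reasoner": (9, 5),
    "gemini-3-flash-preview": (7, 7),
    "gemini-3-pro-preview": (8, 6),
    "gpt-5.2-high-reasoning": (9, 5),
    "haiku-4.5": (10, 4),
    "kimi-2.5-openrouter": (11, 3),
    "opus-4.5": (12, 2),
    "sonnet-4.5": (13, 1),
}


def success_rate(passed: int, failed: int) -> int:
    """Return the pass rate as a whole-number percentage."""
    total = passed + failed
    return round(100 * passed / total)


# Rank models from highest to lowest success rate (ties keep table order).
ranking = sorted(results, key=lambda m: success_rate(*results[m]), reverse=True)

for model in ranking:
    passed, failed = results[model]
    print(f"{model}: {success_rate(passed, failed)}% ({passed}/{passed + failed})")
```

With only 14 evals per model, a single pass or fail moves the rate by about 7 percentage points, so small gaps between models should be read with caution.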

Model Cost Comparison

Model	Tests	Avg Cost	Min Cost	Max Cost	Total Cost
deepseek-chat	14	$0.01	$0.00	$0.03	$0.20
deepseek-reasoner	14	$0.02	$0.00	$0.03	$0.23
gemini-3-flash-preview	13	$0.06	$0.02	$0.16	$0.80
gemini-3-pro-preview	11	$0.14	$0.04	$0.25	$1.59
gpt-5.2-high-reasoning	14	$0.25	$0.01	$0.75	$3.56
haiku-4.5	14	$0.05	$0.02	$0.10	$0.76
opus-4.5	14	$0.25	$0.11	$0.38	$3.45
sonnet-4.5	14	$0.19	$0.07	$0.31	$2.64
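The Tests column in this table appears to count only runs that reported a cost, which would explain why two models show fewer than 14 tests and why kimi-2.5-openrouter, whose runs list no cost, is absent. A sketch of that aggregation under this assumption, using the gemini-3-pro-preview per-eval costs from the Detailed Raw Results table with `None` for runs that list no cost. The `summarize_costs` helper is hypothetical, and because the per-eval figures are already rounded to cents, the recomputed total can differ from the reported $1.59 by a cent:

```python
from typing import Optional

# gemini-3-pro-preview costs in eval order; None = no cost listed for that run.
gemini_pro_costs = [
    0.09, None, 0.22, 0.06, 0.24, 0.20, None,
    0.09, 0.09, 0.04, 0.07, None, 0.23, 0.25,
]


def summarize_costs(costs: list[Optional[float]]) -> Optional[dict]:
    """Aggregate cost statistics over the runs that reported a cost."""
    known = [c for c in costs if c is not None]
    if not known:
        return None  # nothing to report, e.g. a model with no cost data
    total = sum(known)
    return {
        "tests": len(known),
        "avg": total / len(known),
        "min": min(known),
        "max": max(known),
        "total": total,
    }


summary = summarize_costs(gemini_pro_costs)
print(f"tests={summary['tests']} avg=${summary['avg']:.2f} "
      f"min=${summary['min']:.2f} max=${summary['max']:.2f}")
```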

Model Latency Comparison

Model	Avg (s)	Min (s)	Max (s)	P50 (s)	P95 (s)
deepseek-chat	131.8	52.5	205.2	143.4	205.2
deepseek-reasoner	222.4	30.3	430.3	255.2	430.3
gemini-3-flash-preview	38.0	11.3	104.3	34.6	104.3
gemini-3-pro-preview	69.8	7.0	128.0	76.1	128.0
gpt-5.2-high-reasoning	256.1	10.7	836.0	254.6	836.0
haiku-4.5	27.4	7.6	56.3	27.8	56.3
kimi-2.5-openrouter	70.7	12.2	218.4	66.0	218.4
opus-4.5	38.3	11.4	62.5	38.1	62.5
sonnet-4.5	39.7	7.6	66.8	42.5	66.8
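With 14 samples per model, P95 coincides with Max in every row. The reported P50 and P95 values are consistent with indexing the sorted durations at position `int(n * p)`. That indexing rule is an assumption inferred from the numbers, not a confirmed description of how the report is generated. A sketch using the sonnet-4.5 durations from the Detailed Raw Results table (the `percentile` helper is an illustrative name):

```python
# sonnet-4.5 run durations in seconds, copied from the Detailed Raw Results table.
sonnet_durations = [
    37.2, 42.5, 58.4, 35.8, 40.4, 53.0, 44.7,
    21.5, 38.8, 7.6, 19.5, 44.3, 45.8, 66.8,
]


def percentile(values: list[float], p: float) -> float:
    """Pick the element at index int(n * p) of the sorted values (assumed rule)."""
    ordered = sorted(values)
    index = min(int(len(ordered) * p), len(ordered) - 1)  # clamp for p == 1.0
    return ordered[index]


avg = round(sum(sonnet_durations) / len(sonnet_durations), 1)
p50 = percentile(sonnet_durations, 0.50)
p95 = percentile(sonnet_durations, 0.95)
print(f"avg={avg}s p50={p50}s p95={p95}s max={max(sonnet_durations)}s")
```

Under this rule, P95 only separates from Max once a model has at least 20 samples, so the P95 column carries no extra information at the current sample size.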

Performance by Tag

Success rate by test category and model:

Tag	deepseek-chat	deepseek-reasoner	gemini-3-flash-preview	gemini-3-pro-preview	gpt-5.2-high-reasoning	haiku-4.5	kimi-2.5-openrouter	opus-4.5	sonnet-4.5
benchmark	🟡 60% (3/5)	🟡 60% (3/5)	🟡 60% (3/5)	🟡 60% (3/5)	🟡 60% (3/5)	🟡 60% (3/5)	🟡 60% (3/5)	🟡 60% (3/5)	🟡 80% (4/5)
context_window	🟢 100% (2/2)	🟡 50% (1/2)	🟡 50% (1/2)	🔴 0% (0/2)	🟢 100% (2/2)	🟡 50% (1/2)	🟡 50% (1/2)	🟢 100% (2/2)	🟢 100% (2/2)
counting	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)
datetime	🟡 67% (2/3)	🟡 67% (2/3)	🟡 33% (1/3)	🔴 0% (0/3)	🟢 100% (3/3)	🟡 67% (2/3)	🟡 67% (2/3)	🟢 100% (3/3)	🟢 100% (3/3)
easy	🟡 86% (6/7)	🟡 57% (4/7)	🟡 43% (3/7)	🟡 57% (4/7)	🟡 71% (5/7)	🟡 86% (6/7)	🟢 100% (7/7)	🟢 100% (7/7)	🟢 100% (7/7)
grafana-dashboard	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)
hard	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)
kubernetes	🟢 100% (7/7)	🟡 57% (4/7)	🟡 57% (4/7)	🟡 57% (4/7)	🟡 57% (4/7)	🟡 71% (5/7)	🟡 86% (6/7)	🟢 100% (7/7)	🟢 100% (7/7)
logs	🟡 75% (3/4)	🟡 25% (1/4)	🟡 50% (2/4)	🟡 25% (1/4)	🟡 75% (3/4)	🟡 50% (2/4)	🟡 50% (2/4)	🟡 75% (3/4)	🟡 75% (3/4)
loki	🟢 100% (1/1)	🔴 0% (0/1)	🔴 0% (0/1)	🔴 0% (0/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)
medium	🟡 67% (4/6)	🟡 67% (4/6)	🟡 50% (3/6)	🟡 50% (3/6)	🟡 50% (3/6)	🟡 50% (3/6)	🟡 50% (3/6)	🟡 67% (4/6)	🟡 83% (5/6)
metrics	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)
network	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🔴 0% (0/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)
one-test	🟢 100% (1/1)	🔴 0% (0/1)	🟢 100% (1/1)	🟢 100% (1/1)	🔴 0% (0/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)
port-forward	🟢 100% (2/2)	🟡 50% (1/2)	🟡 50% (1/2)	🟡 50% (1/2)	🟢 100% (2/2)	🟢 100% (2/2)	🟢 100% (2/2)	🟢 100% (2/2)	🟢 100% (2/2)
question-answer	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)	🟢 100% (1/1)
regression	🟡 89% (8/9)	🟡 67% (6/9)	🟡 44% (4/9)	🟡 56% (5/9)	🟡 67% (6/9)	🟡 78% (7/9)	🟡 89% (8/9)	🟢 100% (9/9)	🟢 100% (9/9)
runbooks	🟡 50% (1/2)	🟢 100% (2/2)	🔴 0% (0/2)	🟡 50% (1/2)	🟡 50% (1/2)	🟡 50% (1/2)	🟡 50% (1/2)	🟡 50% (1/2)	🟢 100% (2/2)
Overall	🟡 79% (11/14)	🟡 64% (9/14)	🟡 50% (7/14)	🟡 57% (8/14)	🟡 64% (9/14)	🟡 71% (10/14)	🟡 79% (11/14)	🟡 86% (12/14)	🟡 93% (13/14)

Raw Results

Status of all evaluations across models. Color coding:

🟢 Passing 100% (stable)
🟡 Passing 1-99%
🔴 Passing 0% (failing)
🔧 Mock data failure (missing or invalid test data)
⚠️ Setup failure (environment/infrastructure issue)
⏱️ Timeout or rate limit error
⏭️ Test skipped (e.g., known issue or precondition not met)
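The three pass-rate colors in this legend can be expressed as a small mapping from pass counts to an icon. The sketch below covers only the 🟢/🟡/🔴 cases; the `status_icon` and `format_cell` names are hypothetical, and the 🔧, ⚠️, ⏱️, and ⏭️ states depend on run metadata that this report does not show:

```python
def status_icon(passed: int, total: int) -> str:
    """Map a pass count to the legend's color icon."""
    if total <= 0:
        raise ValueError("total must be positive")
    if passed == total:
        return "🟢"  # passing 100% (stable)
    if passed == 0:
        return "🔴"  # passing 0% (failing)
    return "🟡"      # passing 1-99%


def format_cell(passed: int, total: int) -> str:
    """Render a table cell such as '🟡 93% (13/14)'."""
    pct = round(100 * passed / total)
    return f"{status_icon(passed, total)} {pct}% ({passed}/{total})"


print(format_cell(13, 14))
print(format_cell(1, 1))
print(format_cell(0, 2))
```

Note that 🟡 spans everything from a single pass to a single failure, so 93% (13/14) and 25% (1/4) share the same icon.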

Eval ID	deepseek-chat	deepseek-reasoner	gemini-3-flash-preview	gemini-3-pro-preview	gpt-5.2-high-reasoning	haiku-4.5	kimi-2.5-openrouter	opus-4.5	sonnet-4.5
09_crashpod 🔗	🟢	🔴	🟢	🟢	🔴	🟢	🟢	🟢	🟢
101_loki_historical_logs_pod_deleted 🔗	🟢	🔴	🔴	🔴	🟢	🟢	🟢	🟢	🟢
108_logs_nearby_lines 🔗	🔴	🔴	🟢	🟢	🔴	🔴	🔴	🔴	🔴
111_pod_names_contain_service 🔗	🟢	🟢	🟢	🟢	🔴	🟢	🟢	🟢	🟢
12_job_crashing 🔗	🟢	🟢	🔴	🟢	🟢	🟢	🟢	🟢	🟢
162_get_runbooks 🔗	🟢	🟢	🔴	🔴	🟢	🔴	🔴	🟢	🟢
176_network_policy_blocking_traffic_no_runbooks 🔗	🟢	🟢	🟢	🔴	🟢	🟢	🟢	🟢	🟢
179_grafana_big_dashboard_query 🔗	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🟢
24_misconfigured_pvc 🔗	🟢	🔴	🔴	🟢	🔴	🔴	🟢	🟢	🟢
43_current_datetime_from_prompt 🔗	🔴	🟢	🔴	🔴	🟢	🟢	🟢	🟢	🟢
61_exact_match_counting 🔗	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🟢
73a_time_window_anomaly 🔗	🟢	🔴	🟢	🔴	🟢	🟢	🔴	🟢	🟢
73b_time_window_anomaly 🔗	🟢	🟢	🔴	🔴	🟢	🔴	🟢	🟢	🟢
96_no_matching_runbook 🔗	🔴	🟢	🔴	🟢	🔴	🟢	🟢	🔴	🟢
SUMMARY	🟡 79% (11/14)	🟡 64% (9/14)	🟡 50% (7/14)	🟡 57% (8/14)	🟡 64% (9/14)	🟡 71% (10/14)	🟡 79% (11/14)	🟡 86% (12/14)	🟡 93% (13/14)

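Detailed Raw Results

Each cell shows the pass rate, the run duration in seconds (⏱️), and the cost in USD (💰) where one was reported. In this table ⏱️ marks elapsed time, not the timeout status from the color-coding legend.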

Eval ID	deepseek-chat	deepseek-reasoner	gemini-3-flash-preview	gemini-3-pro-preview	gpt-5.2-high-reasoning	haiku-4.5	kimi-2.5-openrouter	opus-4.5	sonnet-4.5
09_crashpod 🔗	🟢 100% (1/1) / ⏱️ 152.2s / 💰 $0.01	🔴 0% (0/1) / ⏱️ 44.8s / 💰 $0.00	🟢 100% (1/1) / ⏱️ 25.8s / 💰 $0.04	🟢 100% (1/1) / ⏱️ 47.2s / 💰 $0.09	🔴 0% (0/1) / ⏱️ 46.9s / 💰 $0.03	🟢 100% (1/1) / ⏱️ 27.8s / 💰 $0.06	🟢 100% (1/1) / ⏱️ 62.3s	🟢 100% (1/1) / ⏱️ 36.5s / 💰 $0.21	🟢 100% (1/1) / ⏱️ 37.2s / 💰 $0.18
101_loki_historical_logs_pod_deleted 🔗	🟢 100% (1/1) / ⏱️ 157.2s / 💰 $0.02	🔴 0% (0/1) / ⏱️ 430.3s / 💰 $0.03	🔴 0% (0/1) / ⏱️ 104.3s	🔴 0% (0/1) / ⏱️ 79.4s	🟢 100% (1/1) / ⏱️ 590.4s / 💰 $0.62	🟢 100% (1/1) / ⏱️ 32.0s / 💰 $0.06	🟢 100% (1/1) / ⏱️ 97.7s	🟢 100% (1/1) / ⏱️ 36.4s / 💰 $0.20	🟢 100% (1/1) / ⏱️ 42.5s / 💰 $0.18
108_logs_nearby_lines 🔗	🔴 0% (0/1) / ⏱️ 143.4s / 💰 $0.02	🔴 0% (0/1) / ⏱️ 323.1s / 💰 $0.03	🟢 100% (1/1) / ⏱️ 52.2s / 💰 $0.08	🟢 100% (1/1) / ⏱️ 97.4s / 💰 $0.22	🔴 0% (0/1) / ⏱️ 493.6s / 💰 $0.44	🔴 0% (0/1) / ⏱️ 31.3s / 💰 $0.06	🔴 0% (0/1) / ⏱️ 34.6s	🔴 0% (0/1) / ⏱️ 49.2s / 💰 $0.38	🔴 0% (0/1) / ⏱️ 58.4s / 💰 $0.24
111_pod_names_contain_service 🔗	🟢 100% (1/1) / ⏱️ 130.2s / 💰 $0.01	🟢 100% (1/1) / ⏱️ 193.3s / 💰 $0.01	🟢 100% (1/1) / ⏱️ 22.8s / 💰 $0.03	🟢 100% (1/1) / ⏱️ 39.4s / 💰 $0.06	🔴 0% (0/1) / ⏱️ 29.5s / 💰 $0.01	🟢 100% (1/1) / ⏱️ 25.9s / 💰 $0.06	🟢 100% (1/1) / ⏱️ 73.3s	🟢 100% (1/1) / ⏱️ 37.6s / 💰 $0.23	🟢 100% (1/1) / ⏱️ 35.8s / 💰 $0.15
12_job_crashing 🔗	🟢 100% (1/1) / ⏱️ 192.4s / 💰 $0.02	🟢 100% (1/1) / ⏱️ 226.7s / 💰 $0.01	🔴 0% (0/1) / ⏱️ 51.3s / 💰 $0.10	🟢 100% (1/1) / ⏱️ 128.0s / 💰 $0.24	🟢 100% (1/1) / ⏱️ 348.2s / 💰 $0.32	🟢 100% (1/1) / ⏱️ 35.9s / 💰 $0.07	🟢 100% (1/1) / ⏱️ 81.4s	🟢 100% (1/1) / ⏱️ 38.1s / 💰 $0.24	🟢 100% (1/1) / ⏱️ 40.4s / 💰 $0.20
162_get_runbooks 🔗	🟢 100% (1/1) / ⏱️ 129.1s / 💰 $0.02	🟢 100% (1/1) / ⏱️ 265.0s / 💰 $0.03	🔴 0% (0/1) / ⏱️ 39.2s / 💰 $0.07	🔴 0% (0/1) / ⏱️ 62.4s / 💰 $0.20	🟢 100% (1/1) / ⏱️ 261.5s / 💰 $0.30	🔴 0% (0/1) / ⏱️ 27.7s / 💰 $0.06	🔴 0% (0/1) / ⏱️ 41.2s	🟢 100% (1/1) / ⏱️ 44.0s / 💰 $0.32	🟢 100% (1/1) / ⏱️ 53.0s / 💰 $0.25
176_network_policy_blocking_traffic_no_runbooks 🔗	🟢 100% (1/1) / ⏱️ 167.9s / 💰 $0.02	🟢 100% (1/1) / ⏱️ 307.8s / 💰 $0.02	🟢 100% (1/1) / ⏱️ 41.8s / 💰 $0.05	🔴 0% (0/1) / ⏱️ 7.0s	🟢 100% (1/1) / ⏱️ 254.6s / 💰 $0.21	🟢 100% (1/1) / ⏱️ 37.0s / 💰 $0.06	🟢 100% (1/1) / ⏱️ 85.7s	🟢 100% (1/1) / ⏱️ 52.7s / 💰 $0.32	🟢 100% (1/1) / ⏱️ 44.7s / 💰 $0.21
179_grafana_big_dashboard_query 🔗	🟢 100% (1/1) / ⏱️ 62.9s / 💰 $0.01	🟢 100% (1/1) / ⏱️ 154.6s / 💰 $0.01	🟢 100% (1/1) / ⏱️ 16.5s / 💰 $0.03	🟢 100% (1/1) / ⏱️ 40.4s / 💰 $0.09	🟢 100% (1/1) / ⏱️ 46.5s / 💰 $0.09	🟢 100% (1/1) / ⏱️ 23.8s / 💰 $0.05	🟢 100% (1/1) / ⏱️ 46.0s	🟢 100% (1/1) / ⏱️ 24.1s / 💰 $0.19	🟢 100% (1/1) / ⏱️ 21.5s / 💰 $0.12
24_misconfigured_pvc 🔗	🟢 100% (1/1) / ⏱️ 97.1s / 💰 $0.01	🔴 0% (0/1) / ⏱️ 30.3s / 💰 $0.00	🔴 0% (0/1) / ⏱️ 17.8s / 💰 $0.04	🟢 100% (1/1) / ⏱️ 76.1s / 💰 $0.09	🔴 0% (0/1) / ⏱️ 10.7s / 💰 $0.01	🔴 0% (0/1) / ⏱️ 7.6s / 💰 $0.02	🟢 100% (1/1) / ⏱️ 66.0s	🟢 100% (1/1) / ⏱️ 42.1s / 💰 $0.26	🟢 100% (1/1) / ⏱️ 38.8s / 💰 $0.18
43_current_datetime_from_prompt 🔗	🔴 0% (0/1) / ⏱️ 52.5s / 💰 $0.00	🟢 100% (1/1) / ⏱️ 41.5s / 💰 $0.00	🔴 0% (0/1) / ⏱️ 15.2s / 💰 $0.03	🔴 0% (0/1) / ⏱️ 21.2s / 💰 $0.04	🟢 100% (1/1) / ⏱️ 33.0s / 💰 $0.03	🟢 100% (1/1) / ⏱️ 8.0s / 💰 $0.02	🟢 100% (1/1) / ⏱️ 12.2s	🟢 100% (1/1) / ⏱️ 11.4s / 💰 $0.11	🟢 100% (1/1) / ⏱️ 7.6s / 💰 $0.07
61_exact_match_counting 🔗	🟢 100% (1/1) / ⏱️ 68.5s / 💰 $0.01	🟢 100% (1/1) / ⏱️ 97.5s / 💰 $0.01	🟢 100% (1/1) / ⏱️ 11.3s / 💰 $0.02	🟢 100% (1/1) / ⏱️ 104.4s / 💰 $0.07	🟢 100% (1/1) / ⏱️ 74.2s / 💰 $0.09	🟢 100% (1/1) / ⏱️ 15.4s / 💰 $0.04	🟢 100% (1/1) / ⏱️ 14.2s	🟢 100% (1/1) / ⏱️ 29.3s / 💰 $0.16	🟢 100% (1/1) / ⏱️ 19.5s / 💰 $0.11
73a_time_window_anomaly 🔗	🟢 100% (1/1) / ⏱️ 104.4s / 💰 $0.01	🔴 0% (0/1) / ⏱️ 320.8s / 💰 $0.02	🟢 100% (1/1) / ⏱️ 79.6s / 💰 $0.16	🔴 0% (0/1) / ⏱️ 55.6s	🟢 100% (1/1) / ⏱️ 309.2s / 💰 $0.45	🟢 100% (1/1) / ⏱️ 27.1s / 💰 $0.05	🔴 0% (0/1) / ⏱️ 104.4s	🟢 100% (1/1) / ⏱️ 34.1s / 💰 $0.22	🟢 100% (1/1) / ⏱️ 44.3s / 💰 $0.22
73b_time_window_anomaly 🔗	🟢 100% (1/1) / ⏱️ 181.8s / 💰 $0.02	🟢 100% (1/1) / ⏱️ 255.2s / 💰 $0.02	🔴 0% (0/1) / ⏱️ 19.0s / 💰 $0.04	🔴 0% (0/1) / ⏱️ 101.4s / 💰 $0.23	🟢 100% (1/1) / ⏱️ 251.9s / 💰 $0.23	🔴 0% (0/1) / ⏱️ 28.2s / 💰 $0.05	🟢 100% (1/1) / ⏱️ 52.1s	🟢 100% (1/1) / ⏱️ 38.8s / 💰 $0.23	🟢 100% (1/1) / ⏱️ 45.8s / 💰 $0.20
96_no_matching_runbook 🔗	🔴 0% (0/1) / ⏱️ 205.2s / 💰 $0.03	🟢 100% (1/1) / ⏱️ 422.5s / 💰 $0.03	🔴 0% (0/1) / ⏱️ 34.6s / 💰 $0.10	🟢 100% (1/1) / ⏱️ 117.1s / 💰 $0.25	🔴 0% (0/1) / ⏱️ 836.0s / 💰 $0.75	🟢 100% (1/1) / ⏱️ 56.3s / 💰 $0.10	🟢 100% (1/1) / ⏱️ 218.4s	🔴 0% (0/1) / ⏱️ 62.5s / 💰 $0.37	🟢 100% (1/1) / ⏱️ 66.8s / 💰 $0.31

Results are automatically generated and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-21471579810.