⚑ January 22, 2026

Generated: 2026-01-22 07:43 UTC
Total Duration: 1h 27m 34s
Iterations: 1
Judge (classifier) model: gpt-4.1

Fast Benchmark

Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes

HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.

If you find scenarios where HolmesGPT performs poorly, please consider contributing them as evals to the benchmark.
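
As a rough illustration of how the "regression or benchmark" marker expression above could select scenarios, here is a hedged, pytest-style sketch. Everything in it is hypothetical (the test name, the `ask_holmes` fixture, and the canned answer are not taken from the HolmesGPT suite); it only shows the marker mechanics, not the project's real eval format.

```python
# Purely illustrative: an eval-like test tagged so that the marker expression
# "regression or benchmark" would select it. All names here are hypothetical.
import pytest


@pytest.fixture
def ask_holmes():
    # Stand-in for a fixture that would send the question to HolmesGPT and
    # return its answer; replaced with a canned response for illustration.
    return lambda question: "The pod is in CrashLoopBackOff because ..."


@pytest.mark.regression
@pytest.mark.benchmark
def test_crashloop_pod_diagnosis(ask_holmes):
    # A quick regression-style check: the diagnosis should name the failure mode.
    answer = ask_holmes("Why is pod payment-api crashing?")
    assert "CrashLoopBackOff" in answer
```

Running `pytest -m "regression or benchmark"` would pick up tests tagged this way (custom markers should be registered in the pytest configuration to avoid warnings).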

Model Accuracy Comparison

| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| eu.anthropic.claude-haiku-4-5-20251001-v1:0 | 49 | 21 | 0 | 70 | 🟑 70% (49/70) |
| eu.anthropic.claude-opus-4-5-20251101-v1:0 | 58 | 12 | 0 | 70 | 🟑 83% (58/70) |
| eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 59 | 11 | 0 | 70 | 🟑 84% (59/70) |
| gemini/gemini-3-flash-preview | 55 | 15 | 0 | 70 | 🟑 79% (55/70) |
| gemini/gemini-3-pro-preview | 48 | 22 | 0 | 70 | 🟑 69% (48/70) |
| gpt-5.2 | 50 | 20 | 0 | 70 | 🟑 71% (50/70) |
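
For clarity, the Success Rate column is simply pass count over total, rounded to the nearest percent. A minimal sketch that reproduces the formatting above (the helper name is ours, not part of the benchmark tooling):

```python
def success_rate(passed: int, total: int) -> str:
    """Format a pass count as the percentage string shown in the accuracy table."""
    pct = round(100 * passed / total)
    return f"{pct}% ({passed}/{total})"


# e.g. success_rate(49, 70) -> "70% (49/70)", matching the claude-haiku row above
```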

Model Cost Comparison

| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| eu.anthropic.claude-haiku-4-5-20251001-v1:0 | 70 | $0.05 | $0.00 | $0.11 | $3.18 |
| eu.anthropic.claude-opus-4-5-20251101-v1:0 | 70 | $0.20 | $0.01 | $0.49 | $13.73 |
| eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 70 | $0.16 | $0.01 | $0.30 | $11.02 |
| gemini/gemini-3-flash-preview | 64 | $0.06 | $0.02 | $0.20 | $4.03 |
| gemini/gemini-3-pro-preview | 69 | $0.14 | $0.01 | $0.45 | $9.53 |
| gpt-5.2 | 70 | $0.09 | $0.00 | $0.26 | $6.34 |

Model Latency Comparison

| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| eu.anthropic.claude-haiku-4-5-20251001-v1:0 | 26.8 | 3.8 | 53.6 | 27.1 | 51.4 |
| eu.anthropic.claude-opus-4-5-20251101-v1:0 | 43.4 | 6.2 | 209.2 | 40.5 | 75.3 |
| eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 39.6 | 4.9 | 65.4 | 42.6 | 60.4 |
| gemini/gemini-3-flash-preview | 67.2 | 9.8 | 437.7 | 33.3 | 366.5 |
| gemini/gemini-3-pro-preview | 72.3 | 11.7 | 290.0 | 58.6 | 169.4 |
| gpt-5.2 | 36.2 | 4.0 | 93.9 | 34.8 | 76.6 |

Performance by Tag

Success rate by test category and model:

| Tag | eu.anthropic.claude-haiku-4-5-20251001-v1:0 | eu.anthropic.claude-opus-4-5-20251101-v1:0 | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | gemini/gemini-3-flash-preview | gemini/gemini-3-pro-preview | gpt-5.2 | Warnings |
|---|---|---|---|---|---|---|---|
| benchmark | 🟑 60% (15/25) | 🟑 68% (17/25) | 🟑 72% (18/25) | 🟑 72% (18/25) | 🟑 64% (16/25) | 🟑 64% (16/25) | |
| context_window | 🟑 50% (5/10) | 🟒 100% (10/10) | 🟑 80% (8/10) | 🟑 90% (9/10) | 🟑 50% (5/10) | 🟑 90% (9/10) | |
| counting | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | |
| datetime | 🟑 67% (10/15) | 🟒 100% (15/15) | 🟑 87% (13/15) | 🟑 93% (14/15) | 🟑 67% (10/15) | 🟑 87% (13/15) | |
| easy | 🟑 74% (26/35) | 🟑 89% (31/35) | 🟑 89% (31/35) | 🟑 86% (30/35) | 🟑 77% (27/35) | 🟑 71% (25/35) | |
| grafana-dashboard | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | |
| hard | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | |
| kubernetes | 🟑 69% (24/35) | 🟑 89% (31/35) | 🟑 89% (31/35) | 🟑 80% (28/35) | 🟑 63% (22/35) | 🟑 83% (29/35) | |
| logs | 🟑 30% (6/20) | 🟑 55% (11/20) | 🟑 45% (9/20) | 🟑 70% (14/20) | 🟑 50% (10/20) | 🟑 50% (10/20) | |
| loki | 🟑 20% (1/5) | 🟑 20% (1/5) | 🟑 20% (1/5) | 🟑 40% (2/5) | 🟑 20% (1/5) | 🟑 20% (1/5) | |
| medium | 🟑 60% (18/30) | 🟑 73% (22/30) | 🟑 77% (23/30) | 🟑 67% (20/30) | 🟑 53% (16/30) | 🟑 67% (20/30) | |
| metrics | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | |
| network | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟑 80% (4/5) | 🟑 20% (1/5) | 🟒 100% (5/5) | |
| one-test | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | |
| port-forward | 🟑 60% (6/10) | 🟑 60% (6/10) | 🟑 60% (6/10) | 🟑 70% (7/10) | 🟑 60% (6/10) | 🟑 60% (6/10) | |
| question-answer | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | 🟒 100% (5/5) | |
| regression | 🟑 76% (34/45) | 🟑 91% (41/45) | 🟑 91% (41/45) | 🟑 82% (37/45) | 🟑 71% (32/45) | 🟑 76% (34/45) | |
| runbooks | 🟑 80% (8/10) | 🟑 70% (7/10) | 🟒 100% (10/10) | 🟑 30% (3/10) | 🟑 20% (2/10) | 🟑 60% (6/10) | |
| Overall | 🟑 70% (49/70) | 🟑 83% (58/70) | 🟑 84% (59/70) | 🟑 79% (55/70) | 🟑 69% (48/70) | 🟑 71% (50/70) | |

Raw Results

Status of all evaluations across models. Color coding:

  • 🟒 Passing 100% (stable)
  • 🟑 Passing 1-99%
  • πŸ”΄ Passing 0% (failing)
  • πŸ”§ Mock data failure (missing or invalid test data)
  • ⚠️ Setup failure (environment/infrastructure issue)
  • ⏱️ Timeout or rate limit error
  • ⏭️ Test skipped (e.g., known issue or precondition not met)

| Eval ID | eu.anthropic.claude-haiku-4-5-20251001-v1:0 | eu.anthropic.claude-opus-4-5-20251101-v1:0 | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | gemini/gemini-3-flash-preview | gemini/gemini-3-pro-preview | gpt-5.2 |
|---|---|---|---|---|---|---|
| 09_crashpod πŸ”— | 🟒 | 🟒 | 🟒 | 🟒 | 🟒 | 🟒 |
| 101_loki_historical_logs_pod_deleted πŸ”— | 🟑 | 🟑 | 🟑 | 🟑 | 🟑 | 🟑 |
| 108_logs_nearby_lines πŸ”— | πŸ”΄ | πŸ”΄ | πŸ”΄ | 🟑 | 🟑 | πŸ”΄ |
| 111_pod_names_contain_service πŸ”— | 🟒 | 🟒 | 🟒 | 🟒 | 🟒 | 🟒 |
| 12_job_crashing πŸ”— | 🟒 | 🟒 | 🟒 | 🟑 | 🟒 | 🟑 |
| 162_get_runbooks πŸ”— | 🟑 | 🟒 | 🟒 | 🟑 | πŸ”΄ | 🟑 |
| 176_network_policy_blocking_traffic_no_runbooks πŸ”— | 🟒 | 🟒 | 🟒 | 🟑 | 🟑 | 🟒 |
| 179_grafana_big_dashboard_query πŸ”— | 🟒 | 🟒 | 🟒 | 🟒 | 🟒 | 🟒 |
| 24_misconfigured_pvc πŸ”— | πŸ”΄ | 🟒 | 🟒 | 🟒 | 🟒 | 🟑 |
| 43_current_datetime_from_prompt πŸ”— | 🟒 | 🟒 | 🟒 | 🟒 | 🟒 | 🟑 |
| 61_exact_match_counting πŸ”— | 🟒 | 🟒 | 🟒 | 🟒 | 🟒 | 🟒 |
| 73a_time_window_anomaly πŸ”— | 🟑 | 🟒 | 🟑 | 🟑 | πŸ”΄ | 🟑 |
| 73b_time_window_anomaly πŸ”— | 🟑 | 🟒 | 🟒 | 🟒 | 🟒 | 🟒 |
| 96_no_matching_runbook πŸ”— | 🟒 | 🟑 | 🟒 | 🟑 | 🟑 | 🟑 |
| SUMMARY | 🟑 70% (49/70) | 🟑 83% (58/70) | 🟑 84% (59/70) | 🟑 79% (55/70) | 🟑 69% (48/70) | 🟑 71% (50/70) |
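
The color icons in the table above follow directly from each eval's pass fraction as described in the legend. A minimal illustrative mapping (a hypothetical helper, not the actual report generator):

```python
def status_icon(passed: int, total: int) -> str:
    """Map a pass fraction to the color icon used in the raw-results tables.

    Only the three fraction-derived icons are covered here; the other legend
    entries (πŸ”§, ⚠️, ⏱️, ⏭️) describe run conditions, not pass fractions.
    """
    if passed == total:
        return "🟒"  # 100% passing (stable)
    if passed == 0:
        return "πŸ”΄"  # 0% passing (failing)
    return "🟑"      # partially passing (1-99%)


# e.g. status_icon(3, 5) -> "🟑", matching a 60% (3/5) cell in the detailed table below
```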

Detailed Raw Results

| Eval ID | eu.anthropic.claude-haiku-4-5-20251001-v1:0 | eu.anthropic.claude-opus-4-5-20251101-v1:0 | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | gemini/gemini-3-flash-preview | gemini/gemini-3-pro-preview | gpt-5.2 |
|---|---|---|---|---|---|---|
| 09_crashpod πŸ”— | 🟒 100% (5/5) / ⏱️ 27.4s / πŸ’° $0.04 | 🟒 100% (5/5) / ⏱️ 31.7s / πŸ’° $0.13 | 🟒 100% (5/5) / ⏱️ 32.6s / πŸ’° $0.11 | 🟒 100% (5/5) / ⏱️ 22.4s / πŸ’° $0.04 | 🟒 100% (5/5) / ⏱️ 40.5s / πŸ’° $0.07 | 🟒 100% (5/5) / ⏱️ 30.3s / πŸ’° $0.07 |
| 101_loki_historical_logs_pod_deleted πŸ”— | 🟑 20% (1/5) / ⏱️ 34.0s / πŸ’° $0.05 | 🟑 20% (1/5) / ⏱️ 120.8s / πŸ’° $0.34 | 🟑 20% (1/5) / ⏱️ 58.2s / πŸ’° $0.19 | 🟑 40% (2/5) / ⏱️ 195.7s / πŸ’° $0.04 | 🟑 20% (1/5) / ⏱️ 123.7s / πŸ’° $0.21 | 🟑 20% (1/5) / ⏱️ 55.7s / πŸ’° $0.13 |
| 108_logs_nearby_lines πŸ”— | πŸ”΄ 0% (0/5) / ⏱️ 46.1s / πŸ’° $0.08 | πŸ”΄ 0% (0/5) / ⏱️ 52.4s / πŸ’° $0.29 | πŸ”΄ 0% (0/5) / ⏱️ 57.3s / πŸ’° $0.20 | 🟑 60% (3/5) / ⏱️ 131.5s / πŸ’° $0.09 | 🟑 80% (4/5) / ⏱️ 144.7s / πŸ’° $0.19 | πŸ”΄ 0% (0/5) / ⏱️ 68.3s / πŸ’° $0.18 |
| 111_pod_names_contain_service πŸ”— | 🟒 100% (5/5) / ⏱️ 24.2s / πŸ’° $0.03 | 🟒 100% (5/5) / ⏱️ 31.5s / πŸ’° $0.12 | 🟒 100% (5/5) / ⏱️ 42.7s / πŸ’° $0.14 | 🟒 100% (5/5) / ⏱️ 25.7s / πŸ’° $0.04 | 🟒 100% (5/5) / ⏱️ 46.9s / πŸ’° $0.08 | 🟒 100% (5/5) / ⏱️ 50.3s / πŸ’° $0.12 |
| 12_job_crashing πŸ”— | 🟒 100% (5/5) / ⏱️ 26.8s / πŸ’° $0.03 | 🟒 100% (5/5) / ⏱️ 45.6s / πŸ’° $0.19 | 🟒 100% (5/5) / ⏱️ 47.8s / πŸ’° $0.16 | 🟑 80% (4/5) / ⏱️ 130.5s / πŸ’° $0.08 | 🟒 100% (5/5) / ⏱️ 112.2s / πŸ’° $0.17 | 🟑 20% (1/5) / ⏱️ 43.7s / πŸ’° $0.09 |
| 162_get_runbooks πŸ”— | 🟑 60% (3/5) / ⏱️ 27.6s / πŸ’° $0.06 | 🟒 100% (5/5) / ⏱️ 44.9s / πŸ’° $0.32 | 🟒 100% (5/5) / ⏱️ 49.3s / πŸ’° $0.23 | 🟑 40% (2/5) / ⏱️ 37.8s / πŸ’° $0.07 | πŸ”΄ 0% (0/5) / ⏱️ 61.7s / πŸ’° $0.17 | 🟑 80% (4/5) / ⏱️ 39.2s / πŸ’° $0.11 |
| 176_network_policy_blocking_traffic_no_runbooks πŸ”— | 🟒 100% (5/5) / ⏱️ 39.5s / πŸ’° $0.07 | 🟒 100% (5/5) / ⏱️ 45.9s / πŸ’° $0.27 | 🟒 100% (5/5) / ⏱️ 41.0s / πŸ’° $0.18 | 🟑 80% (4/5) / ⏱️ 25.8s / πŸ’° $0.04 | 🟑 20% (1/5) / ⏱️ 28.0s / πŸ’° $0.05 | 🟒 100% (5/5) / ⏱️ 35.6s / πŸ’° $0.06 |
| 179_grafana_big_dashboard_query πŸ”— | 🟒 100% (5/5) / ⏱️ 21.2s / πŸ’° $0.04 | 🟒 100% (5/5) / ⏱️ 26.5s / πŸ’° $0.18 | 🟒 100% (5/5) / ⏱️ 23.7s / πŸ’° $0.12 | 🟒 100% (5/5) / ⏱️ 14.9s / πŸ’° $0.04 | 🟒 100% (5/5) / ⏱️ 26.4s / πŸ’° $0.08 | 🟒 100% (5/5) / ⏱️ 18.0s / πŸ’° $0.04 |
| 24_misconfigured_pvc πŸ”— | πŸ”΄ 0% (0/5) / ⏱️ 5.1s / πŸ’° $0.00 | 🟒 100% (5/5) / ⏱️ 39.1s / πŸ’° $0.16 | 🟒 100% (5/5) / ⏱️ 41.2s / πŸ’° $0.13 | 🟒 100% (5/5) / ⏱️ 28.2s / πŸ’° $0.04 | 🟒 100% (5/5) / ⏱️ 56.5s / πŸ’° $0.09 | 🟑 80% (4/5) / ⏱️ 33.9s / πŸ’° $0.08 |
| 43_current_datetime_from_prompt πŸ”— | 🟒 100% (5/5) / ⏱️ 4.0s / πŸ’° $0.00 | 🟒 100% (5/5) / ⏱️ 6.4s / πŸ’° $0.03 | 🟒 100% (5/5) / ⏱️ 7.0s / πŸ’° $0.03 | 🟒 100% (5/5) / ⏱️ 10.3s / πŸ’° $0.02 | 🟒 100% (5/5) / ⏱️ 12.8s / πŸ’° $0.02 | 🟑 80% (4/5) / ⏱️ 9.6s / πŸ’° $0.02 |
| 61_exact_match_counting πŸ”— | 🟒 100% (5/5) / ⏱️ 13.9s / πŸ’° $0.02 | 🟒 100% (5/5) / ⏱️ 23.8s / πŸ’° $0.07 | 🟒 100% (5/5) / ⏱️ 13.8s / πŸ’° $0.04 | 🟒 100% (5/5) / ⏱️ 12.1s / πŸ’° $0.03 | 🟒 100% (5/5) / ⏱️ 31.6s / πŸ’° $0.06 | 🟒 100% (5/5) / ⏱️ 13.8s / πŸ’° $0.04 |
| 73a_time_window_anomaly πŸ”— | 🟑 20% (1/5) / ⏱️ 27.2s / πŸ’° $0.04 | 🟒 100% (5/5) / ⏱️ 37.3s / πŸ’° $0.14 | 🟑 60% (3/5) / ⏱️ 43.3s / πŸ’° $0.25 | 🟑 80% (4/5) / ⏱️ 58.9s / πŸ’° $0.10 | πŸ”΄ 0% (0/5) / ⏱️ 104.3s / πŸ’° $0.25 | 🟑 80% (4/5) / ⏱️ 35.0s / πŸ’° $0.09 |
| 73b_time_window_anomaly πŸ”— | 🟑 80% (4/5) / ⏱️ 31.2s / πŸ’° $0.06 | 🟒 100% (5/5) / ⏱️ 40.8s / πŸ’° $0.16 | 🟒 100% (5/5) / ⏱️ 41.8s / πŸ’° $0.15 | 🟒 100% (5/5) / ⏱️ 48.6s / πŸ’° $0.07 | 🟒 100% (5/5) / ⏱️ 110.0s / πŸ’° $0.18 | 🟒 100% (5/5) / ⏱️ 32.1s / πŸ’° $0.09 |
| 96_no_matching_runbook πŸ”— | 🟒 100% (5/5) / ⏱️ 47.8s / πŸ’° $0.10 | 🟑 40% (2/5) / ⏱️ 61.0s / πŸ’° $0.35 | 🟒 100% (5/5) / ⏱️ 54.1s / πŸ’° $0.26 | 🟑 20% (1/5) / ⏱️ 198.1s / πŸ’° $0.10 | 🟑 40% (2/5) / ⏱️ 112.8s / πŸ’° $0.28 | 🟑 40% (2/5) / ⏱️ 41.9s / πŸ’° $0.12 |

Results are automatically generated and updated weekly. View full traces and detailed analysis in Braintrust experiment: ci-benchmark-21238150400.