⚡ January 22, 2026
Generated: 2026-01-22 07:43 UTC
Total Duration: 1h 27m 34s
Iterations: 1
Judge (classifier) model: gpt-4.1
Fast Benchmark
Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| eu.anthropic.claude-haiku-4-5-20251001-v1:0 | 49 | 21 | 0 | 70 | 🟡 70% (49/70) |
| eu.anthropic.claude-opus-4-5-20251101-v1:0 | 58 | 12 | 0 | 70 | 🟡 83% (58/70) |
| eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 59 | 11 | 0 | 70 | 🟡 84% (59/70) |
| gemini/gemini-3-flash-preview | 55 | 15 | 0 | 70 | 🟡 79% (55/70) |
| gemini/gemini-3-pro-preview | 48 | 22 | 0 | 70 | 🟡 69% (48/70) |
| gpt-5.2 | 50 | 20 | 0 | 70 | 🟡 71% (50/70) |
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| eu.anthropic.claude-haiku-4-5-20251001-v1:0 | 70 | $0.05 | $0.00 | $0.11 | $3.18 |
| eu.anthropic.claude-opus-4-5-20251101-v1:0 | 70 | $0.20 | $0.01 | $0.49 | $13.73 |
| eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 70 | $0.16 | $0.01 | $0.30 | $11.02 |
| gemini/gemini-3-flash-preview | 64 | $0.06 | $0.02 | $0.20 | $4.03 |
| gemini/gemini-3-pro-preview | 69 | $0.14 | $0.01 | $0.45 | $9.53 |
| gpt-5.2 | 70 | $0.09 | $0.00 | $0.26 | $6.34 |
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| eu.anthropic.claude-haiku-4-5-20251001-v1:0 | 26.8 | 3.8 | 53.6 | 27.1 | 51.4 |
| eu.anthropic.claude-opus-4-5-20251101-v1:0 | 43.4 | 6.2 | 209.2 | 40.5 | 75.3 |
| eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 39.6 | 4.9 | 65.4 | 42.6 | 60.4 |
| gemini/gemini-3-flash-preview | 67.2 | 9.8 | 437.7 | 33.3 | 366.5 |
| gemini/gemini-3-pro-preview | 72.3 | 11.7 | 290.0 | 58.6 | 169.4 |
| gpt-5.2 | 36.2 | 4.0 | 93.9 | 34.8 | 76.6 |
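The gap between the Avg and P50 columns for some models suggests a few very slow runs pulling the mean upward. The sketch below shows one common way to compute such percentiles, the nearest-rank method. The report does not state which method its generator uses, so the method is an assumption, and both the `percentile` helper and the sample latencies are hypothetical.

```python
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100.

    Assumption: nearest-rank is used for illustration only; the report's
    generator may interpolate between samples instead.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical latencies in seconds: four typical runs and one slow outlier.
latencies = [10.0, 12.0, 33.0, 35.0, 400.0]

avg = mean(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```

With these made-up samples the single outlier dominates both the average and P95, while P50 stays close to a typical run.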
Performance by Tag
Success rate by test category and model:
| Tag | eu.anthropic.claude-haiku-4-5-20251001-v1:0 | eu.anthropic.claude-opus-4-5-20251101-v1:0 | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | gemini/gemini-3-flash-preview | gemini/gemini-3-pro-preview | gpt-5.2 | Warnings |
|---|---|---|---|---|---|---|---|
| benchmark | 🟡 60% (15/25) | 🟡 68% (17/25) | 🟡 72% (18/25) | 🟡 72% (18/25) | 🟡 64% (16/25) | 🟡 64% (16/25) | |
| context_window | 🟡 50% (5/10) | 🟢 100% (10/10) | 🟡 80% (8/10) | 🟡 90% (9/10) | 🟡 50% (5/10) | 🟡 90% (9/10) | |
| counting | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| datetime | 🟡 67% (10/15) | 🟢 100% (15/15) | 🟡 87% (13/15) | 🟡 93% (14/15) | 🟡 67% (10/15) | 🟡 87% (13/15) | |
| easy | 🟡 74% (26/35) | 🟡 89% (31/35) | 🟡 89% (31/35) | 🟡 86% (30/35) | 🟡 77% (27/35) | 🟡 71% (25/35) | |
| grafana-dashboard | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| hard | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| kubernetes | 🟡 69% (24/35) | 🟡 89% (31/35) | 🟡 89% (31/35) | 🟡 80% (28/35) | 🟡 63% (22/35) | 🟡 83% (29/35) | |
| logs | 🟡 30% (6/20) | 🟡 55% (11/20) | 🟡 45% (9/20) | 🟡 70% (14/20) | 🟡 50% (10/20) | 🟡 50% (10/20) | |
| loki | 🟡 20% (1/5) | 🟡 20% (1/5) | 🟡 20% (1/5) | 🟡 40% (2/5) | 🟡 20% (1/5) | 🟡 20% (1/5) | |
| medium | 🟡 60% (18/30) | 🟡 73% (22/30) | 🟡 77% (23/30) | 🟡 67% (20/30) | 🟡 53% (16/30) | 🟡 67% (20/30) | |
| metrics | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| network | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 80% (4/5) | 🟡 20% (1/5) | 🟢 100% (5/5) | |
| one-test | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| port-forward | 🟡 60% (6/10) | 🟡 60% (6/10) | 🟡 60% (6/10) | 🟡 70% (7/10) | 🟡 60% (6/10) | 🟡 60% (6/10) | |
| question-answer | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| regression | 🟡 76% (34/45) | 🟡 91% (41/45) | 🟡 91% (41/45) | 🟡 82% (37/45) | 🟡 71% (32/45) | 🟡 76% (34/45) | |
| runbooks | 🟡 80% (8/10) | 🟡 70% (7/10) | 🟢 100% (10/10) | 🟡 30% (3/10) | 🟡 20% (2/10) | 🟡 60% (6/10) | |
| Overall | 🟡 70% (49/70) | 🟡 83% (58/70) | 🟡 84% (59/70) | 🟡 79% (55/70) | 🟡 69% (48/70) | 🟡 71% (50/70) | |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
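The three pass-rate levels in the legend above map directly onto a pass count and a total. As a minimal sketch, assuming the percentage is rounded to the nearest integer (the report does not document its rounding), a result cell could be produced as follows. `status_icon` and `format_cell` are hypothetical helper names, not part of HolmesGPT.

```python
def status_icon(passed: int, total: int) -> str:
    """Map a pass count onto the legend: 100% green, 0% red, otherwise yellow."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    if passed == total:
        return "🟢"
    if passed == 0:
        return "🔴"
    return "🟡"


def format_cell(passed: int, total: int) -> str:
    """Render a cell such as '🟡 70% (49/70)'.

    Assumption: the percentage is rounded to the nearest integer.
    """
    pct = round(100 * passed / total)
    return f"{status_icon(passed, total)} {pct}% ({passed}/{total})"


# Pass counts taken from the tables in this report.
overall = format_cell(49, 70)
stable = format_cell(5, 5)
failing = format_cell(0, 5)
```

The failure icons in the legend (mock data, setup, timeout, skipped) describe why a run produced no result and are not derived from the pass rate, so this sketch does not cover them.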
Detailed Raw Results
| Eval ID | eu.anthropic.claude-haiku-4-5-20251001-v1:0 | eu.anthropic.claude-opus-4-5-20251101-v1:0 | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | gemini/gemini-3-flash-preview | gemini/gemini-3-pro-preview | gpt-5.2 |
|---|---|---|---|---|---|---|
| 09_crashpod | 🟢 100% (5/5) / ⏱️ 27.4s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 31.7s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 32.6s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 22.4s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 40.5s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 30.3s / 💰 $0.07 |
| 101_loki_historical_logs_pod_deleted | 🟡 20% (1/5) / ⏱️ 34.0s / 💰 $0.05 | 🟡 20% (1/5) / ⏱️ 120.8s / 💰 $0.34 | 🟡 20% (1/5) / ⏱️ 58.2s / 💰 $0.19 | 🟡 40% (2/5) / ⏱️ 195.7s / 💰 $0.04 | 🟡 20% (1/5) / ⏱️ 123.7s / 💰 $0.21 | 🟡 20% (1/5) / ⏱️ 55.7s / 💰 $0.13 |
| 108_logs_nearby_lines | 🔴 0% (0/5) / ⏱️ 46.1s / 💰 $0.08 | 🔴 0% (0/5) / ⏱️ 52.4s / 💰 $0.29 | 🔴 0% (0/5) / ⏱️ 57.3s / 💰 $0.20 | 🟡 60% (3/5) / ⏱️ 131.5s / 💰 $0.09 | 🟡 80% (4/5) / ⏱️ 144.7s / 💰 $0.19 | 🔴 0% (0/5) / ⏱️ 68.3s / 💰 $0.18 |
| 111_pod_names_contain_service | 🟢 100% (5/5) / ⏱️ 24.2s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 31.5s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 42.7s / 💰 $0.14 | 🟢 100% (5/5) / ⏱️ 25.7s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 46.9s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 50.3s / 💰 $0.12 |
| 12_job_crashing | 🟢 100% (5/5) / ⏱️ 26.8s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 45.6s / 💰 $0.19 | 🟢 100% (5/5) / ⏱️ 47.8s / 💰 $0.16 | 🟡 80% (4/5) / ⏱️ 130.5s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 112.2s / 💰 $0.17 | 🟡 20% (1/5) / ⏱️ 43.7s / 💰 $0.09 |
| 162_get_runbooks | 🟡 60% (3/5) / ⏱️ 27.6s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 44.9s / 💰 $0.32 | 🟢 100% (5/5) / ⏱️ 49.3s / 💰 $0.23 | 🟡 40% (2/5) / ⏱️ 37.8s / 💰 $0.07 | 🔴 0% (0/5) / ⏱️ 61.7s / 💰 $0.17 | 🟡 80% (4/5) / ⏱️ 39.2s / 💰 $0.11 |
| 176_network_policy_blocking_traffic_no_runbooks | 🟢 100% (5/5) / ⏱️ 39.5s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 45.9s / 💰 $0.27 | 🟢 100% (5/5) / ⏱️ 41.0s / 💰 $0.18 | 🟡 80% (4/5) / ⏱️ 25.8s / 💰 $0.04 | 🟡 20% (1/5) / ⏱️ 28.0s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 35.6s / 💰 $0.06 |
| 179_grafana_big_dashboard_query | 🟢 100% (5/5) / ⏱️ 21.2s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 26.5s / 💰 $0.18 | 🟢 100% (5/5) / ⏱️ 23.7s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 14.9s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 26.4s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 18.0s / 💰 $0.04 |
| 24_misconfigured_pvc | 🔴 0% (0/5) / ⏱️ 5.1s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 39.1s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 41.2s / 💰 $0.13 | 🟢 100% (5/5) / ⏱️ 28.2s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 56.5s / 💰 $0.09 | 🟡 80% (4/5) / ⏱️ 33.9s / 💰 $0.08 |
| 43_current_datetime_from_prompt | 🟢 100% (5/5) / ⏱️ 4.0s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 6.4s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 7.0s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 10.3s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 12.8s / 💰 $0.02 | 🟡 80% (4/5) / ⏱️ 9.6s / 💰 $0.02 |
| 61_exact_match_counting | 🟢 100% (5/5) / ⏱️ 13.9s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 23.8s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 13.8s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 12.1s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 31.6s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 13.8s / 💰 $0.04 |
| 73a_time_window_anomaly | 🟡 20% (1/5) / ⏱️ 27.2s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 37.3s / 💰 $0.14 | 🟡 60% (3/5) / ⏱️ 43.3s / 💰 $0.25 | 🟡 80% (4/5) / ⏱️ 58.9s / 💰 $0.10 | 🔴 0% (0/5) / ⏱️ 104.3s / 💰 $0.25 | 🟡 80% (4/5) / ⏱️ 35.0s / 💰 $0.09 |
| 73b_time_window_anomaly | 🟡 80% (4/5) / ⏱️ 31.2s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 40.8s / 💰 $0.16 | 🟢 100% (5/5) / ⏱️ 41.8s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 48.6s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 110.0s / 💰 $0.18 | 🟢 100% (5/5) / ⏱️ 32.1s / 💰 $0.09 |
| 96_no_matching_runbook | 🟢 100% (5/5) / ⏱️ 47.8s / 💰 $0.10 | 🟡 40% (2/5) / ⏱️ 61.0s / 💰 $0.35 | 🟢 100% (5/5) / ⏱️ 54.1s / 💰 $0.26 | 🟡 20% (1/5) / ⏱️ 198.1s / 💰 $0.10 | 🟡 40% (2/5) / ⏱️ 112.8s / 💰 $0.28 | 🟡 40% (2/5) / ⏱️ 41.9s / 💰 $0.12 |
Results are automatically generated and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-21238150400.