⚡ January 27, 2026
Generated: 2026-01-27 16:11 UTC
Total Duration: 1h 28m 43s
Iterations: 1
Judge (classifier) model: gpt-4.1
Fast Benchmark
Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| deepseek/deepseek-chat | 10 | 4 | 0 | 14 | 🟡 71% (10/14) |
| deepseek/deepseek-reasoner | 10 | 4 | 0 | 14 | 🟡 71% (10/14) |
| eu.anthropic.claude-haiku-4-5-20251001-v1:0 | 11 | 3 | 0 | 14 | 🟡 79% (11/14) |
| eu.anthropic.claude-opus-4-5-20251101-v1:0 | 11 | 3 | 0 | 14 | 🟡 79% (11/14) |
| eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 10 | 4 | 0 | 14 | 🟡 71% (10/14) |
| gemini/gemini-3-flash-preview | 10 | 4 | 0 | 14 | 🟡 71% (10/14) |
| gemini/gemini-3-pro-preview | 7 | 7 | 0 | 14 | 🟡 50% (7/14) |
| gpt-5.2 | 9 | 5 | 0 | 14 | 🟡 64% (9/14) |
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| deepseek/deepseek-chat | 14 | $0.01 | $0.00 | $0.04 | $0.21 |
| deepseek/deepseek-reasoner | 14 | $0.02 | $0.00 | $0.04 | $0.25 |
| eu.anthropic.claude-haiku-4-5-20251001-v1:0 | 14 | $0.06 | $0.02 | $0.11 | $0.78 |
| eu.anthropic.claude-opus-4-5-20251101-v1:0 | 14 | $0.26 | $0.10 | $0.48 | $3.65 |
| eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 14 | $0.19 | $0.07 | $0.29 | $2.73 |
| gemini/gemini-3-flash-preview | 12 | $0.05 | $0.02 | $0.12 | $0.63 |
| gemini/gemini-3-pro-preview | 11 | $0.19 | $0.03 | $0.31 | $2.04 |
| gpt-5.2 | 14 | $0.11 | $0.02 | $0.24 | $1.51 |
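
The Tests column is lower than 14 for the two Gemini models, which matches the number of cells in the detailed results that show no cost. A minimal sketch of how such a summary could be aggregated, assuming runs without a recorded cost are simply excluded (the report does not state this; `cost_summary` and `flash_costs` are hypothetical names, and the values are the rounded per-test figures from the detailed table):

```python
from statistics import mean

# Per-test costs (USD) for gemini/gemini-3-flash-preview, copied from the
# detailed results table in eval order. None marks a cell with no cost shown.
flash_costs = [0.04, None, 0.09, 0.04, None, 0.04, 0.05,
               0.03, 0.04, 0.03, 0.02, 0.06, 0.06, 0.12]


def cost_summary(costs):
    """Summarize costs, skipping runs with no recorded cost (an assumption)."""
    priced = [c for c in costs if c is not None]
    return {
        "tests": len(priced),
        "avg": round(mean(priced), 2),
        "min": min(priced),
        "max": max(priced),
        "total": round(sum(priced), 2),
    }


summary = cost_summary(flash_costs)
print(summary)
```

Because the inputs here are already rounded to the cent, the recomputed total can differ from the reported total by a cent or so; the test count, average, minimum, and maximum line up with the table row.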
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| deepseek/deepseek-chat | 122.9 | 19.4 | 210.1 | 129.5 | 210.1 |
| deepseek/deepseek-reasoner | 235.2 | 7.8 | 567.0 | 253.2 | 567.0 |
| eu.anthropic.claude-haiku-4-5-20251001-v1:0 | 26.7 | 4.9 | 52.2 | 28.0 | 52.2 |
| eu.anthropic.claude-opus-4-5-20251101-v1:0 | 48.5 | 6.0 | 194.7 | 39.0 | 194.7 |
| eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 34.5 | 5.0 | 57.7 | 36.0 | 57.7 |
| gemini/gemini-3-flash-preview | 99.3 | 19.6 | 452.8 | 68.4 | 452.8 |
| gemini/gemini-3-pro-preview | 702.3 | 10.4 | 1523.9 | 883.4 | 1523.9 |
| gpt-5.2 | 39.3 | 12.0 | 73.8 | 39.4 | 73.8 |
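
With only 14 runs per model, P95 coincides with Max in every row, and P50 is a single observed duration rather than an interpolated median. The sketch below shows one percentile convention that reproduces the deepseek/deepseek-chat row from the per-test durations in the detailed results. The indexing rule (`sorted[int(n * p)]`) is an assumption chosen because it matches the published numbers; the report does not document its method, and `latency_summary` is a hypothetical helper.

```python
# Per-test durations (seconds) for deepseek/deepseek-chat, copied from the
# detailed results table in eval order.
deepseek_chat_durations = [100.4, 156.6, 210.1, 140.8, 119.8, 119.2, 129.5,
                           63.5, 102.6, 19.4, 70.1, 174.4, 153.1, 160.5]


def latency_summary(durations):
    """Avg/min/max plus P50/P95 using a simple sorted-index percentile."""
    ordered = sorted(durations)
    n = len(ordered)

    def percentile(p):
        # Assumed convention: take the element at index int(n * p),
        # clamped to the last element. No interpolation.
        return ordered[min(int(n * p), n - 1)]

    return {
        "avg": round(sum(ordered) / n, 1),
        "min": ordered[0],
        "max": ordered[-1],
        "p50": percentile(0.50),
        "p95": percentile(0.95),
    }


stats = latency_summary(deepseek_chat_durations)
print(stats)
```

Under this convention, any sample of fewer than 20 runs reports its maximum as P95, so the P95 column is best read as a worst-case figure for this benchmark size.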
Performance by Tag
Success rate by test category and model:
| Tag | deepseek/deepseek-chat | deepseek/deepseek-reasoner | eu.anthropic.claude-haiku-4-5-20251001-v1:0 | eu.anthropic.claude-opus-4-5-20251101-v1:0 | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | gemini/gemini-3-flash-preview | gemini/gemini-3-pro-preview | gpt-5.2 | Warnings |
|---|---|---|---|---|---|---|---|---|---|
| benchmark | 🟡 80% (4/5) | 🟡 60% (3/5) | 🟡 80% (4/5) | 🟡 60% (3/5) | 🟡 40% (2/5) | 🟡 80% (4/5) | 🟡 80% (4/5) | 🟡 60% (3/5) | |
| context_window | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | |
| counting | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| datetime | 🟡 67% (2/3) | 🟢 100% (3/3) | 🟢 100% (3/3) | 🟢 100% (3/3) | 🟡 33% (1/3) | 🟡 67% (2/3) | 🟡 33% (1/3) | 🟡 67% (2/3) | |
| easy | 🟡 57% (4/7) | 🟡 71% (5/7) | 🟡 71% (5/7) | 🟡 86% (6/7) | 🟡 86% (6/7) | 🟡 57% (4/7) | 🟡 29% (2/7) | 🟡 57% (4/7) | |
| grafana-dashboard | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| hard | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| kubernetes | 🟡 86% (6/7) | 🟡 71% (5/7) | 🟡 71% (5/7) | 🟡 86% (6/7) | 🟡 86% (6/7) | 🟡 86% (6/7) | 🟡 43% (3/7) | 🟡 86% (6/7) | |
| logs | 🟡 50% (2/4) | 🟡 50% (2/4) | 🟡 50% (2/4) | 🟡 50% (2/4) | 🔴 0% (0/4) | 🟡 75% (3/4) | 🟡 50% (2/4) | 🟡 50% (2/4) | |
| loki | 🔴 0% (0/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | |
| medium | 🟡 83% (5/6) | 🟡 67% (4/6) | 🟡 83% (5/6) | 🟡 67% (4/6) | 🟡 50% (3/6) | 🟡 83% (5/6) | 🟡 67% (4/6) | 🟡 67% (4/6) | |
| metrics | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| network | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| one-test | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| port-forward | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | |
| question-answer | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| regression | 🟡 67% (6/9) | 🟡 78% (7/9) | 🟡 78% (7/9) | 🟡 89% (8/9) | 🟡 89% (8/9) | 🟡 67% (6/9) | 🟡 33% (3/9) | 🟡 67% (6/9) | |
| runbooks | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | |
| Overall | 🟡 71% (10/14) | 🟡 71% (10/14) | 🟡 79% (11/14) | 🟡 79% (11/14) | 🟡 71% (10/14) | 🟡 71% (10/14) | 🟡 50% (7/14) | 🟡 64% (9/14) | |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
| Eval ID | deepseek/deepseek-chat | deepseek/deepseek-reasoner | eu.anthropic.claude-haiku-4-5-20251001-v1:0 | eu.anthropic.claude-opus-4-5-20251101-v1:0 | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | gemini/gemini-3-flash-preview | gemini/gemini-3-pro-preview | gpt-5.2 |
|---|---|---|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 |
| 108_logs_nearby_lines 🔗 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 |
| 111_pod_names_contain_service 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 12_job_crashing 🔗 | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴 |
| 162_get_runbooks 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 |
| 176_network_policy_blocking_traffic_no_runbooks 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 24_misconfigured_pvc 🔗 | 🟢 | 🔴 | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| 43_current_datetime_from_prompt 🔗 | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 |
| 61_exact_match_counting 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 |
| 73a_time_window_anomaly 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴 | 🟢 |
| 73b_time_window_anomaly 🔗 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 🟢 |
| 96_no_matching_runbook 🔗 | 🟢 | 🔴 | 🟢 | 🔴 | 🟢 | 🔴 | 🟢 | 🔴 |
| SUMMARY | 🟡 71% (10/14) | 🟡 71% (10/14) | 🟡 79% (11/14) | 🟡 79% (11/14) | 🟡 71% (10/14) | 🟡 71% (10/14) | 🟡 50% (7/14) | 🟡 64% (9/14) |
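
The pass-rate cells in the tables above follow the color coding from the legend: green at 100%, red at 0%, yellow for anything in between. A minimal sketch of that mapping, assuming the percentage is rounded to the nearest integer (`status_cell` is a hypothetical helper, not the report generator's actual code; the special markers for mock, setup, timeout, and skipped runs are not modeled):

```python
def status_cell(passed: int, total: int) -> str:
    """Format a cell such as '🟡 71% (10/14)' from pass/total counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")

    if passed == total:
        icon = "🟢"  # passing 100% (stable)
    elif passed == 0:
        icon = "🔴"  # passing 0% (failing)
    else:
        icon = "🟡"  # passing 1-99%

    percent = round(100 * passed / total)
    return f"{icon} {percent}% ({passed}/{total})"


print(status_cell(10, 14))
print(status_cell(2, 2))
print(status_cell(0, 1))
```

Applied to the pass and total counts in the Model Accuracy Comparison table, this formatting yields the same strings that appear in the SUMMARY row.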
Detailed Raw Results
| Eval ID | deepseek/deepseek-chat | deepseek/deepseek-reasoner | eu.anthropic.claude-haiku-4-5-20251001-v1:0 | eu.anthropic.claude-opus-4-5-20251101-v1:0 | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | gemini/gemini-3-flash-preview | gemini/gemini-3-pro-preview | gpt-5.2 |
|---|---|---|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 100% (1/1) / ⏱️ 100.4s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 197.2s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 26.1s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 29.8s / 💰 $0.22 | 🟢 100% (1/1) / ⏱️ 27.9s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 68.4s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 284.7s | 🟢 100% (1/1) / ⏱️ 50.5s / 💰 $0.11 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🔴 0% (0/1) / ⏱️ 156.6s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 361.8s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 28.7s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 194.7s / 💰 $0.40 | 🔴 0% (0/1) / ⏱️ 40.1s / 💰 $0.18 | 🔴 0% (0/1) / ⏱️ 258.0s | 🔴 0% (0/1) / ⏱️ 340.5s / 💰 $0.31 | 🔴 0% (0/1) / ⏱️ 39.4s / 💰 $0.14 |
| 108_logs_nearby_lines 🔗 | 🔴 0% (0/1) / ⏱️ 210.1s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 567.0s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 33.6s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 44.0s / 💰 $0.28 | 🔴 0% (0/1) / ⏱️ 44.2s / 💰 $0.27 | 🟢 100% (1/1) / ⏱️ 72.5s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 912.4s / 💰 $0.28 | 🔴 0% (0/1) / ⏱️ 38.9s / 💰 $0.09 |
| 111_pod_names_contain_service 🔗 | 🟢 100% (1/1) / ⏱️ 140.8s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 260.8s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 28.0s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 27.5s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 34.4s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 51.8s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 1415.8s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 41.2s / 💰 $0.09 |
| 12_job_crashing 🔗 | 🔴 0% (0/1) / ⏱️ 119.8s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 317.3s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 33.0s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 39.0s / 💰 $0.26 | 🟢 100% (1/1) / ⏱️ 38.7s / 💰 $0.19 | 🔴 0% (0/1) / ⏱️ 452.8s | 🟢 100% (1/1) / ⏱️ 1523.9s / 💰 $0.23 | 🔴 0% (0/1) / ⏱️ 33.9s / 💰 $0.07 |
| 162_get_runbooks 🔗 | 🟢 100% (1/1) / ⏱️ 119.2s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 253.2s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 33.2s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 50.3s / 💰 $0.36 | 🟢 100% (1/1) / ⏱️ 43.2s / 💰 $0.22 | 🟢 100% (1/1) / ⏱️ 54.1s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 626.3s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 52.6s / 💰 $0.16 |
| 176_network_policy_blocking_traffic_no_runbooks 🔗 | 🟢 100% (1/1) / ⏱️ 129.5s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 255.1s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 38.3s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 50.4s / 💰 $0.32 | 🟢 100% (1/1) / ⏱️ 42.9s / 💰 $0.22 | 🟢 100% (1/1) / ⏱️ 71.6s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 10.4s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 34.7s / 💰 $0.08 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 100% (1/1) / ⏱️ 63.5s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 138.4s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 26.5s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 29.9s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 22.0s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 75.8s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 911.5s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 25.1s / 💰 $0.09 |
| 24_misconfigured_pvc 🔗 | 🟢 100% (1/1) / ⏱️ 102.6s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 33.9s / 💰 $0.00 | 🔴 0% (0/1) / ⏱️ 4.9s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 33.0s / 💰 $0.22 | 🟢 100% (1/1) / ⏱️ 34.9s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 35.5s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 1110.0s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 36.0s / 💰 $0.07 |
| 43_current_datetime_from_prompt 🔗 | 🔴 0% (0/1) / ⏱️ 19.4s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 7.8s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 6.5s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 6.0s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 5.0s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 19.6s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 44.1s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 12.0s / 💰 $0.02 |
| 61_exact_match_counting 🔗 | 🟢 100% (1/1) / ⏱️ 70.1s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 86.6s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 10.8s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 17.2s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 22.2s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 31.1s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 284.4s | 🟢 100% (1/1) / ⏱️ 16.6s / 💰 $0.07 |
| 73a_time_window_anomaly 🔗 | 🟢 100% (1/1) / ⏱️ 174.4s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 201.2s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 27.6s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 33.8s / 💰 $0.22 | 🔴 0% (0/1) / ⏱️ 33.4s / 💰 $0.27 | 🟢 100% (1/1) / ⏱️ 57.4s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 1303.9s | 🟢 100% (1/1) / ⏱️ 51.4s / 💰 $0.17 |
| 73b_time_window_anomaly 🔗 | 🟢 100% (1/1) / ⏱️ 153.1s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 246.0s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 24.9s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 40.0s / 💰 $0.24 | 🔴 0% (0/1) / ⏱️ 36.0s / 💰 $0.29 | 🟢 100% (1/1) / ⏱️ 36.7s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 180.9s / 💰 $0.27 | 🟢 100% (1/1) / ⏱️ 44.5s / 💰 $0.11 |
| 96_no_matching_runbook 🔗 | 🟢 100% (1/1) / ⏱️ 160.5s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 366.0s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 52.2s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 83.8s / 💰 $0.48 | 🟢 100% (1/1) / ⏱️ 57.7s / 💰 $0.29 | 🔴 0% (0/1) / ⏱️ 105.7s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 883.4s / 💰 $0.29 | 🔴 0% (0/1) / ⏱️ 73.8s / 💰 $0.24 |
Results are generated automatically and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-21401358384.