⚡ May 31, 2026
Generated: 2026-05-31 03:51 UTC
Total Duration: 30m 32s
Iterations: 1
Judge (classifier) model: gpt-4.1
Fast Benchmark
Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| deepseek-r1-reasoner | 14 | 3 | 0 | 17 | 🟡 82% (14/17) |
| deepseek-v3.2-chat | 15 | 2 | 0 | 17 | 🟡 88% (15/17) |
| gemini-3.1-pro-preview | 15 | 2 | 0 | 17 | 🟡 88% (15/17) |
| gpt-5.3-codex | 8 | 9 | 0 | 17 | 🟡 47% (8/17) |
| gpt-5.4 | 16 | 1 | 0 | 17 | 🟡 94% (16/17) |
| haiku-4.5 | 13 | 4 | 0 | 17 | 🟡 76% (13/17) |
| opus-4.6 | 15 | 2 | 0 | 17 | 🟡 88% (15/17) |
| opus-4.7 | 15 | 2 | 0 | 17 | 🟡 88% (15/17) |
| qwen-next-80B-instruct | 8 | 9 | 0 | 17 | 🟡 47% (8/17) |
| qwen-next-80B-thinking | 7 | 10 | 0 | 17 | 🟡 41% (7/17) |
| sonnet-4.6 | 16 | 1 | 0 | 17 | 🟡 94% (16/17) |
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| deepseek-r1-reasoner | 17 | $0.01 | $0.00 | $0.03 | $0.17 |
| deepseek-v3.2-chat | 17 | $0.01 | $0.00 | $0.02 | $0.15 |
| gemini-3.1-pro-preview | 17 | $0.13 | $0.03 | $0.37 | $2.20 |
| gpt-5.3-codex | 17 | $0.02 | $0.00 | $0.06 | $0.42 |
| gpt-5.4 | 17 | $0.08 | $0.01 | $0.18 | $1.42 |
| haiku-4.5 | 17 | $0.06 | $0.02 | $0.11 | $0.95 |
| opus-4.6 | 17 | $0.32 | $0.12 | $0.56 | $5.39 |
| opus-4.7 | 17 | $0.32 | $0.09 | $0.85 | $5.43 |
| qwen-next-80B-instruct | 17 | $0.03 | $0.00 | $0.06 | $0.49 |
| qwen-next-80B-thinking | 17 | $0.02 | $0.00 | $0.06 | $0.37 |
| sonnet-4.6 | 17 | $0.17 | $0.07 | $0.26 | $2.88 |
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| deepseek-r1-reasoner | 35.9 | 6.6 | 106.0 | 28.5 | 106.0 |
| deepseek-v3.2-chat | 25.4 | 5.4 | 51.8 | 23.9 | 51.8 |
| gemini-3.1-pro-preview | 35.8 | 11.6 | 96.9 | 31.5 | 96.9 |
| gpt-5.3-codex | 11.4 | 3.8 | 18.9 | 12.6 | 18.9 |
| gpt-5.4 | 30.4 | 7.2 | 61.1 | 28.7 | 61.1 |
| haiku-4.5 | 25.2 | 6.4 | 49.3 | 23.3 | 49.3 |
| opus-4.6 | 62.8 | 6.6 | 321.2 | 42.9 | 321.2 |
| opus-4.7 | 38.0 | 10.4 | 96.5 | 34.4 | 96.5 |
| qwen-next-80B-instruct | 32.3 | 5.6 | 66.2 | 33.7 | 66.2 |
| qwen-next-80B-thinking | 45.1 | 6.6 | 112.9 | 23.4 | 112.9 |
| sonnet-4.6 | 35.6 | 5.6 | 55.5 | 41.4 | 55.5 |
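In every row above, P95 equals Max. The report does not state how its percentiles are computed, but this pattern is what a nearest-rank percentile would produce with only 17 samples per model, since the 95th-percentile rank rounds up to the last sample. The sketch below illustrates this using the gpt-5.3-codex latencies from the detailed results table. The function name `nearest_rank_percentile` and the choice of the nearest-rank method are assumptions, not a description of the actual reporting code.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Return the nearest-rank percentile of samples (pct in 0..100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: take the smallest rank covering pct percent of the samples.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# gpt-5.3-codex latencies in seconds, copied from the detailed results table.
codex_latencies = [
    4.6, 10.5, 12.0, 18.5, 13.2, 3.8, 13.8, 12.6, 17.5,
    4.7, 14.4, 4.6, 18.9, 8.7, 12.7, 10.2, 13.3,
]

p50 = nearest_rank_percentile(codex_latencies, 50)  # 12.6
p95 = nearest_rank_percentile(codex_latencies, 95)  # 18.9, same as max
print(p50, p95, max(codex_latencies))
```

With a sample this small, P95 says little more than the maximum does, so a single slow run (such as the 321.2 s opus-4.6 run) dominates both columns.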
Performance by Tag
Success rate by test category and model:
| Tag | deepseek-r1-reasoner | deepseek-v3.2-chat | gemini-3.1-pro-preview | gpt-5.3-codex | gpt-5.4 | haiku-4.5 | opus-4.6 | opus-4.7 | qwen-next-80B-instruct | qwen-next-80B-thinking | sonnet-4.6 | Warnings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| benchmark | 🟡 67% (4/6) | 🟡 83% (5/6) | 🟡 67% (4/6) | 🟡 17% (1/6) | 🟡 83% (5/6) | 🟡 33% (2/6) | 🟡 67% (4/6) | 🟡 67% (4/6) | 🟡 17% (1/6) | 🟡 33% (2/6) | 🟡 83% (5/6) | |
| context_window | 🟡 50% (1/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🔴 0% (0/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | |
| counting | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | |
| datetime | 🟡 33% (1/3) | 🟡 67% (2/3) | 🟡 67% (2/3) | 🟡 33% (1/3) | 🟢 100% (3/3) | 🟡 33% (1/3) | 🟢 100% (3/3) | 🟢 100% (3/3) | 🟡 33% (1/3) | 🟡 33% (1/3) | 🟢 100% (3/3) | |
| easy | 🟡 88% (7/8) | 🟡 88% (7/8) | 🟢 100% (8/8) | 🟡 62% (5/8) | 🟢 100% (8/8) | 🟢 100% (8/8) | 🟢 100% (8/8) | 🟢 100% (8/8) | 🟡 50% (4/8) | 🟡 50% (4/8) | 🟢 100% (8/8) | |
| grafana | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| hard | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🔴 0% (0/2) | 🟡 50% (1/2) | 🔴 0% (0/2) | 🟡 50% (1/2) | |
| kubernetes | 🟢 100% (9/9) | 🟢 100% (9/9) | 🟢 100% (9/9) | 🟡 56% (5/9) | 🟢 100% (9/9) | 🟡 89% (8/9) | 🟢 100% (9/9) | 🟡 89% (8/9) | 🟡 67% (6/9) | 🟡 44% (4/9) | 🟢 100% (9/9) | |
| logs | 🟡 67% (4/6) | 🟡 83% (5/6) | 🟡 67% (4/6) | 🟡 33% (2/6) | 🟡 83% (5/6) | 🟡 33% (2/6) | 🟡 83% (5/6) | 🟡 83% (5/6) | 🟡 17% (1/6) | 🟡 50% (3/6) | 🟡 83% (5/6) | |
| loki | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | |
| medium | 🟡 83% (5/6) | 🟢 100% (6/6) | 🟡 83% (5/6) | 🟡 17% (1/6) | 🟢 100% (6/6) | 🟡 50% (3/6) | 🟡 83% (5/6) | 🟢 100% (6/6) | 🟡 33% (2/6) | 🟡 33% (2/6) | 🟢 100% (6/6) | |
| metrics | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| network | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| one-test | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| port-forward | 🟢 100% (3/3) | 🟢 100% (3/3) | 🟢 100% (3/3) | 🟡 67% (2/3) | 🟢 100% (3/3) | 🟡 67% (2/3) | 🟢 100% (3/3) | 🟡 67% (2/3) | 🟡 33% (1/3) | 🟡 67% (2/3) | 🟢 100% (3/3) | |
| question-answer | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| regression | 🟡 91% (10/11) | 🟡 91% (10/11) | 🟢 100% (11/11) | 🟡 64% (7/11) | 🟢 100% (11/11) | 🟢 100% (11/11) | 🟢 100% (11/11) | 🟢 100% (11/11) | 🟡 64% (7/11) | 🟡 45% (5/11) | 🟢 100% (11/11) | |
| skills | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| Overall | 🟡 82% (14/17) | 🟡 88% (15/17) | 🟡 88% (15/17) | 🟡 47% (8/17) | 🟡 94% (16/17) | 🟡 76% (13/17) | 🟡 88% (15/17) | 🟡 88% (15/17) | 🟡 47% (8/17) | 🟡 41% (7/17) | 🟡 94% (16/17) | |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
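The pass-rate cells used throughout this report follow directly from the first three legend entries. The sketch below shows one way such a cell could be derived from pass and total counts. The function name `status_cell` is hypothetical, and the use of Python's built-in `round` is an assumption that happens to reproduce the percentages shown in the tables (for example 62% for 5/8 and 88% for 7/8); the report's actual generator may differ.

```python
def status_cell(passed: int, total: int) -> str:
    """Format a pass-rate cell such as '🟡 82% (14/17)'."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("expected 0 <= passed <= total and total > 0")
    # Built-in round() rounds exact halves to the nearest even integer.
    pct = round(100 * passed / total)
    if passed == total:
        icon = "🟢"  # passing 100% (stable)
    elif passed == 0:
        icon = "🔴"  # passing 0% (failing)
    else:
        icon = "🟡"  # passing 1-99%
    return f"{icon} {pct}% ({passed}/{total})"


print(status_cell(14, 17))  # 🟡 82% (14/17)
print(status_cell(2, 2))    # 🟢 100% (2/2)
print(status_cell(0, 1))    # 🔴 0% (0/1)
```

The remaining legend entries (mock data failure, setup failure, timeout, skipped) describe run outcomes rather than pass rates, so they are outside what this sketch covers.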
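Detailed Raw Results
Each cell shows the pass rate, followed by the run's latency (⏱️) and cost (💰). In these cells ⏱️ marks a duration, not the timeout status from the legend above.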
| Eval ID | deepseek-r1-reasoner | deepseek-v3.2-chat | gemini-3.1-pro-preview | gpt-5.3-codex | gpt-5.4 | haiku-4.5 | opus-4.6 | opus-4.7 | qwen-next-80B-instruct | qwen-next-80B-thinking | sonnet-4.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 100% (1/1) / ⏱️ 28.5s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 19.4s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 31.6s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 4.6s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 27.6s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 23.3s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 49.4s / 💰 $0.33 | 🟢 100% (1/1) / ⏱️ 29.6s / 💰 $0.22 | 🟢 100% (1/1) / ⏱️ 28.3s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 9.1s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 38.8s / 💰 $0.19 |
| 100a_loki_historical_logs 🔗 | 🟢 100% (1/1) / ⏱️ 81.3s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 35.9s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 96.9s / 💰 $0.37 | 🔴 0% (0/1) / ⏱️ 10.5s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 49.0s / 💰 $0.10 | 🔴 0% (0/1) / ⏱️ 36.3s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 98.1s / 💰 $0.54 | 🟢 100% (1/1) / ⏱️ 35.4s / 💰 $0.31 | 🔴 0% (0/1) / ⏱️ 33.7s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 112.9s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 46.2s / 💰 $0.19 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟢 100% (1/1) / ⏱️ 106.0s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 42.1s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 33.2s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 12.0s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 34.9s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 27.8s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 90.0s / 💰 $0.46 | 🟢 100% (1/1) / ⏱️ 35.5s / 💰 $0.35 | 🔴 0% (0/1) / ⏱️ 34.5s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 54.5s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 41.5s / 💰 $0.18 |
| 108_logs_nearby_lines 🔗 | 🔴 0% (0/1) / ⏱️ 39.4s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 33.2s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 43.1s / 💰 $0.15 | 🔴 0% (0/1) / ⏱️ 18.5s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 39.1s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 32.1s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 60.9s / 💰 $0.38 | 🔴 0% (0/1) / ⏱️ 56.6s / 💰 $0.45 | 🔴 0% (0/1) / ⏱️ 48.0s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 90.8s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 43.7s / 💰 $0.19 |
| 112_find_pvcs_by_uuid 🔗 | 🟢 100% (1/1) / ⏱️ 20.9s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 14.5s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 21.3s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 13.2s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 18.2s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 17.6s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 26.1s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 33.3s / 💰 $0.24 | 🟢 100% (1/1) / ⏱️ 21.8s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 79.0s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 21.0s / 💰 $0.12 |
| 12_job_crashing 🔗 | 🟢 100% (1/1) / ⏱️ 32.3s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 29.8s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 73.3s / 💰 $0.25 | 🔴 0% (0/1) / ⏱️ 3.8s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 35.1s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 30.1s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 42.9s / 💰 $0.31 | 🟢 100% (1/1) / ⏱️ 49.1s / 💰 $0.32 | 🔴 0% (0/1) / ⏱️ 29.2s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 18.9s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 47.1s / 💰 $0.22 |
| 176_network_policy_blocking_traffic_no_skills 🔗 | 🟢 100% (1/1) / ⏱️ 43.0s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 45.8s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 31.5s / 💰 $0.12 | 🔴 0% (0/1) / ⏱️ 13.8s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 31.0s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 49.3s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 57.2s / 💰 $0.34 | 🟢 100% (1/1) / ⏱️ 27.4s / 💰 $0.34 | 🔴 0% (0/1) / ⏱️ 66.2s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 102.3s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 44.6s / 💰 $0.21 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 100% (1/1) / ⏱️ 18.0s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 12.0s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 32.9s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 12.6s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 20.4s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 21.0s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 23.7s / 💰 $0.24 | 🔴 0% (0/1) / ⏱️ 96.5s / 💰 $0.85 | 🟢 100% (1/1) / ⏱️ 17.2s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 23.4s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 20.0s / 💰 $0.13 |
| 227_count_configmaps_per_namespace[0] 🔗 | 🟢 100% (1/1) / ⏱️ 15.2s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 17.3s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 38.3s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 17.5s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 21.1s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 18.7s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 25.9s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 29.8s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 37.8s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 53.1s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 25.7s / 💰 $0.16 |
| 243_pod_names_contain_service 🔗 | 🟢 100% (1/1) / ⏱️ 21.6s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 23.9s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 30.9s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 4.7s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 28.7s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 30.3s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 37.5s / 💰 $0.25 | 🟢 100% (1/1) / ⏱️ 38.5s / 💰 $0.27 | 🟢 100% (1/1) / ⏱️ 38.3s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 8.8s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 41.4s / 💰 $0.19 |
| 24_misconfigured_pvc 🔗 | 🟢 100% (1/1) / ⏱️ 24.2s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 23.9s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 31.3s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 14.4s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 28.5s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 24.4s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 40.8s / 💰 $0.29 | 🟢 100% (1/1) / ⏱️ 29.2s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 46.0s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 6.6s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 41.1s / 💰 $0.19 |
| 43_current_datetime_from_prompt 🔗 | 🔴 0% (0/1) / ⏱️ 6.6s / 💰 $0.00 | 🔴 0% (0/1) / ⏱️ 5.4s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 14.4s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 4.6s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 7.2s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 6.4s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 6.6s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 10.4s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 5.6s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 19.6s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 5.6s / 💰 $0.07 |
| 51_logs_summarize_errors 🔗 | 🟢 100% (1/1) / ⏱️ 15.0s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 14.1s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 21.1s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 18.9s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 22.1s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 23.0s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 27.7s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 28.1s / 💰 $0.31 | 🟢 100% (1/1) / ⏱️ 22.6s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 61.8s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 27.2s / 💰 $0.12 |
| 61_exact_match_counting 🔗 | 🟢 100% (1/1) / ⏱️ 7.9s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 5.6s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 11.6s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 8.7s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 10.0s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 10.8s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 12.8s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 13.1s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 13.0s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 18.5s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 11.2s / 💰 $0.09 |
| 73a_time_window_anomaly 🔗 | 🟢 100% (1/1) / ⏱️ 40.7s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 30.6s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 26.6s / 💰 $0.08 | 🔴 0% (0/1) / ⏱️ 12.7s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 33.1s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 8.1s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 76.3s / 💰 $0.40 | 🟢 100% (1/1) / ⏱️ 37.5s / 💰 $0.27 | 🔴 0% (0/1) / ⏱️ 24.8s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 14.6s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 47.8s / 💰 $0.19 |
| 73b_time_window_anomaly 🔗 | 🔴 0% (0/1) / ⏱️ 39.3s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 27.2s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 27.3s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 10.2s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 50.6s / 💰 $0.12 | 🔴 0% (0/1) / ⏱️ 23.0s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 70.3s / 💰 $0.39 | 🟢 100% (1/1) / ⏱️ 34.4s / 💰 $0.23 | 🔴 0% (0/1) / ⏱️ 37.7s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 15.1s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 47.2s / 💰 $0.20 |
| 96_no_matching_skill 🔗 | 🟢 100% (1/1) / ⏱️ 69.7s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 51.8s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 42.7s / 💰 $0.16 | 🔴 0% (0/1) / ⏱️ 13.3s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 61.1s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 46.7s / 💰 $0.10 | 🔴 0% (0/1) / ⏱️ 321.2s / 💰 $0.56 | 🟢 100% (1/1) / ⏱️ 61.4s / 💰 $0.62 | 🔴 0% (0/1) / ⏱️ 44.4s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 77.6s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 55.5s / 💰 $0.26 |
Results are generated automatically and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-26701915314.