⚡ May 24, 2026
Generated: 2026-05-24 13:28 UTC
Total Duration: 2h 55m 18s
Iterations: 5
Judge (classifier) model: gpt-4.1
Fast Benchmark
Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| deepseek-r1-reasoner | 67 | 18 | 0 | 85 | 🟡 79% (67/85) |
| deepseek-v3.2-chat | 67 | 18 | 0 | 85 | 🟡 79% (67/85) |
| gemini-3.1-pro-preview | 68 | 17 | 0 | 85 | 🟡 80% (68/85) |
| gpt-5.3-codex | 48 | 37 | 0 | 85 | 🟡 56% (48/85) |
| gpt-5.4 | 68 | 17 | 0 | 85 | 🟡 80% (68/85) |
| haiku-4.5 | 60 | 25 | 0 | 85 | 🟡 71% (60/85) |
| opus-4.6 | 75 | 10 | 0 | 85 | 🟡 88% (75/85) |
| opus-4.7 | 73 | 12 | 0 | 85 | 🟡 86% (73/85) |
| qwen-next-80B-instruct | 43 | 42 | 0 | 85 | 🟡 51% (43/85) |
| qwen-next-80B-thinking | 29 | 56 | 0 | 85 | 🟡 34% (29/85) |
| sonnet-4.6 | 74 | 11 | 0 | 85 | 🟡 87% (74/85) |
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| deepseek-r1-reasoner | 85 | $0.01 | $0.00 | $0.03 | $0.75 |
| deepseek-v3.2-chat | 85 | $0.01 | $0.00 | $0.06 | $0.84 |
| gemini-3.1-pro-preview | 80 | $0.14 | $0.03 | $0.70 | $11.17 |
| gpt-5.3-codex | 85 | $0.03 | $0.00 | $0.07 | $2.39 |
| gpt-5.4 | 85 | $0.08 | $0.02 | $0.17 | $6.66 |
| haiku-4.5 | 85 | $0.06 | $0.02 | $0.13 | $5.01 |
| opus-4.6 | 85 | $0.42 | $0.12 | $4.21 | $35.51 |
| opus-4.7 | 85 | $0.31 | $0.06 | $0.84 | $26.04 |
| qwen-next-80B-instruct | 85 | $0.03 | $0.00 | $0.10 | $2.65 |
| qwen-next-80B-thinking | 85 | $0.03 | $0.00 | $0.11 | $2.24 |
| sonnet-4.6 | 85 | $0.18 | $0.07 | $0.35 | $15.03 |
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| deepseek-r1-reasoner | 37.2 | 3.6 | 141.8 | 30.7 | 103.3 |
| deepseek-v3.2-chat | 35.9 | 4.2 | 275.6 | 27.7 | 85.3 |
| gemini-3.1-pro-preview | 55.4 | 12.3 | 589.1 | 33.9 | 121.1 |
| gpt-5.3-codex | 15.0 | 3.6 | 28.7 | 14.2 | 23.5 |
| gpt-5.4 | 31.8 | 7.3 | 83.8 | 32.4 | 50.7 |
| haiku-4.5 | 25.7 | 4.6 | 55.9 | 25.6 | 42.2 |
| opus-4.6 | 70.6 | 5.9 | 744.1 | 44.2 | 173.4 |
| opus-4.7 | 42.4 | 8.7 | 206.5 | 36.2 | 81.7 |
| qwen-next-80B-instruct | 30.7 | 4.3 | 80.5 | 29.7 | 56.4 |
| qwen-next-80B-thinking | 51.0 | 4.9 | 709.6 | 26.3 | 102.4 |
| sonnet-4.6 | 37.2 | 4.0 | 78.0 | 40.8 | 64.6 |
⚠️ Note: 6 test(s) excluded from latency calculations due to throttling/timeout errors (gemini-3.1-pro-preview: 3, gpt-5.3-codex: 2, qwen-next-80B-thinking: 1)
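As an illustration of how the latency columns above could be derived, here is a minimal sketch that drops throttled or timed-out runs before computing the average, P50, and P95. The `(seconds, status)` record shape, the `"ok"` status value, and the linear-interpolation percentile are assumptions made for this example; the report does not state which percentile method the benchmark harness actually uses.

```python
from statistics import mean


def percentile(values, pct):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no samples")
    rank = (len(ordered) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def latency_summary(runs):
    """Summarize latencies, skipping runs that did not finish normally.

    `runs` is a list of (seconds, status) pairs; the status strings are
    hypothetical and only "ok" runs are kept.
    """
    kept = [seconds for seconds, status in runs if status == "ok"]
    return {
        "avg": mean(kept),
        "min": min(kept),
        "max": max(kept),
        "p50": percentile(kept, 50),
        "p95": percentile(kept, 95),
        "excluded": len(runs) - len(kept),
    }


# Made-up sample data, not taken from the benchmark.
sample_runs = [
    (10.0, "ok"), (20.0, "ok"), (30.0, "ok"),
    (40.0, "ok"), (50.0, "ok"), (999.0, "throttled"),
]
summary = latency_summary(sample_runs)
```

Excluding the throttled run keeps a single 999-second outlier from dominating the average and the maximum, which is the reason for the exclusion note above.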
Performance by Tag
Success rate by test category and model:
| Tag | deepseek-r1-reasoner | deepseek-v3.2-chat | gemini-3.1-pro-preview | gpt-5.3-codex | gpt-5.4 | haiku-4.5 | opus-4.6 | opus-4.7 | qwen-next-80B-instruct | qwen-next-80B-thinking | sonnet-4.6 | Warnings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| benchmark | 🟡 57% (17/30) | 🟡 70% (21/30) | 🟡 57% (17/30) | 🟡 30% (9/30) | 🟡 70% (21/30) | 🟡 40% (12/30) | 🟡 80% (24/30) | 🟡 73% (22/30) | 🟡 40% (12/30) | 🟡 10% (3/30) | 🟡 77% (23/30) | |
| context_window | 🟡 50% (5/10) | 🟡 90% (9/10) | 🟡 70% (7/10) | 🟡 30% (3/10) | 🟢 100% (10/10) | 🟡 20% (2/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟡 40% (4/10) | 🔴 0% (0/10) | 🟢 100% (10/10) | |
| counting | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟡 90% (9/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟡 50% (5/10) | 🟡 80% (8/10) | 🟢 100% (10/10) | |
| datetime | 🟡 60% (9/15) | 🟡 60% (9/15) | 🟡 80% (12/15) | 🟡 53% (8/15) | 🟢 100% (15/15) | 🟡 47% (7/15) | 🟢 100% (15/15) | 🟢 100% (15/15) | 🟡 60% (9/15) | 🟡 33% (5/15) | 🟢 100% (15/15) | |
| easy | 🟡 88% (35/40) | 🟡 78% (31/40) | 🟡 90% (36/40) | 🟡 68% (27/40) | 🟡 80% (32/40) | 🟡 82% (33/40) | 🟡 90% (36/40) | 🟡 90% (36/40) | 🟡 52% (21/40) | 🟡 45% (18/40) | 🟡 90% (36/40) | |
| grafana | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 60% (3/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 20% (1/5) | 🟢 100% (5/5) | |
| hard | 🟡 60% (6/10) | 🟡 50% (5/10) | 🟡 50% (5/10) | 🟡 30% (3/10) | 🟡 50% (5/10) | 🟡 50% (5/10) | 🟡 80% (8/10) | 🟡 50% (5/10) | 🟡 50% (5/10) | 🟡 10% (1/10) | 🟡 70% (7/10) | |
| kubernetes | 🟡 82% (37/45) | 🟡 84% (38/45) | 🟡 84% (38/45) | 🟡 49% (22/45) | 🟡 80% (36/45) | 🟡 73% (33/45) | 🟡 84% (38/45) | 🟡 84% (38/45) | 🟡 60% (27/45) | 🟡 36% (16/45) | 🟡 82% (37/45) | |
| logs | 🟡 43% (13/30) | 🟡 57% (17/30) | 🟡 50% (15/30) | 🟡 27% (8/30) | 🟡 57% (17/30) | 🟡 27% (8/30) | 🟡 70% (21/30) | 🟡 60% (18/30) | 🟡 33% (10/30) | 🟡 10% (3/30) | 🟡 63% (19/30) | |
| loki | 🟡 20% (2/10) | 🟡 30% (3/10) | 🟡 30% (3/10) | 🔴 0% (0/10) | 🟡 20% (2/10) | 🟡 10% (1/10) | 🟡 30% (3/10) | 🟡 30% (3/10) | 🟡 10% (1/10) | 🟡 20% (2/10) | 🟡 20% (2/10) | |
| medium | 🟡 70% (21/30) | 🟡 87% (26/30) | 🟡 73% (22/30) | 🟡 47% (14/30) | 🟡 87% (26/30) | 🟡 57% (17/30) | 🟡 87% (26/30) | 🟡 90% (27/30) | 🟡 40% (12/30) | 🟡 20% (6/30) | 🟡 87% (26/30) | |
| metrics | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 60% (3/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 20% (1/5) | 🟢 100% (5/5) | |
| network | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 20% (1/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 40% (2/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | |
| one-test | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 80% (4/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🔴 0% (0/5) | 🟢 100% (5/5) | |
| port-forward | 🟡 47% (7/15) | 🟡 53% (8/15) | 🟡 53% (8/15) | 🟡 20% (3/15) | 🟡 47% (7/15) | 🟡 40% (6/15) | 🟡 53% (8/15) | 🟡 53% (8/15) | 🟡 40% (6/15) | 🟡 20% (3/15) | 🟡 47% (7/15) | |
| question-answer | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 60% (3/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 20% (1/5) | 🟢 100% (5/5) | |
| regression | 🟡 91% (50/55) | 🟡 84% (46/55) | 🟡 93% (51/55) | 🟡 71% (39/55) | 🟡 85% (47/55) | 🟡 87% (48/55) | 🟡 93% (51/55) | 🟡 93% (51/55) | 🟡 56% (31/55) | 🟡 47% (26/55) | 🟡 93% (51/55) | |
| skills | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 60% (3/5) | 🟡 60% (3/5) | 🟢 100% (5/5) | 🟢 100% (5/5) | 🟡 80% (4/5) | 🟢 100% (5/5) | 🟡 40% (2/5) | 🟡 20% (1/5) | 🟢 100% (5/5) | |
| Overall | 🟡 79% (67/85) | 🟡 79% (67/85) | 🟡 80% (68/85) | 🟡 56% (48/85) | 🟡 80% (68/85) | 🟡 71% (60/85) | 🟡 88% (75/85) | 🟡 86% (73/85) | 🟡 51% (43/85) | 🟡 34% (29/85) | 🟡 87% (74/85) | |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
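The three pass-rate colors in the legend above can be sketched as a small formatting helper. This is an illustrative reconstruction, not the benchmark's actual reporting code: the function name `status_cell` is hypothetical, and the use of Python's built-in `round` is an assumption (it happens to agree with table entries such as 82% for 33/40, but the report does not state its rounding rule).

```python
def status_cell(passed: int, total: int) -> str:
    """Format a pass count as an icon, a percentage, and the raw fraction."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Built-in round() rounds exact halves to the nearest even integer.
    pct = round(100 * passed / total)
    # Choose the icon from the counts so a rounded 100% or 0% cannot mislabel.
    if passed == total:
        icon = "🟢"  # passing 100% (stable)
    elif passed == 0:
        icon = "🔴"  # passing 0% (failing)
    else:
        icon = "🟡"  # passing 1-99%
    return f"{icon} {pct}% ({passed}/{total})"


cells = [status_cell(5, 5), status_cell(33, 40), status_cell(0, 10)]
```

The icon is chosen from the raw counts rather than the rounded percentage, so a result such as 199/200 would still be shown as yellow even though it rounds to 100%.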
Detailed Raw Results
| Eval ID | deepseek-r1-reasoner | deepseek-v3.2-chat | gemini-3.1-pro-preview | gpt-5.3-codex | gpt-5.4 | haiku-4.5 | opus-4.6 | opus-4.7 | qwen-next-80B-instruct | qwen-next-80B-thinking | sonnet-4.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 100% (5/5) / ⏱️ 27.8s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 26.5s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 33.2s / 💰 $0.10 | 🟡 80% (4/5) / ⏱️ 17.5s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 27.3s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 27.3s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 37.7s / 💰 $0.27 | 🟢 100% (5/5) / ⏱️ 41.6s / 💰 $0.26 | 🟢 100% (5/5) / ⏱️ 33.0s / 💰 $0.04 | 🔴 0% (0/5) / ⏱️ 11.8s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 38.8s / 💰 $0.18 |
| 100a_loki_historical_logs 🔗 | 🟡 20% (1/5) / ⏱️ 104.9s / 💰 $0.02 | 🟡 40% (2/5) / ⏱️ 106.0s / 💰 $0.03 | 🟡 40% (2/5) / ⏱️ 323.4s / 💰 $0.28 | 🔴 0% (0/5) / ⏱️ 25.2s / 💰 $0.02 | 🟡 20% (1/5) / ⏱️ 45.9s / 💰 $0.10 | 🔴 0% (0/5) / ⏱️ 35.1s / 💰 $0.08 | 🟡 40% (2/5) / ⏱️ 135.4s / 💰 $0.68 | 🟡 40% (2/5) / ⏱️ 93.8s / 💰 $0.71 | 🟡 20% (1/5) / ⏱️ 34.6s / 💰 $0.04 | 🟡 20% (1/5) / ⏱️ 74.7s / 💰 $0.05 | 🟡 20% (1/5) / ⏱️ 60.4s / 💰 $0.26 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟡 20% (1/5) / ⏱️ 87.1s / 💰 $0.01 | 🟡 20% (1/5) / ⏱️ 139.4s / 💰 $0.03 | 🟡 20% (1/5) / ⏱️ 476.9s / 💰 $0.10 | 🔴 0% (0/5) / ⏱️ 12.8s / 💰 $0.02 | 🟡 20% (1/5) / ⏱️ 44.1s / 💰 $0.09 | 🟡 20% (1/5) / ⏱️ 31.4s / 💰 $0.06 | 🟡 20% (1/5) / ⏱️ 385.1s / 💰 $2.03 | 🟡 20% (1/5) / ⏱️ 81.8s / 💰 $0.45 | 🔴 0% (0/5) / ⏱️ 31.9s / 💰 $0.03 | 🟡 20% (1/5) / ⏱️ 189.6s / 💰 $0.05 | 🟡 20% (1/5) / ⏱️ 50.7s / 💰 $0.21 |
| 108_logs_nearby_lines 🔗 | 🟡 20% (1/5) / ⏱️ 44.6s / 💰 $0.01 | 🔴 0% (0/5) / ⏱️ 34.3s / 💰 $0.01 | 🔴 0% (0/5) / ⏱️ 69.4s / 💰 $0.23 | 🔴 0% (0/5) / ⏱️ 20.0s / 💰 $0.04 | 🔴 0% (0/5) / ⏱️ 40.9s / 💰 $0.11 | 🔴 0% (0/5) / ⏱️ 41.0s / 💰 $0.09 | 🟡 60% (3/5) / ⏱️ 69.0s / 💰 $0.45 | 🔴 0% (0/5) / ⏱️ 60.5s / 💰 $0.39 | 🔴 0% (0/5) / ⏱️ 66.7s / 💰 $0.09 | 🔴 0% (0/5) / ⏱️ 63.0s / 💰 $0.04 | 🟡 40% (2/5) / ⏱️ 44.6s / 💰 $0.20 |
| 112_find_pvcs_by_uuid 🔗 | 🟢 100% (5/5) / ⏱️ 20.7s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 15.8s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 28.6s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 13.2s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 21.5s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 21.1s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 25.1s / 💰 $0.21 | 🟢 100% (5/5) / ⏱️ 22.9s / 💰 $0.18 | 🟡 60% (3/5) / ⏱️ 20.4s / 💰 $0.02 | 🟡 80% (4/5) / ⏱️ 42.3s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 20.8s / 💰 $0.12 |
| 12_job_crashing 🔗 | 🟢 100% (5/5) / ⏱️ 35.4s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 34.9s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 64.4s / 💰 $0.19 | 🟢 100% (5/5) / ⏱️ 18.6s / 💰 $0.03 | 🟡 40% (2/5) / ⏱️ 36.3s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 31.2s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 48.9s / 💰 $0.32 | 🟢 100% (5/5) / ⏱️ 49.7s / 💰 $0.30 | 🔴 0% (0/5) / ⏱️ 26.8s / 💰 $0.02 | 🟡 40% (2/5) / ⏱️ 61.8s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 47.2s / 💰 $0.21 |
| 176_network_policy_blocking_traffic_no_skills 🔗 | 🟢 100% (5/5) / ⏱️ 46.8s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 40.2s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 40.7s / 💰 $0.15 | 🟡 20% (1/5) / ⏱️ 24.1s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 42.9s / 💰 $0.11 | 🟢 100% (5/5) / ⏱️ 39.1s / 💰 $0.08 | 🟢 100% (5/5) / ⏱️ 56.7s / 💰 $0.35 | 🟢 100% (5/5) / ⏱️ 40.6s / 💰 $0.42 | 🟡 40% (2/5) / ⏱️ 44.7s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 84.8s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 44.3s / 💰 $0.21 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 100% (5/5) / ⏱️ 17.6s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 12.7s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 21.9s / 💰 $0.09 | 🟡 60% (3/5) / ⏱️ 12.6s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 17.4s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 23.2s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 23.2s / 💰 $0.24 | 🟢 100% (5/5) / ⏱️ 26.7s / 💰 $0.32 | 🟢 100% (5/5) / ⏱️ 16.1s / 💰 $0.01 | 🟡 20% (1/5) / ⏱️ 46.9s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 20.1s / 💰 $0.13 |
| 227_count_configmaps_per_namespace[0] 🔗 | 🟢 100% (5/5) / ⏱️ 16.7s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 14.7s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 31.8s / 💰 $0.08 | 🟡 80% (4/5) / ⏱️ 17.4s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 22.3s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 17.9s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 23.1s / 💰 $0.21 | 🟢 100% (5/5) / ⏱️ 38.7s / 💰 $0.23 | 🟢 100% (5/5) / ⏱️ 34.3s / 💰 $0.04 | 🟡 80% (4/5) / ⏱️ 306.5s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 22.2s / 💰 $0.13 |
| 243_pod_names_contain_service 🔗 | 🟢 100% (5/5) / ⏱️ 32.9s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 25.7s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 27.0s / 💰 $0.08 | 🟡 60% (3/5) / ⏱️ 16.6s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 31.3s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 27.8s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 38.3s / 💰 $0.26 | 🟢 100% (5/5) / ⏱️ 40.5s / 💰 $0.23 | 🟡 40% (2/5) / ⏱️ 36.8s / 💰 $0.04 | 🔴 0% (0/5) / ⏱️ 7.2s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 40.8s / 💰 $0.18 |
| 24_misconfigured_pvc 🔗 | 🟢 100% (5/5) / ⏱️ 34.7s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 28.1s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 40.6s / 💰 $0.11 | 🟡 40% (2/5) / ⏱️ 14.2s / 💰 $0.02 | 🟡 80% (4/5) / ⏱️ 33.8s / 💰 $0.07 | 🟡 40% (2/5) / ⏱️ 14.5s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 43.4s / 💰 $0.30 | 🟢 100% (5/5) / ⏱️ 32.3s / 💰 $0.32 | 🟡 80% (4/5) / ⏱️ 40.0s / 💰 $0.04 | 🔴 0% (0/5) / ⏱️ 5.7s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 41.3s / 💰 $0.19 |
| 43_current_datetime_from_prompt 🔗 | 🟡 80% (4/5) / ⏱️ 5.9s / 💰 $0.00 | 🔴 0% (0/5) / ⏱️ 4.8s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 22.4s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 4.0s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 9.8s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 5.2s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 6.1s / 💰 $0.12 | 🟢 100% (5/5) / ⏱️ 12.1s / 💰 $0.10 | 🟢 100% (5/5) / ⏱️ 5.0s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 9.4s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 4.9s / 💰 $0.07 |
| 51_logs_summarize_errors 🔗 | 🟢 100% (5/5) / ⏱️ 14.9s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 13.2s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 25.3s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 14.4s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 37.0s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 18.6s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 24.7s / 💰 $0.20 | 🟢 100% (5/5) / ⏱️ 28.5s / 💰 $0.17 | 🟢 100% (5/5) / ⏱️ 21.9s / 💰 $0.02 | 🟡 20% (1/5) / ⏱️ 19.9s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 23.6s / 💰 $0.12 |
| 61_exact_match_counting 🔗 | 🟢 100% (5/5) / ⏱️ 7.8s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 8.4s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 15.4s / 💰 $0.04 | 🟢 100% (5/5) / ⏱️ 9.0s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 9.3s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 10.5s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 12.7s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 15.7s / 💰 $0.09 | 🔴 0% (0/5) / ⏱️ 10.5s / 💰 $0.01 | 🟡 80% (4/5) / ⏱️ 26.1s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 10.9s / 💰 $0.09 |
| 73a_time_window_anomaly 🔗 | 🟡 40% (2/5) / ⏱️ 30.7s / 💰 $0.01 | 🟡 80% (4/5) / ⏱️ 28.0s / 💰 $0.01 | 🟡 80% (4/5) / ⏱️ 34.4s / 💰 $0.11 | 🟡 20% (1/5) / ⏱️ 14.4s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 35.4s / 💰 $0.10 | 🟡 20% (1/5) / ⏱️ 29.2s / 💰 $0.06 | 🟢 100% (5/5) / ⏱️ 70.1s / 💰 $0.38 | 🟢 100% (5/5) / ⏱️ 37.9s / 💰 $0.25 | 🟡 40% (2/5) / ⏱️ 28.0s / 💰 $0.02 | 🔴 0% (0/5) / ⏱️ 13.7s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 51.6s / 💰 $0.21 |
| 73b_time_window_anomaly 🔗 | 🟡 60% (3/5) / ⏱️ 40.0s / 💰 $0.01 | 🟢 100% (5/5) / ⏱️ 29.1s / 💰 $0.01 | 🟡 60% (3/5) / ⏱️ 43.4s / 💰 $0.14 | 🟡 40% (2/5) / ⏱️ 16.4s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 38.8s / 💰 $0.10 | 🟡 20% (1/5) / ⏱️ 22.8s / 💰 $0.05 | 🟢 100% (5/5) / ⏱️ 70.5s / 💰 $0.39 | 🟢 100% (5/5) / ⏱️ 36.0s / 💰 $0.26 | 🟡 40% (2/5) / ⏱️ 25.4s / 💰 $0.02 | 🔴 0% (0/5) / ⏱️ 11.4s / 💰 $0.00 | 🟢 100% (5/5) / ⏱️ 49.9s / 💰 $0.21 |
| 96_no_matching_skill 🔗 | 🟢 100% (5/5) / ⏱️ 63.6s / 💰 $0.02 | 🟢 100% (5/5) / ⏱️ 49.1s / 💰 $0.02 | 🟡 60% (3/5) / ⏱️ 81.6s / 💰 $0.33 | 🟡 60% (3/5) / ⏱️ 48.9s / 💰 $0.03 | 🟢 100% (5/5) / ⏱️ 46.8s / 💰 $0.15 | 🟢 100% (5/5) / ⏱️ 40.5s / 💰 $0.09 | 🟡 80% (4/5) / ⏱️ 129.5s / 💰 $0.55 | 🟢 100% (5/5) / ⏱️ 61.7s / 💰 $0.52 | 🟡 40% (2/5) / ⏱️ 44.9s / 💰 $0.05 | 🟡 20% (1/5) / ⏱️ 87.2s / 💰 $0.07 | 🟢 100% (5/5) / ⏱️ 59.9s / 💰 $0.26 |
Results are automatically generated and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-26358788343.