⚡ May 24, 2026

Generated: 2026-05-24 13:28 UTC
Total Duration: 2h 55m 18s
Iterations: 5
Judge (classifier) model: gpt-4.1

Fast Benchmark

Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes

HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.

If you find scenarios that HolmesGPT handles poorly, please consider adding them to the benchmark as evals.

Model Accuracy Comparison

Model Pass Fail Skip/Error Total Success Rate
deepseek-r1-reasoner 67 18 0 85 🟡 79% (67/85)
deepseek-v3.2-chat 67 18 0 85 🟡 79% (67/85)
gemini-3.1-pro-preview 68 17 0 85 🟡 80% (68/85)
gpt-5.3-codex 48 37 0 85 🟡 56% (48/85)
gpt-5.4 68 17 0 85 🟡 80% (68/85)
haiku-4.5 60 25 0 85 🟡 71% (60/85)
opus-4.6 75 10 0 85 🟡 88% (75/85)
opus-4.7 73 12 0 85 🟡 86% (73/85)
qwen-next-80B-instruct 43 42 0 85 🟡 51% (43/85)
qwen-next-80B-thinking 29 56 0 85 🟡 34% (29/85)
sonnet-4.6 74 11 0 85 🟡 87% (74/85)
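
The Success Rate column appears to be the pass count divided by the total number of runs, rounded to the nearest whole percent. The sketch below reproduces that arithmetic; the function name `success_rate` is hypothetical, and the rounding rule is an assumption inferred from the figures in the table rather than taken from the benchmark's code.

```python
def success_rate(passed: int, failed: int, skipped: int = 0) -> int:
    """Return the pass percentage, rounded to the nearest whole percent.

    Assumption: skipped/errored runs count toward the total, as the
    "Total" column in the table suggests.
    """
    total = passed + failed + skipped
    if total == 0:
        return 0
    return round(100 * passed / total)


# Pass/fail counts copied from the accuracy table.
for model, passed, failed in [
    ("opus-4.6", 75, 10),
    ("gpt-5.3-codex", 48, 37),
    ("qwen-next-80B-thinking", 29, 56),
]:
    print(f"{model}: {success_rate(passed, failed)}%")
```

With 5 iterations per eval, the total of 85 corresponds to the 17 evals listed under Raw Results.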

Model Cost Comparison

Model Tests Avg Cost Min Cost Max Cost Total Cost
deepseek-r1-reasoner 85 $0.01 $0.00 $0.03 $0.75
deepseek-v3.2-chat 85 $0.01 $0.00 $0.06 $0.84
gemini-3.1-pro-preview 80 $0.14 $0.03 $0.70 $11.17
gpt-5.3-codex 85 $0.03 $0.00 $0.07 $2.39
gpt-5.4 85 $0.08 $0.02 $0.17 $6.66
haiku-4.5 85 $0.06 $0.02 $0.13 $5.01
opus-4.6 85 $0.42 $0.12 $4.21 $35.51
opus-4.7 85 $0.31 $0.06 $0.84 $26.04
qwen-next-80B-instruct 85 $0.03 $0.00 $0.10 $2.65
qwen-next-80B-thinking 85 $0.03 $0.00 $0.11 $2.24
sonnet-4.6 85 $0.18 $0.07 $0.35 $15.03
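
The Avg Cost column is consistent with Total Cost divided by the Tests count, rounded to cents. A minimal sketch of that check follows; `average_cost` is a hypothetical helper, and the rounding to two decimals is an assumption based on how the table is displayed.

```python
def average_cost(total_cost: float, tests: int) -> float:
    """Average cost per test in dollars, rounded to cents (assumed display rule)."""
    if tests <= 0:
        raise ValueError("tests must be positive")
    return round(total_cost / tests, 2)


# (total cost, tests) pairs copied from the cost table.
rows = {
    "deepseek-r1-reasoner": (0.75, 85),
    "gemini-3.1-pro-preview": (11.17, 80),
    "opus-4.6": (35.51, 85),
}
for model, (total, tests) in rows.items():
    print(f"{model}: ${average_cost(total, tests):.2f}")
```

Note that gemini-3.1-pro-preview reports 80 tests rather than 85, and its average only matches the table when divided by 80.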

Model Latency Comparison

Model Avg (s) Min (s) Max (s) P50 (s) P95 (s)
deepseek-r1-reasoner 37.2 3.6 141.8 30.7 103.3
deepseek-v3.2-chat 35.9 4.2 275.6 27.7 85.3
gemini-3.1-pro-preview 55.4 12.3 589.1 33.9 121.1
gpt-5.3-codex 15.0 3.6 28.7 14.2 23.5
gpt-5.4 31.8 7.3 83.8 32.4 50.7
haiku-4.5 25.7 4.6 55.9 25.6 42.2
opus-4.6 70.6 5.9 744.1 44.2 173.4
opus-4.7 42.4 8.7 206.5 36.2 81.7
qwen-next-80B-instruct 30.7 4.3 80.5 29.7 56.4
qwen-next-80B-thinking 51.0 4.9 709.6 26.3 102.4
sonnet-4.6 37.2 4.0 78.0 40.8 64.6

⚠️ Note: 6 tests were excluded from latency calculations due to throttling/timeout errors (gemini-3.1-pro-preview: 3, gpt-5.3-codex: 2, qwen-next-80B-thinking: 1).
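
As an illustration of how such a latency summary could be computed, the sketch below drops runs flagged as throttled or timed out before taking the average and percentiles. The report does not state which percentile method it uses, so the nearest-rank definition here is an assumption, and `percentile`, `latency_summary`, and the sample data are all hypothetical.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(runs: list[tuple[float, bool]]) -> dict:
    """Summarize latencies from (seconds, errored) pairs, skipping errored runs."""
    ok = [seconds for seconds, errored in runs if not errored]
    return {
        "avg": round(mean(ok), 1),
        "min": min(ok),
        "max": max(ok),
        "p50": percentile(ok, 50),
        "p95": percentile(ok, 95),
        "excluded": len(runs) - len(ok),
    }


# Made-up sample: four successful runs and one throttled run.
sample = [(10.0, False), (20.0, False), (30.0, False), (40.0, False), (999.0, True)]
print(latency_summary(sample))
```

Excluding errored runs keeps a single timeout from dominating the maximum and the average, at the cost of hiding how often a model was throttled.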

Performance by Tag

Success rate by test category and model:

Tag deepseek-r1-reasoner deepseek-v3.2-chat gemini-3.1-pro-preview gpt-5.3-codex gpt-5.4 haiku-4.5 opus-4.6 opus-4.7 qwen-next-80B-instruct qwen-next-80B-thinking sonnet-4.6 Warnings
benchmark 🟡 57% (17/30) 🟡 70% (21/30) 🟡 57% (17/30) 🟡 30% (9/30) 🟡 70% (21/30) 🟡 40% (12/30) 🟡 80% (24/30) 🟡 73% (22/30) 🟡 40% (12/30) 🟡 10% (3/30) 🟡 77% (23/30)
context_window 🟡 50% (5/10) 🟡 90% (9/10) 🟡 70% (7/10) 🟡 30% (3/10) 🟢 100% (10/10) 🟡 20% (2/10) 🟢 100% (10/10) 🟢 100% (10/10) 🟡 40% (4/10) 🔴 0% (0/10) 🟢 100% (10/10)
counting 🟢 100% (10/10) 🟢 100% (10/10) 🟢 100% (10/10) 🟡 90% (9/10) 🟢 100% (10/10) 🟢 100% (10/10) 🟢 100% (10/10) 🟢 100% (10/10) 🟡 50% (5/10) 🟡 80% (8/10) 🟢 100% (10/10)
datetime 🟡 60% (9/15) 🟡 60% (9/15) 🟡 80% (12/15) 🟡 53% (8/15) 🟢 100% (15/15) 🟡 47% (7/15) 🟢 100% (15/15) 🟢 100% (15/15) 🟡 60% (9/15) 🟡 33% (5/15) 🟢 100% (15/15)
easy 🟡 88% (35/40) 🟡 78% (31/40) 🟡 90% (36/40) 🟡 68% (27/40) 🟡 80% (32/40) 🟡 82% (33/40) 🟡 90% (36/40) 🟡 90% (36/40) 🟡 52% (21/40) 🟡 45% (18/40) 🟡 90% (36/40)
grafana 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟡 60% (3/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟡 20% (1/5) 🟢 100% (5/5)
hard 🟡 60% (6/10) 🟡 50% (5/10) 🟡 50% (5/10) 🟡 30% (3/10) 🟡 50% (5/10) 🟡 50% (5/10) 🟡 80% (8/10) 🟡 50% (5/10) 🟡 50% (5/10) 🟡 10% (1/10) 🟡 70% (7/10)
kubernetes 🟡 82% (37/45) 🟡 84% (38/45) 🟡 84% (38/45) 🟡 49% (22/45) 🟡 80% (36/45) 🟡 73% (33/45) 🟡 84% (38/45) 🟡 84% (38/45) 🟡 60% (27/45) 🟡 36% (16/45) 🟡 82% (37/45)
logs 🟡 43% (13/30) 🟡 57% (17/30) 🟡 50% (15/30) 🟡 27% (8/30) 🟡 57% (17/30) 🟡 27% (8/30) 🟡 70% (21/30) 🟡 60% (18/30) 🟡 33% (10/30) 🟡 10% (3/30) 🟡 63% (19/30)
loki 🟡 20% (2/10) 🟡 30% (3/10) 🟡 30% (3/10) 🔴 0% (0/10) 🟡 20% (2/10) 🟡 10% (1/10) 🟡 30% (3/10) 🟡 30% (3/10) 🟡 10% (1/10) 🟡 20% (2/10) 🟡 20% (2/10)
medium 🟡 70% (21/30) 🟡 87% (26/30) 🟡 73% (22/30) 🟡 47% (14/30) 🟡 87% (26/30) 🟡 57% (17/30) 🟡 87% (26/30) 🟡 90% (27/30) 🟡 40% (12/30) 🟡 20% (6/30) 🟡 87% (26/30)
metrics 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟡 60% (3/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟡 20% (1/5) 🟢 100% (5/5)
network 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟡 20% (1/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟡 40% (2/5) 🟢 100% (5/5) 🟢 100% (5/5)
one-test 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟡 80% (4/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🔴 0% (0/5) 🟢 100% (5/5)
port-forward 🟡 47% (7/15) 🟡 53% (8/15) 🟡 53% (8/15) 🟡 20% (3/15) 🟡 47% (7/15) 🟡 40% (6/15) 🟡 53% (8/15) 🟡 53% (8/15) 🟡 40% (6/15) 🟡 20% (3/15) 🟡 47% (7/15)
question-answer 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟡 60% (3/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟡 20% (1/5) 🟢 100% (5/5)
regression 🟡 91% (50/55) 🟡 84% (46/55) 🟡 93% (51/55) 🟡 71% (39/55) 🟡 85% (47/55) 🟡 87% (48/55) 🟡 93% (51/55) 🟡 93% (51/55) 🟡 56% (31/55) 🟡 47% (26/55) 🟡 93% (51/55)
skills 🟢 100% (5/5) 🟢 100% (5/5) 🟡 60% (3/5) 🟡 60% (3/5) 🟢 100% (5/5) 🟢 100% (5/5) 🟡 80% (4/5) 🟢 100% (5/5) 🟡 40% (2/5) 🟡 20% (1/5) 🟢 100% (5/5)
Overall 🟡 79% (67/85) 🟡 79% (67/85) 🟡 80% (68/85) 🟡 56% (48/85) 🟡 80% (68/85) 🟡 71% (60/85) 🟡 88% (75/85) 🟡 86% (73/85) 🟡 51% (43/85) 🟡 34% (29/85) 🟡 87% (74/85)

Raw Results

Status of all evaluations across models. Color coding:

  • 🟢 Passing 100% (stable)
  • 🟡 Passing 1-99%
  • 🔴 Passing 0% (failing)
  • 🔧 Mock data failure (missing or invalid test data)
  • ⚠️ Setup failure (environment/infrastructure issue)
  • ⏱️ Timeout or rate limit error
  • ⏭️ Test skipped (e.g., known issue or precondition not met)
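
The three pass-rate colors in the legend above can be expressed as a small mapping from pass counts to icons. This is a minimal sketch; `status_icon` is a hypothetical name, and the special statuses (🔧, ⚠️, ⏱️, ⏭️) are deliberately left out because the report does not say how they are detected.

```python
def status_icon(passed: int, total: int) -> str:
    """Map a pass count to the legend's color: all passing, some passing, none passing."""
    if total <= 0:
        raise ValueError("total must be positive")
    if passed == total:
        return "🟢"  # Passing 100% (stable)
    if passed == 0:
        return "🔴"  # Passing 0% (failing)
    return "🟡"  # Passing 1-99%


print(status_icon(5, 5), status_icon(3, 5), status_icon(0, 5))
```
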
Eval ID deepseek-r1-reasoner deepseek-v3.2-chat gemini-3.1-pro-preview gpt-5.3-codex gpt-5.4 haiku-4.5 opus-4.6 opus-4.7 qwen-next-80B-instruct qwen-next-80B-thinking sonnet-4.6
09_crashpod 🔗 🟢 🟢 🟢 🟡 🟢 🟢 🟢 🟢 🟢 🔴 🟢
100a_loki_historical_logs 🔗 🟡 🟡 🟡 🔴 🟡 🔴 🟡 🟡 🟡 🟡 🟡
101_loki_historical_logs_pod_deleted 🔗 🟡 🟡 🟡 🔴 🟡 🟡 🟡 🟡 🔴 🟡 🟡
108_logs_nearby_lines 🔗 🟡 🔴 🔴 🔴 🔴 🔴 🟡 🔴 🔴 🔴 🟡
112_find_pvcs_by_uuid 🔗 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🟡 🟡 🟢
12_job_crashing 🔗 🟢 🟢 🟢 🟢 🟡 🟢 🟢 🟢 🔴 🟡 🟢
176_network_policy_blocking_traffic_no_skills 🔗 🟢 🟢 🟢 🟡 🟢 🟢 🟢 🟢 🟡 🟢 🟢
179_grafana_big_dashboard_query 🔗 🟢 🟢 🟢 🟡 🟢 🟢 🟢 🟢 🟢 🟡 🟢
227_count_configmaps_per_namespace[0] 🔗 🟢 🟢 🟢 🟡 🟢 🟢 🟢 🟢 🟢 🟡 🟢
243_pod_names_contain_service 🔗 🟢 🟢 🟢 🟡 🟢 🟢 🟢 🟢 🟡 🔴 🟢
24_misconfigured_pvc 🔗 🟢 🟢 🟢 🟡 🟡 🟡 🟢 🟢 🟡 🔴 🟢
43_current_datetime_from_prompt 🔗 🟡 🔴 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🟢
51_logs_summarize_errors 🔗 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🟡 🟢
61_exact_match_counting 🔗 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🟢 🔴 🟡 🟢
73a_time_window_anomaly 🔗 🟡 🟡 🟡 🟡 🟢 🟡 🟢 🟢 🟡 🔴 🟢
73b_time_window_anomaly 🔗 🟡 🟢 🟡 🟡 🟢 🟡 🟢 🟢 🟡 🔴 🟢
96_no_matching_skill 🔗 🟢 🟢 🟡 🟡 🟢 🟢 🟡 🟢 🟡 🟡 🟢
SUMMARY 🟡 79% (67/85) 🟡 79% (67/85) 🟡 80% (68/85) 🟡 56% (48/85) 🟡 80% (68/85) 🟡 71% (60/85) 🟡 88% (75/85) 🟡 86% (73/85) 🟡 51% (43/85) 🟡 34% (29/85) 🟡 87% (74/85)

Detailed Raw Results

Eval ID deepseek-r1-reasoner deepseek-v3.2-chat gemini-3.1-pro-preview gpt-5.3-codex gpt-5.4 haiku-4.5 opus-4.6 opus-4.7 qwen-next-80B-instruct qwen-next-80B-thinking sonnet-4.6
09_crashpod 🔗 🟢 100% (5/5) / ⏱️ 27.8s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 26.5s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 33.2s / 💰 $0.10 🟡 80% (4/5) / ⏱️ 17.5s / 💰 $0.03 🟢 100% (5/5) / ⏱️ 27.3s / 💰 $0.06 🟢 100% (5/5) / ⏱️ 27.3s / 💰 $0.06 🟢 100% (5/5) / ⏱️ 37.7s / 💰 $0.27 🟢 100% (5/5) / ⏱️ 41.6s / 💰 $0.26 🟢 100% (5/5) / ⏱️ 33.0s / 💰 $0.04 🔴 0% (0/5) / ⏱️ 11.8s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 38.8s / 💰 $0.18
100a_loki_historical_logs 🔗 🟡 20% (1/5) / ⏱️ 104.9s / 💰 $0.02 🟡 40% (2/5) / ⏱️ 106.0s / 💰 $0.03 🟡 40% (2/5) / ⏱️ 323.4s / 💰 $0.28 🔴 0% (0/5) / ⏱️ 25.2s / 💰 $0.02 🟡 20% (1/5) / ⏱️ 45.9s / 💰 $0.10 🔴 0% (0/5) / ⏱️ 35.1s / 💰 $0.08 🟡 40% (2/5) / ⏱️ 135.4s / 💰 $0.68 🟡 40% (2/5) / ⏱️ 93.8s / 💰 $0.71 🟡 20% (1/5) / ⏱️ 34.6s / 💰 $0.04 🟡 20% (1/5) / ⏱️ 74.7s / 💰 $0.05 🟡 20% (1/5) / ⏱️ 60.4s / 💰 $0.26
101_loki_historical_logs_pod_deleted 🔗 🟡 20% (1/5) / ⏱️ 87.1s / 💰 $0.01 🟡 20% (1/5) / ⏱️ 139.4s / 💰 $0.03 🟡 20% (1/5) / ⏱️ 476.9s / 💰 $0.10 🔴 0% (0/5) / ⏱️ 12.8s / 💰 $0.02 🟡 20% (1/5) / ⏱️ 44.1s / 💰 $0.09 🟡 20% (1/5) / ⏱️ 31.4s / 💰 $0.06 🟡 20% (1/5) / ⏱️ 385.1s / 💰 $2.03 🟡 20% (1/5) / ⏱️ 81.8s / 💰 $0.45 🔴 0% (0/5) / ⏱️ 31.9s / 💰 $0.03 🟡 20% (1/5) / ⏱️ 189.6s / 💰 $0.05 🟡 20% (1/5) / ⏱️ 50.7s / 💰 $0.21
108_logs_nearby_lines 🔗 🟡 20% (1/5) / ⏱️ 44.6s / 💰 $0.01 🔴 0% (0/5) / ⏱️ 34.3s / 💰 $0.01 🔴 0% (0/5) / ⏱️ 69.4s / 💰 $0.23 🔴 0% (0/5) / ⏱️ 20.0s / 💰 $0.04 🔴 0% (0/5) / ⏱️ 40.9s / 💰 $0.11 🔴 0% (0/5) / ⏱️ 41.0s / 💰 $0.09 🟡 60% (3/5) / ⏱️ 69.0s / 💰 $0.45 🔴 0% (0/5) / ⏱️ 60.5s / 💰 $0.39 🔴 0% (0/5) / ⏱️ 66.7s / 💰 $0.09 🔴 0% (0/5) / ⏱️ 63.0s / 💰 $0.04 🟡 40% (2/5) / ⏱️ 44.6s / 💰 $0.20
112_find_pvcs_by_uuid 🔗 🟢 100% (5/5) / ⏱️ 20.7s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 15.8s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 28.6s / 💰 $0.07 🟢 100% (5/5) / ⏱️ 13.2s / 💰 $0.03 🟢 100% (5/5) / ⏱️ 21.5s / 💰 $0.05 🟢 100% (5/5) / ⏱️ 21.1s / 💰 $0.05 🟢 100% (5/5) / ⏱️ 25.1s / 💰 $0.21 🟢 100% (5/5) / ⏱️ 22.9s / 💰 $0.18 🟡 60% (3/5) / ⏱️ 20.4s / 💰 $0.02 🟡 80% (4/5) / ⏱️ 42.3s / 💰 $0.03 🟢 100% (5/5) / ⏱️ 20.8s / 💰 $0.12
12_job_crashing 🔗 🟢 100% (5/5) / ⏱️ 35.4s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 34.9s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 64.4s / 💰 $0.19 🟢 100% (5/5) / ⏱️ 18.6s / 💰 $0.03 🟡 40% (2/5) / ⏱️ 36.3s / 💰 $0.08 🟢 100% (5/5) / ⏱️ 31.2s / 💰 $0.07 🟢 100% (5/5) / ⏱️ 48.9s / 💰 $0.32 🟢 100% (5/5) / ⏱️ 49.7s / 💰 $0.30 🔴 0% (0/5) / ⏱️ 26.8s / 💰 $0.02 🟡 40% (2/5) / ⏱️ 61.8s / 💰 $0.04 🟢 100% (5/5) / ⏱️ 47.2s / 💰 $0.21
176_network_policy_blocking_traffic_no_skills 🔗 🟢 100% (5/5) / ⏱️ 46.8s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 40.2s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 40.7s / 💰 $0.15 🟡 20% (1/5) / ⏱️ 24.1s / 💰 $0.06 🟢 100% (5/5) / ⏱️ 42.9s / 💰 $0.11 🟢 100% (5/5) / ⏱️ 39.1s / 💰 $0.08 🟢 100% (5/5) / ⏱️ 56.7s / 💰 $0.35 🟢 100% (5/5) / ⏱️ 40.6s / 💰 $0.42 🟡 40% (2/5) / ⏱️ 44.7s / 💰 $0.04 🟢 100% (5/5) / ⏱️ 84.8s / 💰 $0.07 🟢 100% (5/5) / ⏱️ 44.3s / 💰 $0.21
179_grafana_big_dashboard_query 🔗 🟢 100% (5/5) / ⏱️ 17.6s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 12.7s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 21.9s / 💰 $0.09 🟡 60% (3/5) / ⏱️ 12.6s / 💰 $0.05 🟢 100% (5/5) / ⏱️ 17.4s / 💰 $0.07 🟢 100% (5/5) / ⏱️ 23.2s / 💰 $0.07 🟢 100% (5/5) / ⏱️ 23.2s / 💰 $0.24 🟢 100% (5/5) / ⏱️ 26.7s / 💰 $0.32 🟢 100% (5/5) / ⏱️ 16.1s / 💰 $0.01 🟡 20% (1/5) / ⏱️ 46.9s / 💰 $0.03 🟢 100% (5/5) / ⏱️ 20.1s / 💰 $0.13
227_count_configmaps_per_namespace[0] 🔗 🟢 100% (5/5) / ⏱️ 16.7s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 14.7s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 31.8s / 💰 $0.08 🟡 80% (4/5) / ⏱️ 17.4s / 💰 $0.03 🟢 100% (5/5) / ⏱️ 22.3s / 💰 $0.06 🟢 100% (5/5) / ⏱️ 17.9s / 💰 $0.04 🟢 100% (5/5) / ⏱️ 23.1s / 💰 $0.21 🟢 100% (5/5) / ⏱️ 38.7s / 💰 $0.23 🟢 100% (5/5) / ⏱️ 34.3s / 💰 $0.04 🟡 80% (4/5) / ⏱️ 306.5s / 💰 $0.03 🟢 100% (5/5) / ⏱️ 22.2s / 💰 $0.13
243_pod_names_contain_service 🔗 🟢 100% (5/5) / ⏱️ 32.9s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 25.7s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 27.0s / 💰 $0.08 🟡 60% (3/5) / ⏱️ 16.6s / 💰 $0.02 🟢 100% (5/5) / ⏱️ 31.3s / 💰 $0.07 🟢 100% (5/5) / ⏱️ 27.8s / 💰 $0.06 🟢 100% (5/5) / ⏱️ 38.3s / 💰 $0.26 🟢 100% (5/5) / ⏱️ 40.5s / 💰 $0.23 🟡 40% (2/5) / ⏱️ 36.8s / 💰 $0.04 🔴 0% (0/5) / ⏱️ 7.2s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 40.8s / 💰 $0.18
24_misconfigured_pvc 🔗 🟢 100% (5/5) / ⏱️ 34.7s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 28.1s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 40.6s / 💰 $0.11 🟡 40% (2/5) / ⏱️ 14.2s / 💰 $0.02 🟡 80% (4/5) / ⏱️ 33.8s / 💰 $0.07 🟡 40% (2/5) / ⏱️ 14.5s / 💰 $0.04 🟢 100% (5/5) / ⏱️ 43.4s / 💰 $0.30 🟢 100% (5/5) / ⏱️ 32.3s / 💰 $0.32 🟡 80% (4/5) / ⏱️ 40.0s / 💰 $0.04 🔴 0% (0/5) / ⏱️ 5.7s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 41.3s / 💰 $0.19
43_current_datetime_from_prompt 🔗 🟡 80% (4/5) / ⏱️ 5.9s / 💰 $0.00 🔴 0% (0/5) / ⏱️ 4.8s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 22.4s / 💰 $0.06 🟢 100% (5/5) / ⏱️ 4.0s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 9.8s / 💰 $0.02 🟢 100% (5/5) / ⏱️ 5.2s / 💰 $0.02 🟢 100% (5/5) / ⏱️ 6.1s / 💰 $0.12 🟢 100% (5/5) / ⏱️ 12.1s / 💰 $0.10 🟢 100% (5/5) / ⏱️ 5.0s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 9.4s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 4.9s / 💰 $0.07
51_logs_summarize_errors 🔗 🟢 100% (5/5) / ⏱️ 14.9s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 13.2s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 25.3s / 💰 $0.06 🟢 100% (5/5) / ⏱️ 14.4s / 💰 $0.03 🟢 100% (5/5) / ⏱️ 37.0s / 💰 $0.05 🟢 100% (5/5) / ⏱️ 18.6s / 💰 $0.04 🟢 100% (5/5) / ⏱️ 24.7s / 💰 $0.20 🟢 100% (5/5) / ⏱️ 28.5s / 💰 $0.17 🟢 100% (5/5) / ⏱️ 21.9s / 💰 $0.02 🟡 20% (1/5) / ⏱️ 19.9s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 23.6s / 💰 $0.12
61_exact_match_counting 🔗 🟢 100% (5/5) / ⏱️ 7.8s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 8.4s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 15.4s / 💰 $0.04 🟢 100% (5/5) / ⏱️ 9.0s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 9.3s / 💰 $0.02 🟢 100% (5/5) / ⏱️ 10.5s / 💰 $0.03 🟢 100% (5/5) / ⏱️ 12.7s / 💰 $0.15 🟢 100% (5/5) / ⏱️ 15.7s / 💰 $0.09 🔴 0% (0/5) / ⏱️ 10.5s / 💰 $0.01 🟡 80% (4/5) / ⏱️ 26.1s / 💰 $0.02 🟢 100% (5/5) / ⏱️ 10.9s / 💰 $0.09
73a_time_window_anomaly 🔗 🟡 40% (2/5) / ⏱️ 30.7s / 💰 $0.01 🟡 80% (4/5) / ⏱️ 28.0s / 💰 $0.01 🟡 80% (4/5) / ⏱️ 34.4s / 💰 $0.11 🟡 20% (1/5) / ⏱️ 14.4s / 💰 $0.02 🟢 100% (5/5) / ⏱️ 35.4s / 💰 $0.10 🟡 20% (1/5) / ⏱️ 29.2s / 💰 $0.06 🟢 100% (5/5) / ⏱️ 70.1s / 💰 $0.38 🟢 100% (5/5) / ⏱️ 37.9s / 💰 $0.25 🟡 40% (2/5) / ⏱️ 28.0s / 💰 $0.02 🔴 0% (0/5) / ⏱️ 13.7s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 51.6s / 💰 $0.21
73b_time_window_anomaly 🔗 🟡 60% (3/5) / ⏱️ 40.0s / 💰 $0.01 🟢 100% (5/5) / ⏱️ 29.1s / 💰 $0.01 🟡 60% (3/5) / ⏱️ 43.4s / 💰 $0.14 🟡 40% (2/5) / ⏱️ 16.4s / 💰 $0.03 🟢 100% (5/5) / ⏱️ 38.8s / 💰 $0.10 🟡 20% (1/5) / ⏱️ 22.8s / 💰 $0.05 🟢 100% (5/5) / ⏱️ 70.5s / 💰 $0.39 🟢 100% (5/5) / ⏱️ 36.0s / 💰 $0.26 🟡 40% (2/5) / ⏱️ 25.4s / 💰 $0.02 🔴 0% (0/5) / ⏱️ 11.4s / 💰 $0.00 🟢 100% (5/5) / ⏱️ 49.9s / 💰 $0.21
96_no_matching_skill 🔗 🟢 100% (5/5) / ⏱️ 63.6s / 💰 $0.02 🟢 100% (5/5) / ⏱️ 49.1s / 💰 $0.02 🟡 60% (3/5) / ⏱️ 81.6s / 💰 $0.33 🟡 60% (3/5) / ⏱️ 48.9s / 💰 $0.03 🟢 100% (5/5) / ⏱️ 46.8s / 💰 $0.15 🟢 100% (5/5) / ⏱️ 40.5s / 💰 $0.09 🟡 80% (4/5) / ⏱️ 129.5s / 💰 $0.55 🟢 100% (5/5) / ⏱️ 61.7s / 💰 $0.52 🟡 40% (2/5) / ⏱️ 44.9s / 💰 $0.05 🟡 20% (1/5) / ⏱️ 87.2s / 💰 $0.07 🟢 100% (5/5) / ⏱️ 59.9s / 💰 $0.26

Results are generated automatically and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-26358788343.