⚡ May 24, 2026

Generated: 2026-05-24 13:28 UTC
Total Duration: 2h 55m 18s
Iterations: 5
Judge (classifier) model: gpt-4.1

Fast Benchmark

Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes

HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.

If you find scenarios where HolmesGPT performs poorly, please consider adding them to the benchmark as evals.

Model Accuracy Comparison

Model	Pass	Fail	Total	Success Rate
deepseek-r1-reasoner	67	18	85	🟡 79% (67/85)
deepseek-v3.2-chat	67	18	85	🟡 79% (67/85)
gemini-3.1-pro-preview	68	17	85	🟡 80% (68/85)
gpt-5.3-codex	48	37	85	🟡 56% (48/85)
gpt-5.4	68	17	85	🟡 80% (68/85)
haiku-4.5	60	25	85	🟡 71% (60/85)
opus-4.6	75	10	85	🟡 88% (75/85)
opus-4.7	73	12	85	🟡 86% (73/85)
qwen-next-80B-instruct	43	42	85	🟡 51% (43/85)
qwen-next-80B-thinking	29	56	85	🟡 34% (29/85)
sonnet-4.6	74	11	85	🟡 87% (74/85)
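The Success Rate column can be reproduced from the Pass and Fail counts. The sketch below is a minimal illustration, not the report generator's actual code; the function name `success_rate` is hypothetical. It relies on Python's built-in `round`, which rounds halves to the nearest even integer. Entries elsewhere in this report such as 82% (33/40) and 88% (35/40) are consistent with that behavior, but that the generator rounds this way is an assumption.

```python
def success_rate(passed: int, failed: int) -> str:
    """Format a pass rate the way the Success Rate column shows it.

    Hypothetical helper: the rounding mode (round-half-to-even via
    Python's built-in round) is an assumption, not confirmed by the report.
    """
    total = passed + failed
    if total == 0:
        raise ValueError("no test runs recorded")
    pct = round(100 * passed / total)
    return f"{pct}% ({passed}/{total})"


if __name__ == "__main__":
    # Pass/fail counts taken from the table above.
    for model, passed, failed in [
        ("opus-4.6", 75, 10),
        ("gpt-5.3-codex", 48, 37),
        ("qwen-next-80B-thinking", 29, 56),
    ]:
        print(model, success_rate(passed, failed))
```

Each model is scored over 85 runs, which matches 17 evals times the 5 iterations listed in the header.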

Model Cost Comparison

Model	Tests	Avg Cost	Min Cost	Max Cost	Total Cost
deepseek-r1-reasoner	85	$0.01	$0.00	$0.03	$0.75
deepseek-v3.2-chat	85	$0.01	$0.00	$0.06	$0.84
gemini-3.1-pro-preview	80	$0.14	$0.03	$0.70	$11.17
gpt-5.3-codex	85	$0.03	$0.00	$0.07	$2.39
gpt-5.4	85	$0.08	$0.02	$0.17	$6.66
haiku-4.5	85	$0.06	$0.02	$0.13	$5.01
opus-4.6	85	$0.42	$0.12	$4.21	$35.51
opus-4.7	85	$0.31	$0.06	$0.84	$26.04
qwen-next-80B-instruct	85	$0.03	$0.00	$0.10	$2.65
qwen-next-80B-thinking	85	$0.03	$0.00	$0.11	$2.24
sonnet-4.6	85	$0.18	$0.07	$0.35	$15.03
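The Avg Cost column appears to be Total Cost divided by Tests, rounded to cents. The sketch below checks that relationship for a few rows; the helper name `average_cost` is hypothetical, and because the published totals are themselves rounded, the check is approximate rather than exact.

```python
def average_cost(total_cost: float, tests: int) -> float:
    """Average cost per test in dollars, rounded to cents.

    Hypothetical helper; assumes Avg Cost = Total Cost / Tests.
    """
    if tests <= 0:
        raise ValueError("tests must be positive")
    return round(total_cost / tests, 2)


if __name__ == "__main__":
    # (model, tests, total cost) copied from the table above.
    rows = [
        ("gemini-3.1-pro-preview", 80, 11.17),
        ("opus-4.6", 85, 35.51),
        ("sonnet-4.6", 85, 15.03),
    ]
    for model, tests, total in rows:
        print(f"{model}: ${average_cost(total, tests):.2f}")
```

Note that gemini-3.1-pro-preview is averaged over 80 tests rather than 85, so its average should be computed against its own Tests value.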

Model Latency Comparison

Model	Avg (s)	Min (s)	Max (s)	P50 (s)	P95 (s)
deepseek-r1-reasoner	37.2	3.6	141.8	30.7	103.3
deepseek-v3.2-chat	35.9	4.2	275.6	27.7	85.3
gemini-3.1-pro-preview	55.4	12.3	589.1	33.9	121.1
gpt-5.3-codex	15.0	3.6	28.7	14.2	23.5
gpt-5.4	31.8	7.3	83.8	32.4	50.7
haiku-4.5	25.7	4.6	55.9	25.6	42.2
opus-4.6	70.6	5.9	744.1	44.2	173.4
opus-4.7	42.4	8.7	206.5	36.2	81.7
qwen-next-80B-instruct	30.7	4.3	80.5	29.7	56.4
qwen-next-80B-thinking	51.0	4.9	709.6	26.3	102.4
sonnet-4.6	37.2	4.0	78.0	40.8	64.6

⚠️ Note: 6 tests were excluded from the latency calculations because of throttling or timeout errors (gemini-3.1-pro-preview: 3, gpt-5.3-codex: 2, qwen-next-80B-thinking: 1).
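As an illustration of how P50 and P95 can be derived once throttled or timed-out runs are dropped, here is a minimal sketch using the nearest-rank method. The report does not state which percentile method it uses, so nearest-rank is an assumption, the function names are hypothetical, and the sample latencies are invented for the example.

```python
import math
from typing import Optional, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(runs: Sequence[Optional[float]]) -> dict:
    """Summarize latencies in seconds; None marks a run excluded for throttling/timeout."""
    valid = [t for t in runs if t is not None]
    return {
        "excluded": len(runs) - len(valid),
        "avg": round(sum(valid) / len(valid), 1),
        "p50": percentile(valid, 50),
        "p95": percentile(valid, 95),
    }


if __name__ == "__main__":
    # Hypothetical latencies; None stands for an excluded run.
    print(latency_summary([12.0, None, 20.0, 16.0, 30.0, None, 22.0]))
```

Excluding failed runs before computing percentiles keeps a single timeout from dominating the Max and P95 columns, at the cost of making the sample size differ between models.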

Performance by Tag

Success rate by test category and model:

Tag	deepseek-r1-reasoner	deepseek-v3.2-chat	gemini-3.1-pro-preview	gpt-5.3-codex	gpt-5.4	haiku-4.5	opus-4.6	opus-4.7	qwen-next-80B-instruct	qwen-next-80B-thinking	sonnet-4.6
benchmark	🟡 57% (17/30)	🟡 70% (21/30)	🟡 57% (17/30)	🟡 30% (9/30)	🟡 70% (21/30)	🟡 40% (12/30)	🟡 80% (24/30)	🟡 73% (22/30)	🟡 40% (12/30)	🟡 10% (3/30)	🟡 77% (23/30)
context_window	🟡 50% (5/10)	🟡 90% (9/10)	🟡 70% (7/10)	🟡 30% (3/10)	🟢 100% (10/10)	🟡 20% (2/10)	🟢 100% (10/10)	🟢 100% (10/10)	🟡 40% (4/10)	🔴 0% (0/10)	🟢 100% (10/10)
counting	🟢 100% (10/10)	🟢 100% (10/10)	🟢 100% (10/10)	🟡 90% (9/10)	🟢 100% (10/10)	🟢 100% (10/10)	🟢 100% (10/10)	🟢 100% (10/10)	🟡 50% (5/10)	🟡 80% (8/10)	🟢 100% (10/10)
datetime	🟡 60% (9/15)	🟡 60% (9/15)	🟡 80% (12/15)	🟡 53% (8/15)	🟢 100% (15/15)	🟡 47% (7/15)	🟢 100% (15/15)	🟢 100% (15/15)	🟡 60% (9/15)	🟡 33% (5/15)	🟢 100% (15/15)
easy	🟡 88% (35/40)	🟡 78% (31/40)	🟡 90% (36/40)	🟡 68% (27/40)	🟡 80% (32/40)	🟡 82% (33/40)	🟡 90% (36/40)	🟡 90% (36/40)	🟡 52% (21/40)	🟡 45% (18/40)	🟡 90% (36/40)
grafana	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟡 60% (3/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟡 20% (1/5)	🟢 100% (5/5)
hard	🟡 60% (6/10)	🟡 50% (5/10)	🟡 50% (5/10)	🟡 30% (3/10)	🟡 50% (5/10)	🟡 50% (5/10)	🟡 80% (8/10)	🟡 50% (5/10)	🟡 50% (5/10)	🟡 10% (1/10)	🟡 70% (7/10)
kubernetes	🟡 82% (37/45)	🟡 84% (38/45)	🟡 84% (38/45)	🟡 49% (22/45)	🟡 80% (36/45)	🟡 73% (33/45)	🟡 84% (38/45)	🟡 84% (38/45)	🟡 60% (27/45)	🟡 36% (16/45)	🟡 82% (37/45)
logs	🟡 43% (13/30)	🟡 57% (17/30)	🟡 50% (15/30)	🟡 27% (8/30)	🟡 57% (17/30)	🟡 27% (8/30)	🟡 70% (21/30)	🟡 60% (18/30)	🟡 33% (10/30)	🟡 10% (3/30)	🟡 63% (19/30)
loki	🟡 20% (2/10)	🟡 30% (3/10)	🟡 30% (3/10)	🔴 0% (0/10)	🟡 20% (2/10)	🟡 10% (1/10)	🟡 30% (3/10)	🟡 30% (3/10)	🟡 10% (1/10)	🟡 20% (2/10)	🟡 20% (2/10)
medium	🟡 70% (21/30)	🟡 87% (26/30)	🟡 73% (22/30)	🟡 47% (14/30)	🟡 87% (26/30)	🟡 57% (17/30)	🟡 87% (26/30)	🟡 90% (27/30)	🟡 40% (12/30)	🟡 20% (6/30)	🟡 87% (26/30)
metrics	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟡 60% (3/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟡 20% (1/5)	🟢 100% (5/5)
network	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟡 20% (1/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟡 40% (2/5)	🟢 100% (5/5)	🟢 100% (5/5)
one-test	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟡 80% (4/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🔴 0% (0/5)	🟢 100% (5/5)
port-forward	🟡 47% (7/15)	🟡 53% (8/15)	🟡 53% (8/15)	🟡 20% (3/15)	🟡 47% (7/15)	🟡 40% (6/15)	🟡 53% (8/15)	🟡 53% (8/15)	🟡 40% (6/15)	🟡 20% (3/15)	🟡 47% (7/15)
question-answer	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟡 60% (3/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟡 20% (1/5)	🟢 100% (5/5)
regression	🟡 91% (50/55)	🟡 84% (46/55)	🟡 93% (51/55)	🟡 71% (39/55)	🟡 85% (47/55)	🟡 87% (48/55)	🟡 93% (51/55)	🟡 93% (51/55)	🟡 56% (31/55)	🟡 47% (26/55)	🟡 93% (51/55)
skills	🟢 100% (5/5)	🟢 100% (5/5)	🟡 60% (3/5)	🟡 60% (3/5)	🟢 100% (5/5)	🟢 100% (5/5)	🟡 80% (4/5)	🟢 100% (5/5)	🟡 40% (2/5)	🟡 20% (1/5)	🟢 100% (5/5)
Overall	🟡 79% (67/85)	🟡 79% (67/85)	🟡 80% (68/85)	🟡 56% (48/85)	🟡 80% (68/85)	🟡 71% (60/85)	🟡 88% (75/85)	🟡 86% (73/85)	🟡 51% (43/85)	🟡 34% (29/85)	🟡 87% (74/85)

Raw Results

Status of all evaluations across models. Color coding:

🟢 Passing 100% (stable)
🟡 Passing 1-99%
🔴 Passing 0% (failing)
🔧 Mock data failure (missing or invalid test data)
⚠️ Setup failure (environment/infrastructure issue)
⏱️ Timeout or rate limit error
⏭️ Test skipped (e.g., known issue or precondition not met)
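The three pass-rate colors in the legend above can be expressed as a small mapping from pass counts to an icon. This is a sketch of the stated thresholds only; the function name `status_icon` is hypothetical, and the non-pass-rate markers (mock data failure, setup failure, timeout, skipped) are not modeled here.

```python
def status_icon(passed: int, total: int) -> str:
    """Map a pass count to the legend's color icon.

    Hypothetical helper covering only the pass-rate colors:
    green = 100%, red = 0%, yellow = anything in between.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    if passed == total:
        return "🟢"
    if passed == 0:
        return "🔴"
    return "🟡"


if __name__ == "__main__":
    for passed in (5, 3, 0):
        print(f"{passed}/5 -> {status_icon(passed, 5)}")
```

Under these thresholds a single failed iteration out of five is enough to turn an eval yellow, which is why most aggregate rows in this report are yellow even for the strongest models.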

Eval ID	deepseek-r1-reasoner	deepseek-v3.2-chat	gemini-3.1-pro-preview	gpt-5.3-codex	gpt-5.4	haiku-4.5	opus-4.6	opus-4.7	qwen-next-80B-instruct	qwen-next-80B-thinking	sonnet-4.6
09_crashpod 🔗	🟢	🟢	🟢	🟡	🟢	🟢	🟢	🟢	🟢	🔴	🟢
100a_loki_historical_logs 🔗	🟡	🟡	🟡	🔴	🟡	🔴	🟡	🟡	🟡	🟡	🟡
101_loki_historical_logs_pod_deleted 🔗	🟡	🟡	🟡	🔴	🟡	🟡	🟡	🟡	🔴	🟡	🟡
108_logs_nearby_lines 🔗	🟡	🔴	🔴	🔴	🔴	🔴	🟡	🔴	🔴	🔴	🟡
112_find_pvcs_by_uuid 🔗	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🟡	🟡	🟢
12_job_crashing 🔗	🟢	🟢	🟢	🟢	🟡	🟢	🟢	🟢	🔴	🟡	🟢
176_network_policy_blocking_traffic_no_skills 🔗	🟢	🟢	🟢	🟡	🟢	🟢	🟢	🟢	🟡	🟢	🟢
179_grafana_big_dashboard_query 🔗	🟢	🟢	🟢	🟡	🟢	🟢	🟢	🟢	🟢	🟡	🟢
227_count_configmaps_per_namespace[0] 🔗	🟢	🟢	🟢	🟡	🟢	🟢	🟢	🟢	🟢	🟡	🟢
243_pod_names_contain_service 🔗	🟢	🟢	🟢	🟡	🟢	🟢	🟢	🟢	🟡	🔴	🟢
24_misconfigured_pvc 🔗	🟢	🟢	🟢	🟡	🟡	🟡	🟢	🟢	🟡	🔴	🟢
43_current_datetime_from_prompt 🔗	🟡	🔴	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🟢
51_logs_summarize_errors 🔗	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🟡	🟢
61_exact_match_counting 🔗	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🟢	🔴	🟡	🟢
73a_time_window_anomaly 🔗	🟡	🟡	🟡	🟡	🟢	🟡	🟢	🟢	🟡	🔴	🟢
73b_time_window_anomaly 🔗	🟡	🟢	🟡	🟡	🟢	🟡	🟢	🟢	🟡	🔴	🟢
96_no_matching_skill 🔗	🟢	🟢	🟡	🟡	🟢	🟢	🟡	🟢	🟡	🟡	🟢
SUMMARY	🟡 79% (67/85)	🟡 79% (67/85)	🟡 80% (68/85)	🟡 56% (48/85)	🟡 80% (68/85)	🟡 71% (60/85)	🟡 88% (75/85)	🟡 86% (73/85)	🟡 51% (43/85)	🟡 34% (29/85)	🟡 87% (74/85)

Detailed Raw Results

Eval ID	deepseek-r1-reasoner	deepseek-v3.2-chat	gemini-3.1-pro-preview	gpt-5.3-codex	gpt-5.4	haiku-4.5	opus-4.6	opus-4.7	qwen-next-80B-instruct	qwen-next-80B-thinking	sonnet-4.6
09_crashpod 🔗	🟢 100% (5/5) / ⏱️ 27.8s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 26.5s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 33.2s / 💰 $0.10	🟡 80% (4/5) / ⏱️ 17.5s / 💰 $0.03	🟢 100% (5/5) / ⏱️ 27.3s / 💰 $0.06	🟢 100% (5/5) / ⏱️ 27.3s / 💰 $0.06	🟢 100% (5/5) / ⏱️ 37.7s / 💰 $0.27	🟢 100% (5/5) / ⏱️ 41.6s / 💰 $0.26	🟢 100% (5/5) / ⏱️ 33.0s / 💰 $0.04	🔴 0% (0/5) / ⏱️ 11.8s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 38.8s / 💰 $0.18
100a_loki_historical_logs 🔗	🟡 20% (1/5) / ⏱️ 104.9s / 💰 $0.02	🟡 40% (2/5) / ⏱️ 106.0s / 💰 $0.03	🟡 40% (2/5) / ⏱️ 323.4s / 💰 $0.28	🔴 0% (0/5) / ⏱️ 25.2s / 💰 $0.02	🟡 20% (1/5) / ⏱️ 45.9s / 💰 $0.10	🔴 0% (0/5) / ⏱️ 35.1s / 💰 $0.08	🟡 40% (2/5) / ⏱️ 135.4s / 💰 $0.68	🟡 40% (2/5) / ⏱️ 93.8s / 💰 $0.71	🟡 20% (1/5) / ⏱️ 34.6s / 💰 $0.04	🟡 20% (1/5) / ⏱️ 74.7s / 💰 $0.05	🟡 20% (1/5) / ⏱️ 60.4s / 💰 $0.26
101_loki_historical_logs_pod_deleted 🔗	🟡 20% (1/5) / ⏱️ 87.1s / 💰 $0.01	🟡 20% (1/5) / ⏱️ 139.4s / 💰 $0.03	🟡 20% (1/5) / ⏱️ 476.9s / 💰 $0.10	🔴 0% (0/5) / ⏱️ 12.8s / 💰 $0.02	🟡 20% (1/5) / ⏱️ 44.1s / 💰 $0.09	🟡 20% (1/5) / ⏱️ 31.4s / 💰 $0.06	🟡 20% (1/5) / ⏱️ 385.1s / 💰 $2.03	🟡 20% (1/5) / ⏱️ 81.8s / 💰 $0.45	🔴 0% (0/5) / ⏱️ 31.9s / 💰 $0.03	🟡 20% (1/5) / ⏱️ 189.6s / 💰 $0.05	🟡 20% (1/5) / ⏱️ 50.7s / 💰 $0.21
108_logs_nearby_lines 🔗	🟡 20% (1/5) / ⏱️ 44.6s / 💰 $0.01	🔴 0% (0/5) / ⏱️ 34.3s / 💰 $0.01	🔴 0% (0/5) / ⏱️ 69.4s / 💰 $0.23	🔴 0% (0/5) / ⏱️ 20.0s / 💰 $0.04	🔴 0% (0/5) / ⏱️ 40.9s / 💰 $0.11	🔴 0% (0/5) / ⏱️ 41.0s / 💰 $0.09	🟡 60% (3/5) / ⏱️ 69.0s / 💰 $0.45	🔴 0% (0/5) / ⏱️ 60.5s / 💰 $0.39	🔴 0% (0/5) / ⏱️ 66.7s / 💰 $0.09	🔴 0% (0/5) / ⏱️ 63.0s / 💰 $0.04	🟡 40% (2/5) / ⏱️ 44.6s / 💰 $0.20
112_find_pvcs_by_uuid 🔗	🟢 100% (5/5) / ⏱️ 20.7s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 15.8s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 28.6s / 💰 $0.07	🟢 100% (5/5) / ⏱️ 13.2s / 💰 $0.03	🟢 100% (5/5) / ⏱️ 21.5s / 💰 $0.05	🟢 100% (5/5) / ⏱️ 21.1s / 💰 $0.05	🟢 100% (5/5) / ⏱️ 25.1s / 💰 $0.21	🟢 100% (5/5) / ⏱️ 22.9s / 💰 $0.18	🟡 60% (3/5) / ⏱️ 20.4s / 💰 $0.02	🟡 80% (4/5) / ⏱️ 42.3s / 💰 $0.03	🟢 100% (5/5) / ⏱️ 20.8s / 💰 $0.12
12_job_crashing 🔗	🟢 100% (5/5) / ⏱️ 35.4s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 34.9s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 64.4s / 💰 $0.19	🟢 100% (5/5) / ⏱️ 18.6s / 💰 $0.03	🟡 40% (2/5) / ⏱️ 36.3s / 💰 $0.08	🟢 100% (5/5) / ⏱️ 31.2s / 💰 $0.07	🟢 100% (5/5) / ⏱️ 48.9s / 💰 $0.32	🟢 100% (5/5) / ⏱️ 49.7s / 💰 $0.30	🔴 0% (0/5) / ⏱️ 26.8s / 💰 $0.02	🟡 40% (2/5) / ⏱️ 61.8s / 💰 $0.04	🟢 100% (5/5) / ⏱️ 47.2s / 💰 $0.21
176_network_policy_blocking_traffic_no_skills 🔗	🟢 100% (5/5) / ⏱️ 46.8s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 40.2s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 40.7s / 💰 $0.15	🟡 20% (1/5) / ⏱️ 24.1s / 💰 $0.06	🟢 100% (5/5) / ⏱️ 42.9s / 💰 $0.11	🟢 100% (5/5) / ⏱️ 39.1s / 💰 $0.08	🟢 100% (5/5) / ⏱️ 56.7s / 💰 $0.35	🟢 100% (5/5) / ⏱️ 40.6s / 💰 $0.42	🟡 40% (2/5) / ⏱️ 44.7s / 💰 $0.04	🟢 100% (5/5) / ⏱️ 84.8s / 💰 $0.07	🟢 100% (5/5) / ⏱️ 44.3s / 💰 $0.21
179_grafana_big_dashboard_query 🔗	🟢 100% (5/5) / ⏱️ 17.6s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 12.7s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 21.9s / 💰 $0.09	🟡 60% (3/5) / ⏱️ 12.6s / 💰 $0.05	🟢 100% (5/5) / ⏱️ 17.4s / 💰 $0.07	🟢 100% (5/5) / ⏱️ 23.2s / 💰 $0.07	🟢 100% (5/5) / ⏱️ 23.2s / 💰 $0.24	🟢 100% (5/5) / ⏱️ 26.7s / 💰 $0.32	🟢 100% (5/5) / ⏱️ 16.1s / 💰 $0.01	🟡 20% (1/5) / ⏱️ 46.9s / 💰 $0.03	🟢 100% (5/5) / ⏱️ 20.1s / 💰 $0.13
227_count_configmaps_per_namespace[0] 🔗	🟢 100% (5/5) / ⏱️ 16.7s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 14.7s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 31.8s / 💰 $0.08	🟡 80% (4/5) / ⏱️ 17.4s / 💰 $0.03	🟢 100% (5/5) / ⏱️ 22.3s / 💰 $0.06	🟢 100% (5/5) / ⏱️ 17.9s / 💰 $0.04	🟢 100% (5/5) / ⏱️ 23.1s / 💰 $0.21	🟢 100% (5/5) / ⏱️ 38.7s / 💰 $0.23	🟢 100% (5/5) / ⏱️ 34.3s / 💰 $0.04	🟡 80% (4/5) / ⏱️ 306.5s / 💰 $0.03	🟢 100% (5/5) / ⏱️ 22.2s / 💰 $0.13
243_pod_names_contain_service 🔗	🟢 100% (5/5) / ⏱️ 32.9s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 25.7s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 27.0s / 💰 $0.08	🟡 60% (3/5) / ⏱️ 16.6s / 💰 $0.02	🟢 100% (5/5) / ⏱️ 31.3s / 💰 $0.07	🟢 100% (5/5) / ⏱️ 27.8s / 💰 $0.06	🟢 100% (5/5) / ⏱️ 38.3s / 💰 $0.26	🟢 100% (5/5) / ⏱️ 40.5s / 💰 $0.23	🟡 40% (2/5) / ⏱️ 36.8s / 💰 $0.04	🔴 0% (0/5) / ⏱️ 7.2s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 40.8s / 💰 $0.18
24_misconfigured_pvc 🔗	🟢 100% (5/5) / ⏱️ 34.7s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 28.1s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 40.6s / 💰 $0.11	🟡 40% (2/5) / ⏱️ 14.2s / 💰 $0.02	🟡 80% (4/5) / ⏱️ 33.8s / 💰 $0.07	🟡 40% (2/5) / ⏱️ 14.5s / 💰 $0.04	🟢 100% (5/5) / ⏱️ 43.4s / 💰 $0.30	🟢 100% (5/5) / ⏱️ 32.3s / 💰 $0.32	🟡 80% (4/5) / ⏱️ 40.0s / 💰 $0.04	🔴 0% (0/5) / ⏱️ 5.7s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 41.3s / 💰 $0.19
43_current_datetime_from_prompt 🔗	🟡 80% (4/5) / ⏱️ 5.9s / 💰 $0.00	🔴 0% (0/5) / ⏱️ 4.8s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 22.4s / 💰 $0.06	🟢 100% (5/5) / ⏱️ 4.0s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 9.8s / 💰 $0.02	🟢 100% (5/5) / ⏱️ 5.2s / 💰 $0.02	🟢 100% (5/5) / ⏱️ 6.1s / 💰 $0.12	🟢 100% (5/5) / ⏱️ 12.1s / 💰 $0.10	🟢 100% (5/5) / ⏱️ 5.0s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 9.4s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 4.9s / 💰 $0.07
51_logs_summarize_errors 🔗	🟢 100% (5/5) / ⏱️ 14.9s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 13.2s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 25.3s / 💰 $0.06	🟢 100% (5/5) / ⏱️ 14.4s / 💰 $0.03	🟢 100% (5/5) / ⏱️ 37.0s / 💰 $0.05	🟢 100% (5/5) / ⏱️ 18.6s / 💰 $0.04	🟢 100% (5/5) / ⏱️ 24.7s / 💰 $0.20	🟢 100% (5/5) / ⏱️ 28.5s / 💰 $0.17	🟢 100% (5/5) / ⏱️ 21.9s / 💰 $0.02	🟡 20% (1/5) / ⏱️ 19.9s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 23.6s / 💰 $0.12
61_exact_match_counting 🔗	🟢 100% (5/5) / ⏱️ 7.8s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 8.4s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 15.4s / 💰 $0.04	🟢 100% (5/5) / ⏱️ 9.0s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 9.3s / 💰 $0.02	🟢 100% (5/5) / ⏱️ 10.5s / 💰 $0.03	🟢 100% (5/5) / ⏱️ 12.7s / 💰 $0.15	🟢 100% (5/5) / ⏱️ 15.7s / 💰 $0.09	🔴 0% (0/5) / ⏱️ 10.5s / 💰 $0.01	🟡 80% (4/5) / ⏱️ 26.1s / 💰 $0.02	🟢 100% (5/5) / ⏱️ 10.9s / 💰 $0.09
73a_time_window_anomaly 🔗	🟡 40% (2/5) / ⏱️ 30.7s / 💰 $0.01	🟡 80% (4/5) / ⏱️ 28.0s / 💰 $0.01	🟡 80% (4/5) / ⏱️ 34.4s / 💰 $0.11	🟡 20% (1/5) / ⏱️ 14.4s / 💰 $0.02	🟢 100% (5/5) / ⏱️ 35.4s / 💰 $0.10	🟡 20% (1/5) / ⏱️ 29.2s / 💰 $0.06	🟢 100% (5/5) / ⏱️ 70.1s / 💰 $0.38	🟢 100% (5/5) / ⏱️ 37.9s / 💰 $0.25	🟡 40% (2/5) / ⏱️ 28.0s / 💰 $0.02	🔴 0% (0/5) / ⏱️ 13.7s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 51.6s / 💰 $0.21
73b_time_window_anomaly 🔗	🟡 60% (3/5) / ⏱️ 40.0s / 💰 $0.01	🟢 100% (5/5) / ⏱️ 29.1s / 💰 $0.01	🟡 60% (3/5) / ⏱️ 43.4s / 💰 $0.14	🟡 40% (2/5) / ⏱️ 16.4s / 💰 $0.03	🟢 100% (5/5) / ⏱️ 38.8s / 💰 $0.10	🟡 20% (1/5) / ⏱️ 22.8s / 💰 $0.05	🟢 100% (5/5) / ⏱️ 70.5s / 💰 $0.39	🟢 100% (5/5) / ⏱️ 36.0s / 💰 $0.26	🟡 40% (2/5) / ⏱️ 25.4s / 💰 $0.02	🔴 0% (0/5) / ⏱️ 11.4s / 💰 $0.00	🟢 100% (5/5) / ⏱️ 49.9s / 💰 $0.21
96_no_matching_skill 🔗	🟢 100% (5/5) / ⏱️ 63.6s / 💰 $0.02	🟢 100% (5/5) / ⏱️ 49.1s / 💰 $0.02	🟡 60% (3/5) / ⏱️ 81.6s / 💰 $0.33	🟡 60% (3/5) / ⏱️ 48.9s / 💰 $0.03	🟢 100% (5/5) / ⏱️ 46.8s / 💰 $0.15	🟢 100% (5/5) / ⏱️ 40.5s / 💰 $0.09	🟡 80% (4/5) / ⏱️ 129.5s / 💰 $0.55	🟢 100% (5/5) / ⏱️ 61.7s / 💰 $0.52	🟡 40% (2/5) / ⏱️ 44.9s / 💰 $0.05	🟡 20% (1/5) / ⏱️ 87.2s / 💰 $0.07	🟢 100% (5/5) / ⏱️ 59.9s / 💰 $0.26

Results are generated automatically and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-26358788343.