September 28, 2025
Generated: 2025-09-29 10:49 UTC
Total Duration: 1h 4m 41s
Iterations: 1
Judge (classifier) model: gpt-4.1
About this Benchmark
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| gpt-4o | 58 | 35 | 12 | 105 | 🟡 62% (58/93) |
| gpt-4.1 | 72 | 22 | 11 | 105 | 🟡 77% (72/94) |
| gpt-5 | 76 | 17 | 12 | 105 | 🟡 82% (76/93) |
| sonnet-4-20250514 | 88 | 6 | 11 | 105 | 🟡 94% (88/94) |
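The Success Rate column appears to be computed as Pass / (Pass + Fail), with skipped or errored runs left out of the denominator. The following is a minimal sketch of that arithmetic using the counts from the table above; the function name `success_rate` and the `RESULTS` dictionary are illustrative and not part of the benchmark tooling.

```python
# Counts copied from the Model Accuracy Comparison table:
# model -> (pass, fail, skip_or_error)
RESULTS = {
    "gpt-4o": (58, 35, 12),
    "gpt-4.1": (72, 22, 11),
    "gpt-5": (76, 17, 12),
    "sonnet-4-20250514": (88, 6, 11),
}


def success_rate(passed: int, failed: int) -> float:
    """Pass rate over executed tests only (assumed: skips/errors are excluded)."""
    executed = passed + failed
    return passed / executed if executed else 0.0


for model, (passed, failed, skipped) in RESULTS.items():
    rate = success_rate(passed, failed)
    print(f"{model}: {rate:.0%} ({passed}/{passed + failed}), "
          f"{skipped} skipped or errored")
```

Excluding skips keeps a model from being penalized for evals that could not run, but it also means the denominators differ slightly between models (93 versus 94).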
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| gpt-4o | 93 | $0.14 | $0.03 | $0.27 | $12.83 |
| gpt-4.1 | 94 | $0.09 | $0.03 | $0.41 | $8.69 |
| gpt-5 | 93 | $0.13 | $0.02 | $0.39 | $12.35 |
| sonnet-4-20250514 | 94 | $0.16 | $0.06 | $0.50 | $15.47 |
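Average cost per test does not account for how often a model succeeds. One way to combine the two tables is to divide each model's total cost by its number of passing evals. The sketch below does this; `cost_per_pass` is a hypothetical helper, and "cost per passing eval" is a derived metric, not one reported by the benchmark itself.

```python
# Figures copied from the accuracy and cost tables:
# model -> (passing evals, total cost in USD)
COSTS = {
    "gpt-4o": (58, 12.83),
    "gpt-4.1": (72, 8.69),
    "gpt-5": (76, 12.35),
    "sonnet-4-20250514": (88, 15.47),
}


def cost_per_pass(passed: int, total_cost: float) -> float:
    """Total spend divided by the number of passing evals (derived metric)."""
    if passed == 0:
        raise ValueError("no passing evals, cost per pass is undefined")
    return total_cost / passed


# Print models from cheapest to most expensive per passing eval.
for model, (passed, total_cost) in sorted(
    COSTS.items(), key=lambda item: cost_per_pass(*item[1])
):
    print(f"{model}: ${cost_per_pass(passed, total_cost):.2f} per passing eval")
```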
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| gpt-4o | 35.1 | 9.7 | 84.7 | 35.2 | 48.6 |
| gpt-4.1 | 35.0 | 7.0 | 80.4 | 34.6 | 58.5 |
| gpt-5 | 189.5 | 24.1 | 677.8 | 159.7 | 464.0 |
| sonnet-4-20250514 | 67.6 | 10.7 | 210.1 | 55.2 | 150.5 |
Performance by Tag
Success rate by test category and model:
| Tag | gpt-4o | gpt-4.1 | gpt-5 | sonnet-4-20250514 | Warnings |
|---|---|---|---|---|---|
| chain-of-causation | 🔴 0% (0/6) | 🔴 0% (0/6) | 🟡 67% (4/6) | 🟡 83% (5/6) | ⚠️ 8 skipped |
| context_window | 🟡 57% (4/7) | 🟡 71% (5/7) | 🟢 100% (7/7) | 🟡 86% (6/7) | |
| counting | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | |
| database | 🔴 0% (0/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | ⚠️ 12 skipped |
| datadog | 🟡 75% (3/4) | 🟢 100% (4/4) | 🟡 75% (3/4) | 🟢 100% (4/4) | |
| datetime | 🟡 75% (3/4) | 🟡 50% (2/4) | 🟢 100% (4/4) | 🟡 75% (3/4) | ⚠️ 8 skipped |
| easy | 🟡 94% (34/36) | 🟡 97% (35/36) | 🟡 78% (28/36) | 🟢 100% (36/36) | |
| hard | 🟡 14% (2/14) | 🟡 29% (4/14) | 🟡 57% (8/14) | 🟡 86% (12/14) | ⚠️ 24 skipped |
| kafka | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚠️ 8 skipped |
| kubernetes | 🟡 55% (26/47) | 🟡 77% (36/47) | 🟡 81% (38/47) | 🟡 94% (44/47) | ⚠️ 4 skipped |
| logs | 🟡 65% (17/26) | 🟡 73% (19/26) | 🟡 85% (22/26) | 🟡 85% (22/26) | ⚠️ 28 skipped |
| medium | 🟡 51% (22/43) | 🟡 75% (33/44) | 🟡 93% (40/43) | 🟡 91% (40/44) | ⚠️ 22 skipped |
| network | 🟡 25% (1/4) | 🟡 75% (3/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | |
| numerical | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| port-forward | 🟡 33% (3/9) | 🟡 56% (5/9) | 🟡 78% (7/9) | 🟡 78% (7/9) | |
| prometheus | 🟡 50% (2/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | |
| question-answer | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | 🟢 100% (4/4) | |
| runbooks | 🟡 33% (2/6) | 🟡 83% (5/6) | 🟢 100% (6/6) | 🟢 100% (6/6) | ⚠️ 4 skipped |
| slackbot | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - | ⚠️ 4 skipped |
| traces | 🔴 0% (0/5) | 🔴 0% (0/5) | 🟡 60% (3/5) | 🟡 80% (4/5) | |
| transparency | 🟡 50% (7/14) | 🟡 71% (10/14) | 🟡 86% (12/14) | 🟡 86% (12/14) | ⚠️ 4 skipped |
| Overall | 🟡 62% (58/93) | 🟡 77% (72/94) | 🟡 82% (76/93) | 🟡 94% (88/94) | ⚠️ 46 skipped |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
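The pass-rate colors in this legend can be expressed as a small mapping from pass counts to an icon. The sketch below is illustrative only: `status_icon` is a hypothetical helper, and treating a cell with no executed runs as "⚪️" is an assumption based on how such cells appear in the tables, since that icon is not listed in the legend.

```python
GREEN = "🟢"    # passing 100% (stable)
YELLOW = "🟡"   # passing 1-99%
RED = "🔴"      # passing 0% (failing)
NO_DATA = "⚪️"  # assumption: no executed runs for this cell


def status_icon(passed: int, total: int) -> str:
    """Map a pass count over executed runs to the legend's color icon."""
    if total == 0:
        return NO_DATA
    if passed == total:
        return GREEN
    if passed == 0:
        return RED
    return YELLOW


def format_cell(passed: int, total: int) -> str:
    """Render a cell in the same style as the tables, e.g. an icon plus 94% (88/94)."""
    if total == 0:
        return f"{NO_DATA} -"
    return f"{status_icon(passed, total)} {passed / total:.0%} ({passed}/{total})"


print(format_cell(88, 94))
print(format_cell(4, 4))
print(format_cell(0, 6))
print(format_cell(0, 0))
```

Under this rule a 94% result is still yellow, which matches the tables: green is reserved for cells where every executed run passed.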
Detailed Raw Results
| Eval ID | gpt-4o | gpt-4.1 | gpt-5 | sonnet-4-20250514 |
|---|---|---|---|---|
| 01_how_many_pods 🔗 | 🟢 100% (1/1) / ⏱️ 28.8s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 27.1s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 39.1s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 26.5s / 💰 $0.08 |
| 02_what_is_wrong_with_pod 🔗 | 🟢 100% (1/1) / ⏱️ 27.1s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 28.3s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 179.2s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 44.1s / 💰 $0.11 |
| 03_what_is_the_command_to_port_forward 🔗 | 🟢 100% (1/1) / ⏱️ 33.5s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 31.0s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 119.7s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 38.3s / 💰 $0.12 |
| 04_related_k8s_events 🔗 | 🟢 100% (1/1) / ⏱️ 31.1s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 29.7s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 84.5s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 44.0s / 💰 $0.09 |
| 05_image_version 🔗 | 🟢 100% (1/1) / ⏱️ 27.2s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 23.0s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 65.4s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 32.1s / 💰 $0.09 |
| 09_crashpod 🔗 | 🟢 100% (1/1) / ⏱️ 32.3s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 31.5s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 74.4s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 54.8s / 💰 $0.16 |
| 100a_historical_logs 🔗 | 🟢 100% (1/1) / ⏱️ 44.8s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 40.6s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 388.1s / 💰 $0.29 | 🟢 100% (1/1) / ⏱️ 130.5s / 💰 $0.28 |
| 100b_historical_logs_nonstandard_label 🔗 | 🔴 0% (0/1) / ⏱️ 36.1s / 💰 $0.15 | 🔴 0% (0/1) / ⏱️ 34.6s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 354.2s / 💰 $0.16 | 🔴 0% (0/1) / ⏱️ 151.0s / 💰 $0.23 |
| 101_historical_logs_pod_deleted 🔗 | 🔴 0% (0/1) / ⏱️ 40.2s / 💰 $0.12 | 🔴 0% (0/1) / ⏱️ 31.4s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 453.0s / 💰 $0.33 | 🔴 0% (0/1) / ⏱️ 139.6s / 💰 $0.46 |
| 103_logs_transparency_default_limit 🔗 | 🔴 0% (0/1) / ⏱️ 33.4s / 💰 $0.14 | 🔴 0% (0/1) / ⏱️ 46.9s / 💰 $0.41 | 🟢 100% (1/1) / ⏱️ 76.1s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 54.3s / 💰 $0.41 |
| 104a_postgres_root_issue 🔗 | 🔴 0% (0/1) / ⏱️ 36.3s / 💰 $0.17 | 🔴 0% (0/1) / ⏱️ 55.4s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 243.1s / 💰 $0.23 | 🟢 100% (1/1) / ⏱️ 92.4s / 💰 $0.21 |
| 107_log_filter_http_status_code 🔗 | 🟢 100% (1/1) / ⏱️ 37.0s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 38.2s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 304.7s / 💰 $0.26 | 🟢 100% (1/1) / ⏱️ 80.1s / 💰 $0.19 |
| 108_logs_nearby_lines 🔗 | 🔴 0% (0/1) / ⏱️ 38.9s / 💰 $0.15 | 🔴 0% (0/1) / ⏱️ 41.4s / 💰 $0.14 | 🔴 0% (0/1) / ⏱️ 417.1s / 💰 $0.22 | 🟢 100% (1/1) / ⏱️ 77.5s / 💰 $0.22 |
| 109_logs_transparency_not_found 🔗 | 🔴 0% (0/1) / ⏱️ 27.7s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 31.6s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 121.1s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 36.3s / 💰 $0.09 |
| 10_image_pull_backoff 🔗 | 🟢 100% (1/1) / ⏱️ 40.8s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 28.4s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 45.5s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 50.0s / 💰 $0.11 |
| 110_k8s_events_image_pull 🔗 | 🟢 100% (1/1) / ⏱️ 32.6s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 34.6s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 55.4s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 43.9s / 💰 $0.11 |
| 111_disabled_datadog_traces 🔗 | 🔴 0% (0/1) / ⏱️ 28.8s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 20.1s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 161.4s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 78.5s / 💰 $0.21 |
| 111_pod_names_contain_service 🔗 | 🟢 100% (1/1) / ⏱️ 34.3s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 38.4s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 207.8s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 71.3s / 💰 $0.21 |
| 112_find_pvcs_by_uuid 🔗 | 🔴 0% (0/1) / ⏱️ 30.2s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 39.4s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 100.6s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 49.9s / 💰 $0.14 |
| 114_checkout_latency_tracing_rebuild[0] 🔗 | 🔴 0% (0/1) / ⏱️ 40.2s / 💰 $0.25 | 🔴 0% (0/1) / ⏱️ 44.6s / 💰 $0.17 | 🔴 0% (0/1) / ⏱️ 443.6s / 💰 $0.36 | 🟢 100% (1/1) / ⏱️ 120.6s / 💰 $0.36 |
| 115_checkout_errors_tracing[0] 🔗 | 🔴 0% (0/1) / ⏱️ 43.1s / 💰 $0.25 | 🔴 0% (0/1) / ⏱️ 64.1s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 193.9s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 109.7s / 💰 $0.35 |
| 11_init_containers 🔗 | 🟢 100% (1/1) / ⏱️ 32.7s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 33.6s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 26.2s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 56.7s / 💰 $0.13 |
| 121_new_relic_checkout_errors_tracing[0] 🔗 | 🔴 0% (0/1) / ⏱️ 29.9s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 25.9s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 565.1s / 💰 $0.31 | 🔴 0% (0/1) / ⏱️ 141.8s / 💰 $0.28 |
| 122_new_relic_checkout_latency_tracing_rebuild[0] 🔗 | 🔴 0% (0/1) / ⏱️ 36.9s / 💰 $0.20 | 🔴 0% (0/1) / ⏱️ 40.9s / 💰 $0.12 | 🔴 0% (0/1) / ⏱️ 677.8s / 💰 $0.39 | 🟢 100% (1/1) / ⏱️ 118.7s / 💰 $0.33 |
| 123_new_relic_checkout_errors_tracing[0] 🔗 | 🔴 0% (0/1) / ⏱️ 32.5s / 💰 $0.13 | 🔴 0% (0/1) / ⏱️ 22.5s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 577.4s / 💰 $0.32 | 🟢 100% (1/1) / ⏱️ 97.5s / 💰 $0.29 |
| 12_job_crashing 🔗 | 🟢 100% (1/1) / ⏱️ 36.6s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 33.3s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 54.3s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 55.2s / 💰 $0.12 |
| 13a_pending_node_selector_basic 🔗 | 🟢 100% (1/1) / ⏱️ 35.2s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 50.5s / 💰 $0.08 | 🔴 0% (0/1) / ⏱️ 27.4s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 51.6s / 💰 $0.13 |
| 13b_pending_node_selector_detailed 🔗 | 🔴 0% (0/1) / ⏱️ 33.6s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 36.3s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 314.9s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 50.0s / 💰 $0.13 |
| 14_pending_resources 🔗 | 🟢 100% (1/1) / ⏱️ 37.6s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 37.6s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 39.8s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 56.2s / 💰 $0.12 |
| 159_prometheus_high_cardinality_cpu[0] 🔗 | 🟢 100% (1/1) / ⏱️ 30.4s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 58.5s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 304.1s / 💰 $0.22 | 🟢 100% (1/1) / ⏱️ 55.1s / 💰 $0.18 |
| 159_prometheus_high_cardinality_cpu[1] 🔗 | 🔴 0% (0/1) / ⏱️ 48.6s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 34.3s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 358.2s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 135.5s / 💰 $0.21 |
| 159_prometheus_high_cardinality_cpu[2] 🔗 | 🔴 0% (0/1) / ⏱️ 38.4s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 51.4s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 119.6s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 69.2s / 💰 $0.21 |
| 15_failed_readiness_probe 🔗 | 🟢 100% (1/1) / ⏱️ 44.0s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 32.3s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 236.9s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 132.8s / 💰 $0.15 |
| 16_failed_no_toolset_found 🔗 | 🔴 0% (0/1) / ⏱️ 24.4s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 23.7s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 36.9s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 22.3s / 💰 $0.06 |
| 17_oom_kill 🔗 | 🟢 100% (1/1) / ⏱️ 38.3s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 31.4s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 78.0s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 55.1s / 💰 $0.12 |
| 19_detect_missing_app_details 🔗 | 🔴 0% (0/1) / ⏱️ 50.5s / 💰 $0.22 | 🔴 0% (0/1) / ⏱️ 43.3s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 264.1s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 95.7s / 💰 $0.16 |
| 20_long_log_file_search 🔗 | 🟢 100% (1/1) / ⏱️ 39.1s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 42.0s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 97.6s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 77.2s / 💰 $0.12 |
| 21_job_fail_curl_no_svc_account 🔗 | 🟢 100% (1/1) / ⏱️ 43.9s / 💰 $0.27 | 🟢 100% (1/1) / ⏱️ 38.2s / 💰 $0.14 | 🔴 0% (0/1) / ⏱️ 26.8s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 54.3s / 💰 $0.22 |
| 23_app_error_in_current_logs 🔗 | 🟢 100% (1/1) / ⏱️ 40.1s / 💰 $0.26 | 🟢 100% (1/1) / ⏱️ 36.2s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 283.2s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 72.4s / 💰 $0.50 |
| 24_misconfigured_pvc 🔗 | 🟢 100% (1/1) / ⏱️ 39.7s / 💰 $0.23 | 🟢 100% (1/1) / ⏱️ 40.9s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 24.1s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 61.0s / 💰 $0.16 |
| 24a_misconfigured_pvc_basic 🔗 | 🟢 100% (1/1) / ⏱️ 40.4s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 67.0s / 💰 $0.23 | 🔴 0% (0/1) / ⏱️ 29.4s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 79.2s / 💰 $0.15 |
| 24b_misconfigured_pvc_detailed 🔗 | 🔴 0% (0/1) / ⏱️ 40.3s / 💰 $0.17 | 🔴 0% (0/1) / ⏱️ 37.2s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 29.3s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 64.2s / 💰 $0.14 |
| 25_misconfigured_ingress_class 🔗 | 🔴 0% (0/1) / ⏱️ 39.2s / 💰 $0.14 | 🔴 0% (0/1) / ⏱️ 45.1s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 296.2s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 117.6s / 💰 $0.31 |
| 26_page_render_times 🔗 | 🟢 100% (1/1) / ⏱️ 30.2s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 30.7s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 227.4s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 57.6s / 💰 $0.15 |
| 27a_multi_container_logs 🔗 | 🟢 100% (1/1) / ⏱️ 35.0s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 36.3s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 201.3s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 43.9s / 💰 $0.13 |
| 27b_multi_container_logs 🔗 | 🟢 100% (1/1) / ⏱️ 32.5s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 39.0s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 154.3s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 37.8s / 💰 $0.11 |
| 28_permissions_error 🔗 | 🟢 100% (1/1) / ⏱️ 21.2s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 23.9s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 124.6s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 25.7s / 💰 $0.07 |
| 33_cpu_metrics_discovery 🔗 | 🟢 100% (1/1) / ⏱️ 27.6s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 37.9s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 246.8s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 48.4s / 💰 $0.13 |
| 39_failed_toolset 🔗 | 🟢 100% (1/1) / ⏱️ 27.7s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 26.9s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 222.1s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 191.5s / 💰 $0.12 |
| 41_setup_argo 🔗 | 🟢 100% (1/1) / ⏱️ 19.4s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 31.5s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 170.7s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 19.0s / 💰 $0.06 |
| 42_dns_issues_result_new_tools_no_runbook 🔗 | 🔴 0% (0/1) / ⏱️ 34.9s / 💰 $0.24 | 🟢 100% (1/1) / ⏱️ 35.3s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 564.4s / 💰 $0.25 | 🟢 100% (1/1) / ⏱️ 140.3s / 💰 $0.20 |
| 42_dns_issues_steps_new_tools 🔗 | 🟢 100% (1/1) / ⏱️ 84.7s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 37.1s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 464.0s / 💰 $0.24 | 🟢 100% (1/1) / ⏱️ 210.1s / 💰 $0.31 |
| 43_current_datetime_from_prompt 🔗 | 🟢 100% (1/1) / ⏱️ 28.4s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 20.9s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 91.9s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 18.1s / 💰 $0.06 |
| 45_fetch_deployment_logs_simple 🔗 | 🟢 100% (1/1) / ⏱️ 29.5s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 32.5s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 108.3s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 35.9s / 💰 $0.09 |
| 50_logs_since_specific_date 🔗 | 🟢 100% (1/1) / ⏱️ 16.6s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 19.6s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 136.4s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 27.8s / 💰 $0.11 |
| 50a_logs_since_last_specific_month 🔗 | 🟢 100% (1/1) / ⏱️ 29.5s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 28.1s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 110.7s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 41.4s / 💰 $0.10 |
| 51_logs_summarize_errors 🔗 | 🟢 100% (1/1) / ⏱️ 29.4s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 32.8s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 89.4s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 42.8s / 💰 $0.10 |
| 52_logs_login_issues 🔗 | 🔴 0% (0/1) / ⏱️ 33.6s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 48.1s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 339.9s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 75.4s / 💰 $0.12 |
| 53_logs_find_term 🔗 | 🟢 100% (1/1) / ⏱️ 30.7s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 40.7s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 81.5s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 46.2s / 💰 $0.13 |
| 54_not_truncated_when_getting_pods 🔗 | 🟢 100% (1/1) / ⏱️ 36.5s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 40.5s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 188.7s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 69.8s / 💰 $0.16 |
| 57_wrong_namespace 🔗 | 🔴 0% (0/1) / ⏱️ 27.1s / 💰 $0.10 | 🔴 0% (0/1) / ⏱️ 27.8s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 111.4s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 41.5s / 💰 $0.10 |
| 59_label_based_counting 🔗 | 🟢 100% (1/1) / ⏱️ 26.1s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 27.1s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 47.0s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 166.8s / 💰 $0.08 |
| 60_count_less_than 🔗 | 🟢 100% (1/1) / ⏱️ 23.3s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 23.4s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 53.1s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 26.2s / 💰 $0.08 |
| 61_exact_match_counting 🔗 | 🟢 100% (1/1) / ⏱️ 36.2s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 25.4s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 55.7s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 29.5s / 💰 $0.07 |
| 62_fetch_error_logs_with_errors 🔗 | 🟢 100% (1/1) / ⏱️ 29.9s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 29.7s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 96.1s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 39.3s / 💰 $0.09 |
| 63_fetch_error_logs_no_errors 🔗 | 🟢 100% (1/1) / ⏱️ 38.5s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 29.9s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 100.5s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 37.1s / 💰 $0.09 |
| 64_keda_vs_hpa_confusion 🔗 | 🟢 100% (1/1) / ⏱️ 55.0s / 💰 $0.27 | 🔴 0% (0/1) / ⏱️ 31.9s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 241.5s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 72.1s / 💰 $0.25 |
| 65_health_check_followup 🔗 | 🟢 100% (1/1) / ⏱️ 35.2s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 40.6s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 224.3s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 76.3s / 💰 $0.24 |
| 71_connection_pool_starvation 🔗 | 🟢 100% (1/1) / ⏱️ 28.6s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 37.1s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 161.5s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 57.2s / 💰 $0.15 |
| 73a_time_window_anomaly 🔗 | 🟢 100% (1/1) / ⏱️ 47.4s / 💰 $0.21 | 🔴 0% (0/1) / ⏱️ 27.6s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 113.4s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 60.6s / 💰 $0.14 |
| 73b_time_window_anomaly 🔗 | 🔴 0% (0/1) / ⏱️ 40.6s / 💰 $0.17 | 🔴 0% (0/1) / ⏱️ 43.6s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 136.4s / 💰 $0.08 | 🔴 0% (0/1) / ⏱️ 55.1s / 💰 $0.13 |
| 76_service_discovery_issue 🔗 | 🔴 0% (0/1) / ⏱️ 34.5s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 45.6s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 231.2s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 68.2s / 💰 $0.20 |
| 77_liveness_probe_misconfiguration 🔗 | 🔴 0% (0/1) / ⏱️ 43.4s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 30.3s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 101.5s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 48.1s / 💰 $0.13 |
| 78a_missing_cpu_limits 🔗 | 🟢 100% (1/1) / ⏱️ 42.2s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 34.6s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 260.0s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 55.1s / 💰 $0.13 |
| 78b_cpu_quota_exceeded 🔗 | 🔴 0% (0/1) / ⏱️ 42.7s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 46.1s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 171.3s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 72.2s / 💰 $0.13 |
| 79_configmap_mount_issue 🔗 | 🟢 100% (1/1) / ⏱️ 28.6s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 32.6s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 153.7s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 49.2s / 💰 $0.12 |
| 80_pvc_storage_class_mismatch 🔗 | 🔴 0% (0/1) / ⏱️ 29.7s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 37.9s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 159.7s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 51.8s / 💰 $0.12 |
| 81_service_account_permission_denied 🔗 | 🟢 100% (1/1) / ⏱️ 35.8s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 43.6s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 165.4s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 67.1s / 💰 $0.20 |
| 82_pod_anti_affinity_conflict 🔗 | 🟢 100% (1/1) / ⏱️ 35.9s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 35.0s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 191.3s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 59.3s / 💰 $0.14 |
| 83_secret_not_found 🔗 | 🟢 100% (1/1) / ⏱️ 32.0s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 37.5s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 132.9s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 52.2s / 💰 $0.11 |
| 84_network_policy_blocking_traffic 🔗 | 🔴 0% (0/1) / ⏱️ 35.2s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 39.6s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 157.9s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 84.8s / 💰 $0.22 |
| 85_hpa_not_scaling 🔗 | 🔴 0% (0/1) / ⏱️ 34.6s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 42.6s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 195.0s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 62.4s / 💰 $0.17 |
| 86_configmap_like_but_secret 🔗 | 🔴 0% (0/1) / ⏱️ 44.9s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 43.9s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 184.3s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 46.7s / 💰 $0.12 |
| 89_runbook_missing_cloudwatch 🔗 | 🔴 0% (0/1) / ⏱️ 30.8s / 💰 $0.16 | 🟢 100% (1/1) / ⏱️ 22.3s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 315.4s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 42.8s / 💰 $0.10 |
| 90_runbook_basic_selection 🔗 | 🔴 0% (0/1) / ⏱️ 47.0s / 💰 $0.26 | 🟢 100% (1/1) / ⏱️ 46.1s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 383.5s / 💰 $0.32 | 🟢 100% (1/1) / ⏱️ 150.5s / 💰 $0.13 |
| 91f_datadog_logs_historical_pod 🔗 | 🔴 0% (0/1) / ⏱️ 38.9s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 80.4s / 💰 $0.36 | 🔴 0% (0/1) / ⏱️ 434.8s / 💰 $0.31 | 🟢 100% (1/1) / ⏱️ 69.6s / 💰 $0.15 |
| 93_calling_datadog[0] 🔗 | 🟢 100% (1/1) / ⏱️ 48.5s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 10.8s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 102.7s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 10.9s / 💰 $0.15 |
| 93_calling_datadog[1] 🔗 | 🟢 100% (1/1) / ⏱️ 45.3s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 8.7s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 40.1s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 10.7s / 💰 $0.15 |
| 93_calling_datadog[2] 🔗 | 🟢 100% (1/1) / ⏱️ 47.0s / 💰 $0.15 | 🟢 100% (1/1) / ⏱️ 9.8s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 42.7s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 15.2s / 💰 $0.15 |
| 94_runbook_transparency 🔗 | 🟢 100% (1/1) / ⏱️ 52.7s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 35.5s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 359.0s / 💰 $0.26 | 🟢 100% (1/1) / ⏱️ 83.1s / 💰 $0.23 |
| 96_no_matching_runbook 🔗 | 🔴 0% (0/1) / ⏱️ 36.3s / 💰 $0.16 | 🔴 0% (0/1) / ⏱️ 60.8s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 383.2s / 💰 $0.26 | 🟢 100% (1/1) / ⏱️ 90.1s / 💰 $0.26 |
| 97_logs_clarification_needed 🔗 | 🟢 100% (1/1) / ⏱️ 15.2s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 19.3s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 26.8s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 131.5s / 💰 $0.19 |
| 99_logs_transparency_custom_time 🔗 | 🟢 100% (1/1) / ⏱️ 45.3s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 33.6s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 67.9s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 43.0s / 💰 $0.10 |
| 93_events_since_specific_date 🔗 | ⚪️ - | 🟢 100% (1/1) / ⏱️ 13.2s / 💰 $0.07 | ⚪️ - | 🟢 100% (1/1) / ⏱️ 16.3s / 💰 $0.10 |
| 44_slack_statefulset_logs 🔗 | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 48_logs_since_thursday 🔗 | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 22_high_latency_dbi_down 🔗 | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 08_sock_shop_frontend 🔗 | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 104b_postgres_missing_index_pgstat 🔗 | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 104c_postgres_minimal_missing_index 🔗 | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 105_redis_wrong_data_structure 🔗 | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 156_kafka_opensearch_latency 🔗 | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 43_slack_deployment_logs 🔗 | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 55_kafka_runbook 🔗 | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
| 98_logs_transparency_default_time 🔗 | ⚪️ - | ⚪️ - | ⚪️ - | ⚪️ - |
Results are automatically generated and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment local-benchmark-20250927-230943.