# HolmesGPT Evaluations
We use 150+ evaluations ('evals' for short) to benchmark HolmesGPT, map out areas for improvement, and compare performance across different models.
We also use the evals as regression tests on every commit.
View latest evaluation results →
## Benchmark Types
We run two types of benchmarks to balance speed and coverage:
| Benchmark | Markers | Purpose | Schedule |
|---|---|---|---|
| ⚡ Fast | `regression or benchmark` | Quick regression tests to catch breaking changes | Weekly (Sunday 2 AM UTC) |
| Full | `easy or medium or hard or regression or benchmark` | Comprehensive testing across all difficulty levels | Manual / On-demand |

All results are stored in History. Fast benchmark results are marked with ⚡.
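If you want to reproduce either benchmark locally, the marker expressions above can be passed straight to pytest. This is a sketch that reuses the `RUN_LIVE`/poetry invocation from the Quick Start below, combined with the `llm` marker used there:

```bash
# Fast benchmark selection: regression and benchmark markers only
RUN_LIVE=true poetry run pytest -m 'llm and (regression or benchmark)' --no-cov

# Full benchmark selection: every difficulty level plus regression and benchmark
RUN_LIVE=true poetry run pytest -m 'llm and (easy or medium or hard or regression or benchmark)' --no-cov
```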
## Test Categories

- **Regression tests** (`regression`): Critical scenarios that must always pass
- **Benchmark tests** (`benchmark`): Tests included in the fast benchmark for quick validation
- **Easy tests** (`easy`): Straightforward scenarios for baseline validation
- **Medium tests** (`medium`): Moderately complex troubleshooting scenarios
- **Hard tests** (`hard`): Challenging multi-step investigations
- **Specialized tests**: Focused on specific capabilities (logs, kubernetes, prometheus, etc.)
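Each category name doubles as a pytest marker, so a single slice of the suite can be selected with `-m`. As a sketch (the specialized capability marker names may differ, so it is worth checking what is actually registered):

```bash
# List the markers registered in this repo
poetry run pytest --markers

# Run only the hard, multi-step investigations
RUN_LIVE=true poetry run pytest -m 'llm and hard' --no-cov
```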
## Quick Start

### Running Evaluations
```bash
# Prerequisites
poetry install --with=dev

# Run regression tests (should always pass)
RUN_LIVE=true poetry run pytest -m 'llm and regression' --no-cov

# Run a specific test
RUN_LIVE=true poetry run pytest tests/llm/test_ask_holmes.py -k "01_how_many_pods"

# Run with multiple iterations for more reliable results
RUN_LIVE=true ITERATIONS=10 poetry run pytest -m 'llm and easy'
```
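In these commands, `ITERATIONS` repeats each eval so that nondeterministic LLM answers average out into a more stable score, and `RUN_LIVE=true`, as the name suggests, runs the scenarios against a live environment rather than from recorded data, so a working cluster connection is assumed.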
→ Complete guide to running evaluations
### Adding New Tests
Create test scenarios to improve coverage:
```yaml
# test_case.yaml
user_prompt: 'Is the nginx pod healthy?'
expected_output:
  - nginx pod is healthy
before_test: kubectl apply -f ./manifest.yaml
after_test: kubectl delete -f ./manifest.yaml
```
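A new scenario typically lives in its own folder alongside the manifests it references, and the folder name becomes the test ID you can select with `-k`, as in the Quick Start. A sketch with a hypothetical scenario name (substitute your own):

```bash
# "99_nginx_pod_health" is a hypothetical scenario name; use your folder's name
RUN_LIVE=true ITERATIONS=5 poetry run pytest tests/llm/test_ask_holmes.py -k "99_nginx_pod_health"
```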
→ Guide to adding new evaluations
### Analyzing Results
Track and debug evaluation results with Braintrust:
→ Reporting and analysis guide
## Automated Benchmarking
Our CI/CD pipeline runs evaluations automatically:
- **Weekly** - Every Sunday at 2 AM UTC (comprehensive testing with 10 iterations)
- **Pull Requests** - When eval-related files are modified (quick validation)
- **On-demand** - Via GitHub Actions UI
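For on-demand runs, the GitHub CLI is an alternative to clicking through the Actions UI. This is a sketch with a placeholder workflow file name, so check `.github/workflows/` for the real one:

```bash
# <eval-workflow>.yaml is a placeholder; look up the actual workflow file name
gh workflow run <eval-workflow>.yaml
```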
Results are published here and archived in history.
## Model Comparison
Compare different LLMs to find the best for your use case:
```bash
# Test multiple models in one run
RUN_LIVE=true MODEL=gpt-4o,anthropic/claude-sonnet-4-20250514 \
  CLASSIFIER_MODEL=gpt-4o \
  poetry run pytest -m 'llm and easy'
```
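`MODEL` takes a comma-separated list of models to evaluate in a single run, while `CLASSIFIER_MODEL` stays pinned to one model (the one that scores the answers) so that grading remains consistent across the models being compared.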
See the latest results for current model performance comparisons.
## Benchmarking New Models

Want to benchmark a new LLM? Follow our step-by-step guide:
→ Guide to benchmarking new models
## Resources
- Running Evaluations - Complete guide to running tests
- Adding New Evaluations - Contribute test scenarios
- Benchmarking New Models - Step-by-step guide for testing new LLMs
- Reporting with Braintrust - Analyze results in detail
- Historical Results - Past benchmark data