HolmesGPT Evaluations
We use 150+ evaluations ('evals' for short) to benchmark HolmesGPT, map out areas for improvement, and compare performance across different models.
We also use the evals as regression tests on every commit.
View latest evaluation results →
Test Categories
- Regression tests (easy): Scenarios that must always pass
- Advanced tests (medium and hard): More challenging scenarios
- Specialized tests: Focused on specific capabilities (logs, kubernetes, prometheus, etc.)
Quick Start
Running Evaluations
# Prerequisites
poetry install --with=dev
# Run regression tests (should always pass)
RUN_LIVE=true poetry run pytest -m 'llm and easy' --no-cov
# Run a specific test
RUN_LIVE=true poetry run pytest tests/llm/test_ask_holmes.py -k "01_how_many_pods"
# Run with multiple iterations for reliable results
RUN_LIVE=true ITERATIONS=10 poetry run pytest -m 'llm and easy'
→ Complete guide to running evaluations
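LLM output can vary between runs, so a single pass or fail says little about how reliable a test is; repeating each test with ITERATIONS gives a pass rate instead. The following is a minimal sketch of that idea, not part of HolmesGPT: the `pass_rates` helper, the `(test_id, passed)` input shape, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def pass_rates(outcomes):
    """Return {test_id: fraction of iterations that passed}.

    `outcomes` is an iterable of (test_id, passed) pairs, one per iteration.
    This input shape is assumed for illustration; it is not HolmesGPT's
    report format.
    """
    runs = Counter()
    passes = Counter()
    for test_id, passed in outcomes:
        runs[test_id] += 1
        passes[test_id] += int(bool(passed))
    return {test_id: passes[test_id] / runs[test_id] for test_id in runs}


# Hypothetical outcomes: four iterations of one test, two of another.
sample = [
    ("01_how_many_pods", True),
    ("01_how_many_pods", True),
    ("01_how_many_pods", False),
    ("01_how_many_pods", True),
    ("hypothetical_test", True),
    ("hypothetical_test", True),
]

for test_id, rate in pass_rates(sample).items():
    print(f"{test_id}: {rate:.0%} of iterations passed")
```

A test that passes only some of its iterations is a candidate for closer inspection, even if a single run happened to succeed.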
Adding New Tests
Create test scenarios to improve coverage:
# test_case.yaml
user_prompt: 'Is the nginx pod healthy?'
expected_output:
- nginx pod is healthy
before_test: kubectl apply -f ./manifest.yaml
after_test: kubectl delete -f ./manifest.yaml
→ Guide to adding new evaluations
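Before running a new scenario against a live cluster, it can help to sanity-check its fields. The sketch below checks a test case that has already been parsed into a Python mapping. The `check_test_case` helper is hypothetical, and its rules are inferred only from the example above (treating `before_test` and `after_test` as optional is an assumption); the real schema may accept more than this.

```python
def check_test_case(case):
    """Return a list of problems found in a parsed test_case.yaml mapping.

    The rules are inferred from the documented example, not from
    HolmesGPT's actual schema.
    """
    problems = []

    prompt = case.get("user_prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("user_prompt must be a non-empty string")

    expected = case.get("expected_output")
    if (
        not isinstance(expected, list)
        or not expected
        or not all(isinstance(item, str) for item in expected)
    ):
        problems.append("expected_output must be a non-empty list of strings")

    # Assumption: setup and teardown commands are optional strings.
    for key in ("before_test", "after_test"):
        if key in case and not isinstance(case[key], str):
            problems.append(f"{key} must be a string")

    return problems


# The example scenario from above, as it would look once parsed.
example = {
    "user_prompt": "Is the nginx pod healthy?",
    "expected_output": ["nginx pod is healthy"],
    "before_test": "kubectl apply -f ./manifest.yaml",
    "after_test": "kubectl delete -f ./manifest.yaml",
}

print(check_test_case(example) or "no problems found")
```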
Analyzing Results
Track and debug evaluation results with Braintrust:
→ Reporting and analysis guide
Automated Benchmarking
Our CI/CD pipeline runs evaluations automatically:
- Weekly - Every Sunday at 2 AM UTC (comprehensive testing with 10 iterations)
- Pull Requests - When eval-related files are modified (quick validation)
- On-demand - Via GitHub Actions UI
Results are published here and archived in history.
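To make the weekly schedule concrete: "every Sunday at 2 AM UTC" corresponds to the cron expression `0 2 * * 0`, although whether the pipeline is configured with exactly that expression is an assumption. The sketch below computes when the next weekly run would start; `next_weekly_run` is a hypothetical helper, not part of the HolmesGPT tooling.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_run(now):
    """Return the next Sunday 02:00 UTC strictly after `now`.

    `now` must be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=2, minute=0, second=0, microsecond=0)
    # datetime.weekday() numbers Monday as 0 and Sunday as 6.
    candidate += timedelta(days=(6 - now.weekday()) % 7)
    if candidate <= now:
        # This week's slot has already passed, so move to next Sunday.
        candidate += timedelta(days=7)
    return candidate


print(next_weekly_run(datetime.now(timezone.utc)).isoformat())
```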
Model Comparison
Compare different LLMs to find the best one for your use case:
# Test multiple models in one run
RUN_LIVE=true MODEL=gpt-4o,anthropic/claude-sonnet-4-20250514 \
CLASSIFIER_MODEL=gpt-4o \
poetry run pytest -m 'llm and easy'
See the latest results for current model performance comparisons.
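Comparing several models multiplies the number of LLM calls, which is worth estimating before a large run. The sketch below splits a comma-separated MODEL value and enumerates the resulting combinations. Both helpers are hypothetical, and the assumption that every test runs once per model per iteration is illustrative rather than a description of how the test harness actually schedules work.

```python
from itertools import product


def split_models(value):
    """Split a comma-separated MODEL value into a list of model names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def planned_runs(models, test_ids, iterations=1):
    """List every (model, test_id, iteration) combination.

    Assumes each test runs once per model per iteration.
    """
    return list(product(models, test_ids, range(1, iterations + 1)))


models = split_models("gpt-4o,anthropic/claude-sonnet-4-20250514")
runs = planned_runs(models, ["01_how_many_pods"], iterations=10)
print(f"{len(models)} models -> {len(runs)} runs")
```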
Resources
- Running Evaluations - Complete guide to running tests
- Adding New Evaluations - Contribute test scenarios
- Reporting with Braintrust - Analyze results in detail
- Historical Results - Past benchmark data