HolmesGPT Evaluations

We use 150+ evaluations ('evals' for short) to benchmark HolmesGPT, map out areas for improvement, and compare performance across different models.

We also use the evals as regression tests on every commit.

View latest evaluation results →

Benchmark Types

We run two types of benchmarks to balance speed and coverage:

Benchmark | Markers | Purpose | Schedule
⚡ Fast | regression or benchmark | Quick regression tests to catch breaking changes | Weekly (Sunday 2 AM UTC)
Full | easy or medium or hard or regression or benchmark | Comprehensive testing across all difficulty levels | Manual / On-demand

All results are stored in History. Fast benchmark results are marked with ⚡.
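
The two benchmark suites can also be reproduced locally with pytest marker expressions; a minimal sketch, assuming the marker expressions implied by the table above:

# Fast benchmark: only tests tagged regression or benchmark
RUN_LIVE=true poetry run pytest -m 'llm and (regression or benchmark)' --no-cov

# Full benchmark: all difficulty levels plus the regression and benchmark sets
RUN_LIVE=true poetry run pytest -m 'llm and (easy or medium or hard or regression or benchmark)' --no-cov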

Test Categories

  • Regression tests (regression): Critical scenarios that must always pass
  • Benchmark tests (benchmark): Tests included in the fast benchmark for quick validation
  • Easy tests (easy): Straightforward scenarios for baseline validation
  • Medium tests (medium): Moderately complex troubleshooting scenarios
  • Hard tests (hard): Challenging multi-step investigations
  • Specialized tests: Focused on specific capabilities (logs, kubernetes, prometheus, etc.)
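
Each category maps to a pytest marker. Before filtering, pytest can list everything that is registered; the logs marker below is assumed from the specialized-capability examples above:

# List all markers registered for this repo
poetry run pytest --markers

# Filter on a single specialized capability (marker name assumed from the list above)
RUN_LIVE=true poetry run pytest -m 'llm and logs' --no-cov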

Quick Start

Running Evaluations

# Prerequisites
poetry install --with=dev

# Run regression tests (should always pass)
RUN_LIVE=true poetry run pytest -m 'llm and regression' --no-cov

# Run specific test
RUN_LIVE=true poetry run pytest tests/llm/test_ask_holmes.py -k "01_how_many_pods"

# Run with multiple iterations for reliable results
RUN_LIVE=true ITERATIONS=10 poetry run pytest -m 'llm and easy'

→ Complete guide to running evaluations

Adding New Tests

Create test scenarios to improve coverage:

# test_case.yaml
user_prompt: 'Is the nginx pod healthy?'
expected_output:
  - nginx pod is healthy
before_test: kubectl apply -f ./manifest.yaml
after_test: kubectl delete -f ./manifest.yaml
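
Each scenario lives in its own folder alongside the existing ones, holding the YAML above plus any manifests it references; the path below is an assumption, so mirror the layout of a neighbouring test:

# Hypothetical scenario folder; copy the structure of an existing one
mkdir -p tests/llm/fixtures/test_ask_holmes/99_nginx_health
# Place test_case.yaml and manifest.yaml inside, then run just that scenario
RUN_LIVE=true poetry run pytest tests/llm/test_ask_holmes.py -k "99_nginx_health"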

→ Guide to adding new evaluations

Analyzing Results

Track and debug evaluation results with Braintrust:

export BRAINTRUST_API_KEY=your-key
RUN_LIVE=true poetry run pytest -m 'llm and easy'

→ Reporting and analysis guide

Automated Benchmarking

Our CI/CD pipeline runs evaluations automatically:

  • Weekly - Every Sunday at 2 AM UTC (comprehensive testing with 10 iterations)
  • Pull Requests - When eval-related files are modified (quick validation)
  • On-demand - Via GitHub Actions UI

Results are published here and archived in history.
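
If the workflow exposes a manual trigger (as the on-demand option suggests), it can also be started from the GitHub CLI instead of the Actions UI; a sketch, with the workflow file name as an assumption:

# Workflow file name is hypothetical; check .github/workflows/ for the real one
gh workflow run llm-evals.yaml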

Model Comparison

Compare different LLMs to find the best one for your use case:

# Test multiple models in one run
RUN_LIVE=true MODEL=gpt-4o,anthropic/claude-sonnet-4-20250514 \
  CLASSIFIER_MODEL=gpt-4o \
  poetry run pytest -m 'llm and easy'
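
The model list can also be combined with the ITERATIONS variable from the Quick Start section, so each model is scored over several runs; a sketch using only the variables shown above:

# Several iterations per model smooth out run-to-run variance
RUN_LIVE=true ITERATIONS=5 MODEL=gpt-4o,anthropic/claude-sonnet-4-20250514 \
  CLASSIFIER_MODEL=gpt-4o \
  poetry run pytest -m 'llm and easy' --no-cov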

See the latest results for current model performance comparisons.

Benchmarking New Models

Want to test a new LLM? Follow our step-by-step guide:

→ Guide to benchmarking new models

Resources