HolmesGPT Evaluations

We use 150+ evaluations ('evals' for short) to benchmark HolmesGPT, map out areas for improvement, and compare performance across different models.

We also use the evals as regression tests on every commit.

View latest evaluation results →

Benchmark Types

We run two types of benchmarks to balance speed and coverage:

Benchmark | Markers | Purpose | Schedule
⚡ Fast | regression or benchmark | Quick regression tests to catch breaking changes | Weekly (Sunday 2 AM UTC)
Full | easy or medium or hard or regression or benchmark | Comprehensive testing across all difficulty levels | Manual / On-demand

All results are stored in History. Fast benchmark results are marked with ⚡.
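
The two benchmark suites can also be reproduced locally with pytest marker expressions; a minimal sketch, assuming the marker expressions implied by the table above:

# Fast benchmark: only tests tagged regression or benchmark
RUN_LIVE=true poetry run pytest -m 'llm and (regression or benchmark)' --no-cov

# Full benchmark: all difficulty levels plus the regression and benchmark sets
RUN_LIVE=true poetry run pytest -m 'llm and (easy or medium or hard or regression or benchmark)' --no-cov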

Test Categories

  • Regression tests (regression): Critical scenarios that must always pass
  • Benchmark tests (benchmark): Tests included in the fast benchmark for quick validation
  • Easy tests (easy): Straightforward scenarios for baseline validation
  • Medium tests (medium): Moderately complex troubleshooting scenarios
  • Hard tests (hard): Challenging multi-step investigations
  • Specialized tests: Focused on specific capabilities (logs, kubernetes, prometheus, etc.)
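
Each category maps to a pytest marker. Before filtering, pytest can list everything that is registered; the logs marker below is assumed from the specialized-capability examples above:

# List all markers registered for this repo
poetry run pytest --markers

# Filter on a single specialized capability (marker name assumed from the list above)
RUN_LIVE=true poetry run pytest -m 'llm and logs' --no-cov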

Quick Start

Running Evaluations

# Prerequisites
poetry install --with=dev

# Run regression tests (should always pass)
RUN_LIVE=true poetry run pytest -m 'llm and regression' --no-cov

# Run specific test
RUN_LIVE=true poetry run pytest tests/llm/test_ask_holmes.py -k "01_how_many_pods"

# Run with multiple iterations for reliable results
RUN_LIVE=true ITERATIONS=10 poetry run pytest -m 'llm and easy'

→ Complete guide to running evaluations

Adding New Tests

Create test scenarios to improve coverage:

# test_case.yaml
user_prompt: 'Is the nginx pod healthy?'
expected_output:
  - nginx pod is healthy
before_test: kubectl apply -f ./manifest.yaml
after_test: kubectl delete -f ./manifest.yaml
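
Each scenario lives in its own folder alongside the existing ones, holding the YAML above plus any manifests it references; the path below is an assumption, so mirror the layout of a neighbouring test:

# Hypothetical scenario folder; copy the structure of an existing one
mkdir -p tests/llm/fixtures/test_ask_holmes/99_nginx_health
# Place test_case.yaml and manifest.yaml inside, then run just that scenario
RUN_LIVE=true poetry run pytest tests/llm/test_ask_holmes.py -k "99_nginx_health"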

→ Guide to adding new evaluations

Analyzing Results

Track and debug evaluation results with Braintrust:

export BRAINTRUST_API_KEY=your-key
RUN_LIVE=true poetry run pytest -m 'llm and easy'

→ Reporting and analysis guide

Automated Benchmarking

Our CI/CD pipeline runs evaluations automatically:

  • Weekly - Every Sunday at 2 AM UTC (comprehensive testing with 10 iterations)
  • Pull Requests - When eval-related files are modified (quick validation)
  • On-demand - Via GitHub Actions UI

Results are published here and archived in history.
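
If the workflow exposes a manual trigger (as the on-demand option suggests), it can also be started from the GitHub CLI instead of the Actions UI; a sketch, with the workflow file name as an assumption:

# Workflow file name is hypothetical; check .github/workflows/ for the real one
gh workflow run llm-evals.yaml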

Model Comparison

Compare different LLMs to find the best one for your use case:

# Test multiple models in one run
RUN_LIVE=true MODEL=gpt-4o,anthropic/claude-sonnet-4-20250514 \
  CLASSIFIER_MODEL=gpt-4o \
  poetry run pytest -m 'llm and easy'
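
The model list can also be combined with the ITERATIONS variable from the Quick Start section, so each model is scored over several runs; a sketch using only the variables shown above:

# Several iterations per model smooth out run-to-run variance
RUN_LIVE=true ITERATIONS=5 MODEL=gpt-4o,anthropic/claude-sonnet-4-20250514 \
  CLASSIFIER_MODEL=gpt-4o \
  poetry run pytest -m 'llm and easy' --no-cov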

See the latest results for current model performance comparisons.

Benchmarking New Models

Want to test a new LLM? Follow our step-by-step guide:

→ Guide to benchmarking new models

Resources