Adding a New Eval¶
Create test cases that measure HolmesGPT's diagnostic accuracy and help track improvements over time.
Prerequisites¶
Install HolmesGPT's Python dependencies:
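The exact install command is not shown here; assuming the Poetry-based workflow used by every command on this page, installation would typically be:

```bash
# Install dependencies with Poetry (assumed from the `poetry run` commands used below)
poetry install
```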
Quick Start: Running Your First Eval¶
Try running an existing eval to understand how the system works. We'll use eval `80_pvc_storage_class_mismatch` as an example:
```bash
# Run eval #80 with Claude Sonnet 4.5 (this specific eval passes reliably with Sonnet 4.5)
RUN_LIVE=true MODEL=anthropic/claude-sonnet-4-20250514 \
  CLASSIFIER_MODEL=gpt-4.1 \
  poetry run pytest tests/llm/test_ask_holmes.py -k "80_pvc_storage_class_mismatch"

# Compare with GPT-4o (may not pass as reliably)
RUN_LIVE=true MODEL=gpt-4o \
  poetry run pytest tests/llm/test_ask_holmes.py -k "80_pvc_storage_class_mismatch"

# Compare with GPT-4.1 (may not pass as reliably)
RUN_LIVE=true MODEL=gpt-4.1 \
  poetry run pytest tests/llm/test_ask_holmes.py -k "80_pvc_storage_class_mismatch"

# Test multiple models at once to compare performance
RUN_LIVE=true MODEL=gpt-4o,gpt-4.1,anthropic/claude-sonnet-4-20250514 \
  CLASSIFIER_MODEL=gpt-4.1 \
  poetry run pytest tests/llm/test_ask_holmes.py -k "80_pvc_storage_class_mismatch"
```
Note: Eval #80 shows how much model choice matters: Sonnet 4.5 passes this specific eval reliably, while weaker models like GPT-4o and GPT-4.1 may struggle with the same scenario.
Quick Start¶

1. Create the test folder: `tests/llm/fixtures/test_ask_holmes/99_your_test/`

2. Create `test_case.yaml` (see the configuration fields below).

3. Create `manifest.yaml` with your test scenario.

4. Run the test:

    ```bash
    # With GPT-4.1
    RUN_LIVE=true MODEL=gpt-4.1 \
      poetry run pytest tests/llm/test_ask_holmes.py -k "99_your_test" -v

    # With Claude Sonnet 4.5 (must set CLASSIFIER_MODEL since Anthropic models can't be used as classifiers)
    RUN_LIVE=true MODEL=anthropic/claude-sonnet-4-20250514 \
      CLASSIFIER_MODEL=gpt-4.1 \
      poetry run pytest tests/llm/test_ask_holmes.py -k "99_your_test" -v
    ```
Note on CLASSIFIER_MODEL: An LLM judges whether tests pass. Only OpenAI models (like gpt-4.1) work as classifiers. Set CLASSIFIER_MODEL=gpt-4.1 explicitly when using Anthropic models. For OpenAI models, it defaults to MODEL.
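To make the `test_case.yaml` and `manifest.yaml` files from the Quick Start concrete, here is a minimal sketch of both. All specific values (the prompt text, the namespace, the broken nginx image) are illustrative, not taken from the HolmesGPT repo:

```yaml
# test_case.yaml -- minimal sketch using the required fields described below
user_prompt: "Why is the pod crash-looping in namespace app-99?"  # hypothetical question
expected_output:
  - "The image tag does not exist"  # elements the answer must contain
before_test: kubectl apply -f manifest.yaml
after_test: kubectl delete -f manifest.yaml
```

```yaml
# manifest.yaml -- hypothetical scenario: a pod referencing a nonexistent image tag
apiVersion: v1
kind: Pod
metadata:
  name: broken-pod
  namespace: app-99
spec:
  containers:
    - name: app
      image: nginx:does-not-exist
```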
test_case.yaml Configuration¶
Configure your test by defining these fields in test_case.yaml:
Required Fields¶
- `user_prompt`: Question for Holmes
- `expected_output`: List of required elements in the response
- `before_test` / `after_test`: Setup/teardown commands (run with `RUN_LIVE=true`)
Optional Fields¶
- `tags`: List of test markers (e.g., `[easy, kubernetes, logs]`)
- `skip`: Boolean to skip the test
- `skip_reason`: Explanation of why the test is skipped
- `mocked_date`: Override system time for the test (e.g., `"2025-06-23T11:34:00Z"`)
- `cluster_name`: Specify the Kubernetes cluster name
- `include_files`: List of files to include in context (like the CLI's `--include` flag)
- `runbooks`: Override the runbook catalog
- `toolsets`: Configure toolsets (can also use a separate `toolsets.yaml` file)
- `port_forwards`: Configure port forwarding for tests
- `test_env_vars`: Environment variables set during test execution
- `conversation_history`: For multi-turn conversation tests
- `expected_sections`: For investigation tests only
Advanced Features¶
Toolsets Configuration¶
You can configure which toolsets are available during your test in two ways:
1. Inline in `test_case.yaml`

2. In a separate `toolsets.yaml` file (preferred for complex configurations; see the Toolset Configuration section below)
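As an example, the inline form embeds the same structure shown in the Toolset Configuration section below directly into `test_case.yaml` (the Prometheus URL is illustrative):

```yaml
# test_case.yaml -- toolsets configured inline alongside the other test fields
toolsets:
  prometheus/metrics:
    enabled: true
    config:
      prometheus_url: "http://custom-prometheus:9090"  # hypothetical in-cluster URL
```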
Port Forwarding¶
Some tests require access to services that are not directly exposed. You can configure port forwards that will be automatically set up and torn down for your test:
```yaml
port_forwards:
  - namespace: app-01
    service: rabbitmq
    local_port: 15672
    remote_port: 15672
  - namespace: app-01
    service: prometheus
    local_port: 9090
    remote_port: 9090
```
Port forwards are:
- Automatically started before any tests run
- Shared across all tests in a session to avoid conflicts
- Always cleaned up after tests complete, even if tests are interrupted
- Run regardless of the `--skip-setup` or `--skip-cleanup` flags
Important notes:
- Use unique local ports across all tests to avoid conflicts
- Port forwards persist for the entire test session
- If a port is already in use, the test will fail with helpful debugging information
- Use `lsof -ti :<port>` to find processes using a port
- Port forwards work with all test modes
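For instance, a quick pre-flight check for a local port before a test run might look like this (15672 is the hypothetical RabbitMQ port from the example above; the `command -v` guard handles machines without `lsof`):

```shell
# Check whether a local port is already taken before running the eval
port=15672  # hypothetical local port from the port_forwards example above
if command -v lsof >/dev/null 2>&1 && lsof -ti :"$port" >/dev/null 2>&1; then
  status="in-use"
else
  status="free"
fi
echo "port $port is $status"
```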
Toolset Configuration¶
Create toolsets.yaml to customize available tools:
```yaml
toolsets:
  prometheus/metrics:
    enabled: true
    config:
      prometheus_url: "http://custom-prometheus:9090"
  grafana/dashboards:
    enabled: false  # Disable specific toolsets
```
Custom Runbooks¶
```yaml
runbooks:
  catalog:
    - description: "DNS troubleshooting"
      link: "dns-runbook.md"  # Place the .md file in the test directory
```
Options:
- No `runbooks` field: Use the default runbooks
- `runbooks: {}`: No runbooks available
- `runbooks: {catalog: [...]}`: Custom catalog
Tagging¶
Evals support tags for organization, filtering, and reporting purposes. Tags help categorize tests by their characteristics and enable selective test execution.
Available Tags¶
The valid tags are defined in the test constants file in the repository.
Some examples:

- `logs`: Tests HolmesGPT's ability to find and interpret logs correctly
- `context_window`: Tests handling of data that exceeds the LLM's context window
- `datetime`: Tests date/time handling and interpretation
- etc.
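Assuming the tags are wired up as pytest markers (an assumption; confirm against the test constants file), a selective run would look something like:

```bash
# Run only evals tagged `logs` (hypothetical marker name)
RUN_LIVE=true MODEL=gpt-4.1 \
  poetry run pytest tests/llm/test_ask_holmes.py -m "logs"
```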