# Using Mock Data in LLM Tests
This document describes how mock data is used in HolmesGPT's LLM evaluation tests. Live evaluations (`RUN_LIVE=true`) are strongly preferred because they are more reliable and accurate.
## Why Live Evaluations Are Preferred
LLMs can take multiple paths to reach the same conclusion. When using mock data:

- The LLM might call tools in a different order than when the mocks were generated
- It might use different tool combinations to diagnose the same issue
- It might ask for additional information not captured in the mocks
- Mock data represents only one possible investigation path
With live evaluations, the LLM can explore any path it chooses, making tests more robust and realistic.
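To make the brittleness concrete, here is a minimal, hypothetical sketch (not HolmesGPT's actual implementation) of strict mock lookup: a registry keyed by tool name fails as soon as the LLM picks a tool the recorded run never used, even when that tool would have diagnosed the same issue.

```python
# Hypothetical sketch: a strict mock registry keyed by tool name.
# Tool names and responses below are invented for illustration.

class MockToolRegistry:
    def __init__(self, recorded_responses):
        # recorded_responses: {tool_name: canned_output} captured in one live run
        self.recorded = recorded_responses

    def call(self, tool_name):
        if tool_name not in self.recorded:
            raise KeyError(
                f"no mock for tool '{tool_name}': the LLM took a "
                "different investigation path than the recorded one"
            )
        return self.recorded[tool_name]


mocks = MockToolRegistry({"kubectl_describe_pod": "Reason: OOMKilled"})
print(mocks.call("kubectl_describe_pod"))  # the recorded path works
try:
    mocks.call("kubectl_logs")  # an equally valid path, but never recorded
except KeyError as err:
    print("mock miss:", err)
```

A live run has no such registry, so any valid path the LLM chooses can succeed.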
## When Mock Data Is Necessary
Mock data is sometimes unavoidable:

- CI/CD environments without access to a Kubernetes cluster
- Testing specific edge cases that require controlled responses
- Reproducing exact historical scenarios
**Important:** Even when using mocks, always validate with `RUN_LIVE=true` in a real environment.
## Mock Data Structure
Mock files are stored in `tests/llm/fixtures/{test_name}/` directories:
- Each test has mock tool responses and expected outputs
- Mock responses are YAML files named to match the tools they mock
- Outputs are scored automatically using an LLM-as-judge
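For illustration only, a mock response file might look roughly like this. The file name, field names, and values are assumptions for the sketch, not HolmesGPT's exact fixture schema; always inspect files produced by `--generate-mocks` for the real format.

```yaml
# tests/llm/fixtures/{test_name}/kubectl_describe_pod.yaml  (hypothetical)
tool_name: kubectl_describe_pod      # matches the tool being mocked
match_params:
  pod: payment-api-7d9f              # invented example pod name
response: |
  State:       Waiting
  Reason:      CrashLoopBackOff
```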
## Generating Mock Data
```bash
# Generate mocks for a specific test
poetry run pytest tests/llm/test_ask_holmes.py -k "test_name" --generate-mocks

# Regenerate all mock files
poetry run pytest tests/llm/test_ask_holmes.py --regenerate-all-mocks
```
## Mock Data Guidelines
When creating mock data:
- Never create mock data by hand; always use `--generate-mocks` with live execution
- Mock data should match real-world responses exactly
- Include all fields that would be present in actual responses
- Maintain proper timestamps and data relationships
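As a sanity check on generated fixtures, a small validator along these lines can catch mocks that are missing fields a real response would contain. This is a hypothetical sketch; the required field set (`tool_name`, `output`, `timestamp`) is an assumed schema, not HolmesGPT's actual one.

```python
# Hypothetical sketch: flag mock responses missing fields that real
# tool responses would carry. REQUIRED_FIELDS is an assumed schema.

REQUIRED_FIELDS = {"tool_name", "output", "timestamp"}

def validate_mock(mock: dict) -> list[str]:
    """Return a list of problems; an empty list means the mock looks complete."""
    return [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - mock.keys())]


complete = {
    "tool_name": "kubectl_get_pods",
    "output": "payment-api-7d9f   0/1   CrashLoopBackOff",
    "timestamp": "2024-01-01T00:00:00Z",
}
truncated = {"tool_name": "kubectl_get_pods", "output": "..."}

print(validate_mock(complete))   # []
print(validate_mock(truncated))  # ['missing field: timestamp']
```

A check like this could run as a pre-commit step so incomplete fixtures never reach CI.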
## Important Notes
- Mock data captures only one investigation path; LLMs may take completely different approaches to reach the same conclusion
- Tests with mocks often fail when the LLM chooses a different but equally valid investigation strategy
- Mock execution misses the dynamic nature of real troubleshooting
- Always develop and validate tests with `RUN_LIVE=true`
- Mock data becomes stale as APIs and tool behaviors evolve
## Testing Workflow
1. Develop the test with `RUN_LIVE=true`
2. Generate mocks if needed: `--generate-mocks`
3. Validate that mock execution matches live behavior
4. Always use `RUN_LIVE=true` for final validation