Health Checks¶
HealthCheck resources run one-time health checks in Kubernetes. When you create a HealthCheck, the Holmes Operator immediately executes it through the Holmes API and stores the results in the resource's status.
What is a HealthCheck?¶
A HealthCheck is a Kubernetes Custom Resource that:
- Runs immediately when created
- Executes a natural language query using an LLM
- Stores results (pass/fail/error) in its status
- Can optionally send alerts to configured destinations
- Maintains an audit trail of check execution
- Can be re-run on demand using annotations
Creating a Simple Health Check¶
The simplest HealthCheck requires only a natural language query:
apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: check-pod-health
  namespace: default
spec:
  query: "Is the default namespace healthy? Check pod status, recent restarts, resource usage, and warning events."
Apply this check and view its status:
# Create the check
kubectl apply -f healthcheck.yaml
# View check status (short name: hc)
kubectl get hc
# Get detailed results
kubectl describe hc check-pod-health
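Because the check runs asynchronously after creation, a script may want to wait for it to finish before reading the status. The following is a minimal sketch, not part of Holmes: `wait_for_hc` is a hypothetical helper name, and the 2-second polling interval is an arbitrary choice. It relies only on the `phase` values documented under Status Fields.

```shell
# wait_for_hc NAME [NAMESPACE] [TIMEOUT_SECONDS]
# Hypothetical helper: polls .status.phase until the check reaches a
# terminal phase (Completed or Failed), then prints that phase.
wait_for_hc() {
  local name="$1" namespace="${2:-default}" timeout="${3:-60}"
  local waited=0 phase=""
  while [ "$waited" -lt "$timeout" ]; do
    # phase is empty until the operator has written a status;
    # "|| true" keeps polling if the lookup itself fails
    phase="$(kubectl get hc "$name" -n "$namespace" -o jsonpath='{.status.phase}' || true)"
    case "$phase" in
      Completed|Failed)
        echo "$phase"
        return 0
        ;;
    esac
    sleep 2
    waited=$((waited + 2))
  done
  echo "timed out waiting for $name (last phase: ${phase:-unknown})" >&2
  return 1
}
```

For example, `wait_for_hc check-pod-health default 120` blocks for up to two minutes and prints the final phase once the check stops running.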
Health Check with Alert Mode¶
To send notifications when a check fails, use alert mode with destinations:
apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: frontend-deployment-check
  namespace: production
spec:
  query: "Is the frontend deployment in production healthy? Check replicas, pod status, logs, and recent error rates."
  timeout: 60
  mode: alert
  destinations:
    - type: slack
      config:
        channel: "#production-alerts"
Spec Fields Reference¶
Required Fields¶
query (string, required)
Natural language question about system health. The LLM will analyze your cluster and answer this question.
- Min length: 1 character
- Max length: 5000 characters
- Example:
"Is the api deployment healthy? Check pod status, logs, and recent error rates."
Optional Fields¶
timeout (integer, optional)
Maximum execution time in seconds before the check is terminated.
- Default: 30 seconds
- Minimum: 1 second
- Maximum: 300 seconds (5 minutes)
- Example:
timeout: 120
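The documented limits for `query` and `timeout` can be checked locally before a manifest is generated or applied. The sketch below is a client-side convenience only: `validate_hc_spec` is a hypothetical helper, and it makes no assumption about how the operator or API server enforces the same limits.

```shell
# validate_hc_spec QUERY [TIMEOUT_SECONDS]
# Hypothetical pre-flight check using the limits from the field reference:
# query 1-5000 characters, timeout 1-300 seconds (default 30).
validate_hc_spec() {
  local query="$1" timeout="${2:-30}"
  # Note: ${#query} counts characters in bash but bytes in some other shells
  local length="${#query}"
  if [ "$length" -lt 1 ] || [ "$length" -gt 5000 ]; then
    echo "query must be 1-5000 characters (got $length)" >&2
    return 1
  fi
  # Reject empty or non-numeric timeouts before comparing them
  case "$timeout" in
    ''|*[!0-9]*)
      echo "timeout must be a whole number of seconds" >&2
      return 1
      ;;
  esac
  if [ "$timeout" -lt 1 ] || [ "$timeout" -gt 300 ]; then
    echo "timeout must be between 1 and 300 seconds (got $timeout)" >&2
    return 1
  fi
  return 0
}
```

A call such as `validate_hc_spec "Is the api deployment healthy?" 120` returns 0, while a timeout of `301` or an empty query returns 1 with a message on stderr.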
mode (string, optional)
Execution mode that determines whether alerts are sent:
- monitor (default): Results are stored but no alerts are sent
- alert: Sends notifications to configured destinations on check failure
model (string, optional)
Override the default LLM model for this specific check. Useful for testing different models or controlling costs.
- Example:
model: "anthropic/claude-sonnet-4-5-20250929"
- See AI Providers for supported models
destinations (array, optional)
List of alert destinations. Only used when mode: alert.
Each destination requires:
- type: Destination type (e.g., "slack", "pagerduty")
- config: Destination-specific configuration object
Example:
destinations:
  - type: slack
    config:
      channel: "#alerts"
  - type: pagerduty
    config:
      integration_key: "your-integration-key"
Status Fields¶
After execution, the HealthCheck status contains:
Execution Tracking¶
phase (string)
Current execution state:
- Pending: Check created, waiting to start
- Running: Check execution in progress
- Completed: Check finished successfully
- Failed: Check execution failed due to error
startTime (timestamp)
ISO 8601 timestamp when execution started.
completionTime (timestamp)
ISO 8601 timestamp when execution finished.
duration (number)
Total execution time in seconds.
Results¶
result (string)
The check outcome:
- pass: System is healthy based on the query
- fail: System is unhealthy based on the query
- error: Execution failed (network issue, timeout, etc.)
message (string)
Brief human-readable summary of the result.
Example: "All 3 replicas of 'frontend' deployment are ready"
rationale (string)
Detailed LLM explanation of the decision, including evidence and reasoning.
error (string)
Error details, populated when phase is Failed or result is error.
modelUsed (string)
The actual LLM model used for execution.
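Since `result` takes one of three values, it maps naturally onto a process exit code, which can be useful for gating a CI or deployment step. The following sketch assumes the check has already finished; `hc_gate` and its exit-code convention (0, 1, 2) are illustrative choices, not something Holmes defines.

```shell
# hc_gate NAME [NAMESPACE]
# Hypothetical helper: returns 0 on pass, 1 on fail, 2 for error or no result.
hc_gate() {
  local name="$1" namespace="${2:-default}"
  local result
  result="$(kubectl get hc "$name" -n "$namespace" -o jsonpath='{.status.result}')"
  case "$result" in
    pass)
      return 0
      ;;
    fail)
      # Print the brief summary stored alongside the result
      kubectl get hc "$name" -n "$namespace" -o jsonpath='{.status.message}' >&2
      echo >&2
      return 1
      ;;
    *)
      echo "check $name did not produce pass/fail (result: ${result:-none})" >&2
      return 2
      ;;
  esac
}
```

In a pipeline this might be used as `hc_gate frontend-deployment-check production || exit 1`, so that an unhealthy verdict stops the job.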
Notifications¶
notifications (array)
Status of alert delivery attempts when using mode: alert.
Conditions¶
Standard Kubernetes conditions track the check lifecycle:
conditions:
  - type: Complete
    status: "True"
    lastTransitionTime: "2024-01-01T00:00:00Z"
    reason: Pass
    message: "Check completed successfully"
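A single condition can be read with a JSONPath filter expression, in the same way as for built-in resources. This is a sketch only: `hc_condition` is a hypothetical helper, and `Complete` is the one condition type shown in the example above, so other types are not assumed to exist.

```shell
# hc_condition NAME [TYPE] [NAMESPACE]
# Hypothetical helper: prints the status ("True", "False", "Unknown")
# of one condition, selected by its type.
hc_condition() {
  local name="$1" type="${2:-Complete}" namespace="${3:-default}"
  kubectl get hc "$name" -n "$namespace" \
    -o jsonpath="{.status.conditions[?(@.type==\"$type\")].status}"
}
```

Running `hc_condition check-pod-health` prints the status of the `Complete` condition, or nothing if the condition has not been set yet.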
Viewing Results¶
List all checks in a namespace:
# Using full name
kubectl get healthchecks -n default
# Using short name
kubectl get hc -n default
# All namespaces
kubectl get hc --all-namespaces
View detailed check results:
# Describe shows full status including rationale
kubectl describe hc check-pod-health
# Get status as YAML
kubectl get hc check-pod-health -o yaml
Check specific fields:
# View just the result
kubectl get hc check-pod-health -o jsonpath='{.status.result}'
# View the message
kubectl get hc check-pod-health -o jsonpath='{.status.message}'
# View the rationale
kubectl get hc check-pod-health -o jsonpath='{.status.rationale}'
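To see several status fields for every check at once, `kubectl`'s `custom-columns` output can be pointed at the status paths documented above. In this sketch, `hc_summary` and the column headings are illustrative; the default `kubectl get hc` output may already show some of these fields.

```shell
# hc_summary [NAMESPACE]
# Hypothetical helper: one row per HealthCheck with its phase, result,
# and duration taken from the documented status fields.
hc_summary() {
  local namespace="${1:-default}"
  kubectl get hc -n "$namespace" \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,RESULT:.status.result,DURATION:.status.duration'
}
```

Checks that have not produced a value for a field show `<none>` in that column.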
Re-running Checks¶
To re-execute a check, add the rerun annotation:
This triggers a new execution while preserving the original resource. The status will be updated with new results.
Practical Examples¶
Check Deployment Replicas¶
apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: check-api-replicas
spec:
  query: "Is the api deployment in production running at full capacity? Check replica count, pod status, resource usage, and error logs."
  timeout: 30
Check Pod Status¶
apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: check-crashlooping-pods
spec:
  query: "Are any pods in production failing to start or restarting frequently? Check logs and events for the root cause."
  timeout: 45
Check Resource Usage¶
apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: check-node-memory
spec:
  query: "Are any cluster nodes under memory or CPU pressure? Check resource usage trends and flag anything approaching capacity."
  timeout: 60
Check with Alert¶
apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: check-critical-pods
spec:
  query: "Are all tier=critical pods in production healthy? Check pod status, resource pressure, error rates, and logs for anomalies."
  timeout: 60
  mode: alert
  destinations:
    - type: slack
      config:
        channel: "#critical-alerts"
Custom Model for Cost Control¶
apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: check-with-cheaper-model
spec:
  query: "Is the staging namespace healthy? Check for pod failures, high resource usage, and errors in the logs."
  model: "anthropic/claude-sonnet-4-5-20250929"
  timeout: 30
Labels and Selectors¶
Use labels to organize and query HealthChecks:
apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: frontend-check
  labels:
    app: frontend
    environment: production
    team: platform
spec:
  query: "Is the frontend deployment healthy? Check pod status, resource usage, and recent logs."
Query by labels:
# Find all production checks
kubectl get hc -l environment=production
# Find checks for a specific app
kubectl get hc -l app=frontend
# Find checks by team
kubectl get hc -l team=platform
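Label selectors can be combined with a JSONPath filter to list only the checks that reported a problem. The sketch below assumes `result` is stored at `.status.result` as described in Status Fields; `failing_checks` is a hypothetical helper name.

```shell
# failing_checks SELECTOR
# Hypothetical helper: prints namespace/name for every HealthCheck that
# matches the label selector and whose result is "fail".
failing_checks() {
  local selector="$1"
  kubectl get hc --all-namespaces -l "$selector" \
    -o jsonpath='{range .items[?(@.status.result=="fail")]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'
}
```

For instance, `failing_checks environment=production` lists the failing production checks across all namespaces, one per line.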
Next Steps¶
- Scheduled Health Checks - Set up recurring checks with cron schedules
- Alert Destinations - Configure Slack and PagerDuty notifications
- Configuration - Advanced configuration options