Health Checks¶

A HealthCheck resource runs a single, one-time health check in Kubernetes. When you create a HealthCheck, the Holmes Operator executes it immediately through the Holmes API and stores the result in the resource's status.

What is a HealthCheck?¶

A HealthCheck is a Kubernetes Custom Resource that:

Runs immediately when created
Executes a natural language query using an LLM
Stores results (pass/fail/error) in its status
Can optionally send alerts to configured destinations
Maintains an audit trail of check execution
Can be re-run on demand using annotations

Creating a Simple Health Check¶

The simplest HealthCheck requires only a natural language query:

apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: check-pod-health
  namespace: default
spec:
  query: "Are all pods in namespace 'default' healthy and running?"

Apply this check and view its status:

# Create the check
kubectl apply -f healthcheck.yaml

# View check status (short name: hc)
kubectl get hc

# Get detailed results
kubectl describe hc check-pod-health
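
Because the check runs after the resource is created, a script may need to wait for it to finish before reading the results. The following is a minimal sketch, assuming the operator sets status.phase to Completed or Failed as described under Status Fields. The wait_for_hc helper and its arguments are hypothetical and not part of Holmes.

```shell
# Poll a HealthCheck until its phase is Completed or Failed.
# Usage: wait_for_hc NAME [NAMESPACE] [MAX_TRIES] [SLEEP_SECONDS]
# Prints the final phase. Returns 1 on timeout, 2 if kubectl itself fails.
wait_for_hc() {
  hc_name=$1
  hc_ns=${2:-default}
  hc_tries=${3:-30}
  hc_sleep=${4:-2}
  hc_i=0
  while [ "$hc_i" -lt "$hc_tries" ]; do
    hc_phase=$(kubectl get hc "$hc_name" -n "$hc_ns" \
      -o jsonpath='{.status.phase}') || return 2
    case $hc_phase in
      Completed|Failed)
        echo "$hc_phase"
        return 0
        ;;
    esac
    hc_i=$((hc_i + 1))
    sleep "$hc_sleep"
  done
  echo "timed out waiting for $hc_name" >&2
  return 1
}
```

For example, `wait_for_hc check-pod-health default` polls every 2 seconds, up to 30 times.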

Health Check with Alert Mode¶

To send notifications when a check fails, use alert mode with destinations:

apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: frontend-deployment-check
  namespace: production
spec:
  query: "Is the 'frontend' deployment healthy with at least 3 ready replicas?"
  timeout: 60
  mode: alert
  destinations:
    - type: slack
      config:
        channel: "#production-alerts"

Spec Fields Reference¶

Required Fields¶

query (string, required)

Natural language question about system health. The LLM will analyze your cluster and answer this question.

Min length: 1 character
Max length: 5000 characters
Example: "Are all pods in deployment 'api' ready and not crash-looping?"

Optional Fields¶

timeout (integer, optional)

Maximum execution time in seconds before the check is terminated.

Default: 30 seconds
Minimum: 1 second
Maximum: 300 seconds (5 minutes)
Example: timeout: 120
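
The documented limits on query and timeout can be checked locally before a manifest is applied. This is a sketch only: validate_hc_spec is a hypothetical helper, and the API server remains the authority on what is accepted. Note that `${#var}` counts characters in the current locale, which may differ from how the CRD counts length.

```shell
# Validate spec values against the limits documented above.
# Usage: validate_hc_spec QUERY [TIMEOUT]
# Returns 0 if both values are within range, 1 otherwise.
validate_hc_spec() {
  v_query=$1
  v_timeout=${2:-30}   # documented default timeout
  v_len=${#v_query}
  if [ "$v_len" -lt 1 ] || [ "$v_len" -gt 5000 ]; then
    echo "query must be 1-5000 characters (got $v_len)" >&2
    return 1
  fi
  # Reject anything that is not a plain non-negative integer.
  case $v_timeout in
    ''|*[!0-9]*)
      echo "timeout must be a whole number of seconds" >&2
      return 1
      ;;
  esac
  if [ "$v_timeout" -lt 1 ] || [ "$v_timeout" -gt 300 ]; then
    echo "timeout must be between 1 and 300 seconds (got $v_timeout)" >&2
    return 1
  fi
  return 0
}
```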

mode (string, optional)

Execution mode that determines whether alerts are sent:

monitor (default): Results are stored but no alerts are sent
alert: Sends notifications to configured destinations on check failure

model (string, optional)

Override the default LLM model for this specific check. Useful for testing different models or controlling costs.

Example: model: "anthropic/claude-sonnet-4-5-20250929"
See AI Providers for supported models

destinations (array, optional)

List of alert destinations. Only used when mode: alert.

Each destination requires:

type: Destination type (e.g., "slack", "pagerduty")
config: Destination-specific configuration object

Example:

destinations:
  - type: slack
    config:
      channel: "#alerts"
  - type: pagerduty
    config:
      integration_key: "your-integration-key"
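
When many alerting checks differ only in name, query, and channel, the manifest can be generated rather than written by hand. The sketch below prints a manifest using only the fields documented on this page. render_alert_check is a hypothetical helper, and it does not escape double quotes inside the query or channel, so keep those values free of `"` characters.

```shell
# Print a HealthCheck manifest in alert mode with one Slack destination.
# Usage: render_alert_check NAME NAMESPACE QUERY CHANNEL
render_alert_check() {
  cat <<EOF
apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: $1
  namespace: $2
spec:
  query: "$3"
  mode: alert
  destinations:
    - type: slack
      config:
        channel: "$4"
EOF
}
```

The output can be piped straight to the cluster, for example `render_alert_check check-critical-pods production "Are all pods ready?" "#critical-alerts" | kubectl apply -f -`.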

Status Fields¶

After execution, the HealthCheck status contains:

Execution Tracking¶

phase (string)

Current execution state:

Pending: Check created, waiting to start
Running: Check execution in progress
Completed: Check execution finished (the outcome is reported in result)
Failed: Check execution failed due to error

startTime (timestamp)

ISO 8601 timestamp when execution started.

completionTime (timestamp)

ISO 8601 timestamp when execution finished.

duration (number)

Total execution time in seconds.

Results¶

result (string)

The check outcome:

pass: System is healthy based on the query
fail: System is unhealthy based on the query
error: Execution failed (network issue, timeout, etc.)
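
In a CI pipeline or deployment gate, the three result values map naturally onto exit codes. A minimal sketch, assuming status.result holds exactly one of the values listed above once the check has finished; hc_exit_code and its exit-code convention are hypothetical choices, not something Holmes defines.

```shell
# Translate a HealthCheck result into an exit code.
# Usage: hc_exit_code NAME [NAMESPACE]
# Returns 0 for pass, 1 for fail, 2 for error, 3 if there is no result yet
# or kubectl itself fails.
hc_exit_code() {
  r_result=$(kubectl get hc "$1" -n "${2:-default}" \
    -o jsonpath='{.status.result}') || return 3
  case $r_result in
    pass)  return 0 ;;
    fail)  return 1 ;;
    error) return 2 ;;
    *)
      echo "no result yet for $1" >&2
      return 3
      ;;
  esac
}
```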

message (string)

Brief human-readable summary of the result.

Example: "All 3 replicas of 'frontend' deployment are ready"

rationale (string)

Detailed LLM explanation of the decision, including evidence and reasoning.

error (string)

Error details, populated when phase is Failed or result is error.

modelUsed (string)

The actual LLM model used for execution.

Notifications¶

notifications (array)

Status of alert delivery attempts when using mode: alert:

notifications:
  - type: slack
    channel: "#alerts"
    status: sent  # sent, failed, or skipped
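
To spot alerts that did not go out, the notifications array can be flattened with a kubectl JSONPath range expression. This sketch assumes each entry has the type and status fields shown above; the helper names are hypothetical.

```shell
# Print one "type<TAB>status" line per notification entry.
# Usage: hc_notifications NAME [NAMESPACE]
hc_notifications() {
  kubectl get hc "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.notifications[*]}{.type}{"\t"}{.status}{"\n"}{end}'
}

# Print the destination types whose delivery status is "failed".
hc_failed_notifications() {
  hc_notifications "$@" | awk -F '\t' '$2 == "failed" { print $1 }'
}
```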

Conditions¶

Standard Kubernetes conditions track the check lifecycle:

conditions:
  - type: Complete
    status: "True"
    lastTransitionTime: "2024-01-01T00:00:00Z"
    reason: Pass
    message: "Check completed successfully"
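
Because these are standard conditions, `kubectl wait` can block until the check finishes instead of polling. A sketch, assuming the operator sets the Complete condition to "True" as in the example above; whether it does so for every outcome is not stated here, so the timeout acts as a safety net. wait_hc_complete is a hypothetical wrapper.

```shell
# Block until the HealthCheck reports condition Complete=True.
# Usage: wait_hc_complete NAME [NAMESPACE] [TIMEOUT_SECONDS]
wait_hc_complete() {
  kubectl wait "hc/$1" -n "${2:-default}" \
    --for=condition=Complete --timeout="${3:-120}s"
}
```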

Viewing Results¶

List all checks in a namespace:

# Using full name
kubectl get healthchecks -n default

# Using short name
kubectl get hc -n default

# All namespaces
kubectl get hc --all-namespaces

View detailed check results:

# Describe shows full status including rationale
kubectl describe hc check-pod-health

# Get status as YAML
kubectl get hc check-pod-health -o yaml

Check specific fields:

# View just the result
kubectl get hc check-pod-health -o jsonpath='{.status.result}'

# View the message
kubectl get hc check-pod-health -o jsonpath='{.status.message}'

# View the rationale
kubectl get hc check-pod-health -o jsonpath='{.status.rationale}'
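
For an overview of several checks at once, the same status fields can be shown as table columns with kubectl's custom-columns output. A small sketch; hc_summary is a hypothetical wrapper, and any extra arguments (such as -n or --all-namespaces) are passed through to kubectl.

```shell
# List HealthChecks with their phase, result, and message in one table.
# Usage: hc_summary [kubectl get flags...]
hc_summary() {
  kubectl get hc "$@" \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,RESULT:.status.result,MESSAGE:.status.message'
}
```

For example, `hc_summary -n default` or `hc_summary --all-namespaces`.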

Re-running Checks¶

To re-execute a check, add the rerun annotation:

kubectl annotate hc check-pod-health holmesgpt.dev/rerun=true --overwrite

This triggers a new execution while preserving the original resource. The status will be updated with new results.
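
A script that re-runs a check usually wants the new result, not the previous one. The sketch below records status.completionTime, adds the annotation, and then polls until that timestamp changes. It assumes the operator writes a fresh completionTime on every run, which this page does not state explicitly; rerun_hc is a hypothetical helper.

```shell
# Re-run a HealthCheck and print the result of the new execution.
# Usage: rerun_hc NAME [NAMESPACE] [MAX_TRIES] [SLEEP_SECONDS]
# Returns 1 on timeout, 2 if kubectl itself fails.
rerun_hc() {
  rr_name=$1
  rr_ns=${2:-default}
  rr_tries=${3:-60}
  rr_sleep=${4:-2}
  # Remember when the previous run finished (empty if it never did).
  rr_before=$(kubectl get hc "$rr_name" -n "$rr_ns" \
    -o jsonpath='{.status.completionTime}') || return 2
  kubectl annotate hc "$rr_name" -n "$rr_ns" \
    holmesgpt.dev/rerun=true --overwrite >/dev/null || return 2
  rr_i=0
  while [ "$rr_i" -lt "$rr_tries" ]; do
    rr_after=$(kubectl get hc "$rr_name" -n "$rr_ns" \
      -o jsonpath='{.status.completionTime}') || return 2
    # A non-empty, changed timestamp means a new run has completed.
    if [ -n "$rr_after" ] && [ "$rr_after" != "$rr_before" ]; then
      kubectl get hc "$rr_name" -n "$rr_ns" -o jsonpath='{.status.result}'
      return 0
    fi
    rr_i=$((rr_i + 1))
    sleep "$rr_sleep"
  done
  echo "timed out waiting for rerun of $rr_name" >&2
  return 1
}
```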

Practical Examples¶

Check Deployment Replicas¶

apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: check-api-replicas
spec:
  query: "Does the 'api' deployment in the 'production' namespace have at least 5 ready replicas?"
  timeout: 30

Check Pod Status¶

apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: check-crashlooping-pods
spec:
  query: "Are all pods in the 'production' namespace free of CrashLoopBackOff?"
  timeout: 45

Check Resource Usage¶

apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: check-node-memory
spec:
  query: "Are all nodes in the cluster at or below 90% memory usage?"
  timeout: 60

Check with Alert¶

apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: check-critical-pods
spec:
  query: "Are all pods with label 'tier=critical' in the 'production' namespace running and ready?"
  timeout: 60
  mode: alert
  destinations:
    - type: slack
      config:
        channel: "#critical-alerts"

Custom Model for Cost Control¶

apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: check-with-cheaper-model
spec:
  query: "Are all pods in 'staging' namespace healthy?"
  model: "anthropic/claude-sonnet-4-5-20250929"
  timeout: 30

Labels and Selectors¶

Use labels to organize and query HealthChecks:

apiVersion: holmesgpt.dev/v1alpha1
kind: HealthCheck
metadata:
  name: frontend-check
  labels:
    app: frontend
    environment: production
    team: platform
spec:
  query: "Is the frontend deployment healthy?"

Query by labels:

# Find all production checks
kubectl get hc -l environment=production

# Find checks for a specific app
kubectl get hc -l app=frontend

# Find checks by team
kubectl get hc -l team=platform
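
Label selectors combine with a JSONPath filter to list only the checks that failed. A sketch, assuming status.result is "fail" for unhealthy checks as described under Results; failing_hc is a hypothetical helper that passes its arguments through to kubectl.

```shell
# Print "namespace/name" for every selected HealthCheck whose result is fail.
# Usage: failing_hc [kubectl get flags...]
failing_hc() {
  kubectl get hc "$@" \
    -o jsonpath='{range .items[?(@.status.result=="fail")]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'
}
```

For example, `failing_hc -l environment=production --all-namespaces` lists failing production checks across the cluster.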

Next Steps¶

Scheduled Health Checks - Set up recurring checks with cron schedules
Alert Destinations - Configure Slack and PagerDuty notifications
Configuration - Advanced configuration options