Triggered Health Checks¶

A TriggeredHealthCheck runs an investigation automatically when a Deployment rolls out a new version — no per-deploy wiring, no CI polling. Declare it once, and every rollout of a matching Deployment (from CI, Argo, kubectl set image, or a rollback) fires a check.

It is the event-driven sibling of the ScheduledHealthCheck: both are self-contained (they embed the check definition inline) and both spawn a HealthCheck per run, which becomes the execution record.

Alpha

TriggeredHealthCheck currently supports a single trigger type — deploymentRollout. More event sources (pod crashloops, failed Jobs, alerts) are planned.

How it works¶

The operator watches Deployments in namespaces where TriggeredHealthCheck resources exist.
When a matching Deployment's pod template changes (a rollout), the operator waits delaySeconds (default 5 minutes) and then runs the check. The wait gives the rollout time to finish and gives any crashes or errors time to show up.
It creates a HealthCheck (owned by the trigger) with your query, having substituted the rollout context into it.
Holmes investigates using every connected data source; in alert mode it notifies your destinations on failure.

Example¶

apiVersion: holmesgpt.dev/v1alpha1
kind: TriggeredHealthCheck
metadata:
  name: verify-checkout-rollouts
  namespace: production
spec:
  deploymentRollout:
    selector:
      matchLabels:
        app: checkout-api
  delaySeconds: 300         # wait 5m after the rollout, then check (default)
  cooldownSeconds: 600      # don't re-fire for the same Deployment within 10m
  query: |
    checkout-api was rolled out to {{ .new.image }} (was {{ .old.image }}).
    Compare error rates, latency, restarts, and logs before vs after the rollout
    and flag any regressions.
  timeout: 120
  mode: alert
  destinations:
    - type: slack
      config:
        channel: "#deploy-alerts"

Apply it once:

kubectl apply -f triggeredhealthcheck.yaml

# List triggers (short name: thc)
kubectl get thc

# See fire history and the HealthChecks each rollout produced
kubectl describe thc verify-checkout-rollouts
kubectl get hc -l holmesgpt.dev/triggered-by=verify-checkout-rollouts

Query and rollout context¶

The rollout facts are always prepended to the query the check runs, so even a terse query like "Is the new version healthy?" gives the model what it needs:

This health check was triggered automatically by a Kubernetes Deployment rollout. Use this context when investigating:
- Deployment: checkout-api
- Namespace: production
- Previous image(s): myregistry/checkout-api:v2.4.0
- New image(s): myregistry/checkout-api:v2.4.1

<your query>

You can also reference the same facts inline in your query with these tokens:

Token	Replaced with
`{{ .deployment }}`	Name of the Deployment that rolled out
`{{ .namespace }}`	Its namespace
`{{ .old.image }}`	Container image(s) before the rollout
`{{ .new.image }}`	Container image(s) after the rollout

Spec reference¶

Field	Default	Description
`enabled`	`true`	Whether the trigger is active
`deploymentRollout.selector.matchLabels`	`{}`	Deployment labels that must all match. Empty matches every Deployment in the namespace.
`delaySeconds`	`300`	How long to wait after a rollout before running the check. Default 5 minutes; `0` checks immediately; `86400` checks a day later (max 7 days). See How long to wait.
`cooldownSeconds`	`0`	Suppress re-firing for the same Deployment within this window. `0` disables.
`query`	—	Natural-language investigation (supports the tokens above). Required.
`timeout`	`120`	Check execution timeout in seconds.
`mode`	`monitor`	`alert` notifies destinations on failure; `monitor` only records the result.
`model`	—	Override the default LLM model for this check.
`destinations`	`[]`	Alert destinations (used in `alert` mode). See Destinations.

How long to wait¶

There is one knob: delaySeconds — how long to wait after a rollout before running the check. That's it.

The check does not run the instant a new version is deployed, because a brand-new rollout hasn't finished and problems haven't surfaced yet. So Holmes waits delaySeconds first, then looks. Pick a value that fits what you want to catch:

300 (default, 5 minutes) — enough time for the rollout to finish and for crashes or startup errors to appear.
0 — check right away (useful if you only care that the deploy was accepted).
86400 (a day) — catch slower problems like memory leaks or resource creep that only show up after the version has been running a while.

If you set the wait too short for your app's rollout, the check may run while pods are still starting — Holmes will simply report that the rollout hasn't finished yet.

The wait is saved on the resource, so it still completes if the operator restarts — even a day-long wait. If the same Deployment is rolled out again while a check is still waiting, the waiting check is replaced so only the newest version is checked.

A common setup is one trigger with the default 5-minute wait, plus a second with delaySeconds: 86400 to re-check the same rollout a day later.

Notes & limitations¶

Rollout = pod-template change. Scaling and HPA changes (which only touch spec.replicas) do not fire the trigger; only changes to the pod template do.
Restart behavior. A check that was already scheduled before a restart still runs (the pending fire is persisted in status.pending). Detecting new rollouts, however, relies on an in-memory baseline of each Deployment's last-seen pod template (the operator does not annotate your Deployments). After a restart the first observation of each Deployment just re-establishes that baseline, so a rollout that happens during the restart window is not detected. Use a ScheduledHealthCheck for continuous coverage.
Cost. Every fire is at least one LLM call. Use cooldownSeconds and a specific selector to bound spend on busy namespaces.

Next Steps¶

Deployment Verification — patterns for gating and verifying deploys
Health Checks — the one-time checks this spawns
Alert Destinations — Slack and PagerDuty configuration