Triggered Health Checks
A TriggeredHealthCheck runs an investigation automatically when a Deployment rolls
out a new version — no per-deploy wiring, no CI polling. Declare it once, and every
rollout of a matching Deployment (from CI, Argo, kubectl set image, or a rollback)
fires a check.
It is the event-driven sibling of the ScheduledHealthCheck: both are self-contained (they embed the check definition inline) and both spawn a HealthCheck per run, which becomes the execution record.
Alpha
TriggeredHealthCheck currently supports a single trigger type — deploymentRollout.
More event sources (pod crashloops, failed Jobs, alerts) are planned.
How it works
- The operator watches Deployments in namespaces where TriggeredHealthCheck resources exist.
- When a matching Deployment's pod template changes (a rollout), the operator waits delaySeconds (default 5 minutes) and then runs the check. The wait gives the rollout time to finish and gives any crashes or errors time to show up.
- It creates a HealthCheck (owned by the trigger) with your query, having substituted the rollout context into it.
- Holmes investigates using every connected data source; in alert mode it notifies your destinations on failure.
Example
apiVersion: holmesgpt.dev/v1alpha1
kind: TriggeredHealthCheck
metadata:
  name: verify-checkout-rollouts
  namespace: production
spec:
  deploymentRollout:
    selector:
      matchLabels:
        app: checkout-api
  delaySeconds: 300      # wait 5m after the rollout, then check (default)
  cooldownSeconds: 600   # don't re-fire for the same Deployment within 10m
  query: |
    checkout-api was rolled out to {{ .new.image }} (was {{ .old.image }}).
    Compare error rates, latency, restarts, and logs before vs after the rollout
    and flag any regressions.
  timeout: 120
  mode: alert
  destinations:
    - type: slack
      config:
        channel: "#deploy-alerts"
Apply it once:
kubectl apply -f triggeredhealthcheck.yaml
# List triggers (short name: thc)
kubectl get thc
# See fire history and the HealthChecks each rollout produced
kubectl describe thc verify-checkout-rollouts
kubectl get hc -l holmesgpt.dev/triggered-by=verify-checkout-rollouts
Query and rollout context
The rollout facts are always prepended to the query the check runs, so even a terse
query like "Is the new version healthy?" gives the model what it needs:
This health check was triggered automatically by a Kubernetes Deployment rollout. Use this context when investigating:
- Deployment: checkout-api
- Namespace: production
- Previous image(s): myregistry/checkout-api:v2.4.0
- New image(s): myregistry/checkout-api:v2.4.1
<your query>
You can also reference the same facts inline in your query with these tokens:
| Token | Replaced with |
|---|---|
| {{ .deployment }} | Name of the Deployment that rolled out |
| {{ .namespace }} | Its namespace |
| {{ .old.image }} | Container image(s) before the rollout |
| {{ .new.image }} | Container image(s) after the rollout |
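As a rough illustration, the tokens let a single trigger cover every Deployment in a namespace without hard-coding names. The sketch below assumes the field nesting used in the Example section; the resource name `verify-all-rollouts` is hypothetical, and every omitted field falls back to the default listed in the spec reference.

```yaml
apiVersion: holmesgpt.dev/v1alpha1
kind: TriggeredHealthCheck
metadata:
  name: verify-all-rollouts   # hypothetical name
  namespace: production
spec:
  deploymentRollout:
    selector:
      matchLabels: {}         # empty: matches every Deployment in the namespace
  # delaySeconds, cooldownSeconds, timeout and mode are omitted,
  # so the documented defaults apply (300, 0, 120, monitor).
  query: |
    {{ .deployment }} in {{ .namespace }} was rolled out to {{ .new.image }}
    (was {{ .old.image }}). Is the new version healthy?
```

For the checkout-api rollout shown in the context block above, {{ .deployment }} would be replaced with checkout-api and {{ .namespace }} with production.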
Spec reference
| Field | Default | Description |
|---|---|---|
| enabled | true | Whether the trigger is active |
| deploymentRollout.selector.matchLabels | {} | Deployment labels that must all match. Empty matches every Deployment in the namespace. |
| delaySeconds | 300 | How long to wait after a rollout before running the check. Default 5 minutes; 0 checks immediately; 86400 checks a day later (max 7 days). See How long to wait. |
| cooldownSeconds | 0 | Suppress re-firing for the same Deployment within this window. 0 disables. |
| query | (none) | Natural-language investigation (supports the tokens above). Required. |
| timeout | 120 | Check execution timeout in seconds. |
| mode | monitor | alert notifies destinations on failure; monitor only records the result. |
| model | (none) | Override the default LLM model for this check. |
| destinations | [] | Alert destinations (used in alert mode). See Destinations. |
How long to wait
There is one knob: delaySeconds — how long to wait after a rollout before running
the check. That's it.
The check does not run the instant a new version is deployed, because a brand-new
rollout hasn't finished and problems haven't surfaced yet. So Holmes waits delaySeconds
first, then looks. Pick a value that fits what you want to catch:
- 300 (default, 5 minutes): enough time for the rollout to finish and for crashes or startup errors to appear.
- 0: check right away (useful if you only care that the deploy was accepted).
- 86400 (a day): catch slower problems like memory leaks or resource creep that only show up after the version has been running a while.
If you set the wait too short for your app's rollout, the check may run while pods are still starting — Holmes will simply report that the rollout hasn't finished yet.
The wait is saved on the resource, so it still completes if the operator restarts — even a day-long wait. If the same Deployment is rolled out again while a check is still waiting, the waiting check is replaced so only the newest version is checked.
A common setup is one trigger with the default 5-minute wait, plus a second with
delaySeconds: 86400 to re-check the same rollout a day later.
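A minimal sketch of that two-trigger setup, assuming the field nesting from the Example section. The resource names and query wording are hypothetical; only delaySeconds differs in a way that matters.

```yaml
# Trigger 1: check shortly after the rollout (the default 5-minute wait).
apiVersion: holmesgpt.dev/v1alpha1
kind: TriggeredHealthCheck
metadata:
  name: checkout-rollout-5m    # hypothetical name
  namespace: production
spec:
  deploymentRollout:
    selector:
      matchLabels:
        app: checkout-api
  delaySeconds: 300
  query: |
    Did the rollout of {{ .deployment }} to {{ .new.image }} finish cleanly?
    Look for crashes, restarts, and startup errors.
---
# Trigger 2: re-check the same rollout a day later.
apiVersion: holmesgpt.dev/v1alpha1
kind: TriggeredHealthCheck
metadata:
  name: checkout-rollout-1d    # hypothetical name
  namespace: production
spec:
  deploymentRollout:
    selector:
      matchLabels:
        app: checkout-api
  delaySeconds: 86400
  query: |
    {{ .deployment }} has been running {{ .new.image }} for about a day.
    Look for memory growth, resource creep, or slowly rising error rates
    compared with {{ .old.image }}.
```

Because a waiting check is replaced when the same Deployment rolls out again, the day-later check should only run for versions that stay deployed for a full day.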
Notes & limitations
- Rollout = pod-template change. Scaling and HPA changes (which only touch spec.replicas) do not fire the trigger; only changes to the pod template do.
- Restart behavior. A check that was already scheduled before a restart still runs (the pending fire is persisted in status.pending). Detecting new rollouts, however, relies on an in-memory baseline of each Deployment's last-seen pod template (the operator does not annotate your Deployments). After a restart the first observation of each Deployment just re-establishes that baseline, so a rollout that happens during the restart window is not detected. Use a ScheduledHealthCheck for continuous coverage.
- Cost. Every fire is at least one LLM call. Use cooldownSeconds and a specific selector to bound spend on busy namespaces.
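The cost note above could translate into a manifest along these lines. This is only a sketch under the same nesting assumption as the Example section; the resource name and the one-hour cooldown are illustrative, not recommended values.

```yaml
apiVersion: holmesgpt.dev/v1alpha1
kind: TriggeredHealthCheck
metadata:
  name: verify-checkout-rollouts-budgeted   # hypothetical name
  namespace: production
spec:
  deploymentRollout:
    selector:
      matchLabels:
        app: checkout-api    # specific selector: only this app fires the trigger
  cooldownSeconds: 3600      # suppress re-fires for the same Deployment for an hour
  query: |
    Is {{ .deployment }} healthy after rolling out {{ .new.image }}?
```

A narrow selector limits which Deployments can fire the trigger, while the cooldown limits how often any one of them can.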
Next Steps
- Deployment Verification — patterns for gating and verifying deploys
- Health Checks — the one-time checks this spawns
- Alert Destinations — Slack and PagerDuty configuration