OpenTelemetry Observability¶
HolmesGPT includes built-in OpenTelemetry (OTel) instrumentation that produces distributed traces and metrics for every investigation. This enables end-to-end observability from user prompt through LLM calls and MCP tool execution.
Enabling OpenTelemetry¶
OTel instrumentation activates automatically when the OTEL_EXPORTER_OTLP_ENDPOINT environment variable is set. No code changes or flags are needed.
CLI¶
# Run with OTel enabled
export OTEL_EXPORTER_OTLP_ENDPOINT="http://your-otel-collector:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="holmesgpt"
holmes ask "Why is my pod crashing?"
Helm / Kubernetes¶
Add the following to your Helm values.yaml:
additionalEnvVars:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "http://otel-collector.monitoring.svc:4317"
- name: OTEL_EXPORTER_OTLP_PROTOCOL
value: "grpc"
- name: OTEL_SERVICE_NAME
value: "holmesgpt"
# Optional: for backends requiring auth (e.g., Dynatrace, Grafana Cloud)
- name: OTEL_EXPORTER_OTLP_HEADERS
value: "Authorization=Api-Token YOUR_TOKEN"
Environment Variables¶
| Variable | Description | Default |
|---|---|---|
OTEL_EXPORTER_OTLP_ENDPOINT |
OTLP collector endpoint (enables OTel when set) | (unset — OTel disabled) |
OTEL_EXPORTER_OTLP_PROTOCOL |
Export protocol (grpc or http/protobuf) |
grpc |
OTEL_SERVICE_NAME |
Service name in traces/metrics | holmesgpt |
OTEL_EXPORTER_OTLP_HEADERS |
Headers for OTLP exporter (e.g., auth tokens) | (none) |
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT |
Override metrics endpoint (if different from traces) | (uses OTEL_EXPORTER_OTLP_ENDPOINT) |
When OTEL_EXPORTER_OTLP_ENDPOINT is not set, HolmesGPT uses a no-op DummyTracer with zero overhead.
Distributed Traces¶
Every investigation produces a trace hierarchy:
holmesgpt.investigation (root span)
├── gen_ai.chat (LLM iteration 0 — includes token counts)
│ └── POST (auto-instrumented httpx → LLM provider)
├── holmesgpt.tool.<name> (tool/MCP call)
│ └── POST (auto-instrumented httpx → MCP server)
│ └── MCP server spans (execute_tool, k8s.api/*, etc.)
├── gen_ai.chat (LLM iteration 1)
│ └── ...
└── gen_ai.chat (final iteration — produces answer)
Span Attributes¶
Investigation span (holmesgpt.investigation):
- holmesgpt.investigation.question — the user's question
- holmesgpt.investigation.stream — whether streaming was used
LLM spans (gen_ai.chat):
- gen_ai.system — LLM provider (litellm)
- gen_ai.request.model — model name
- gen_ai.usage.input_tokens — prompt tokens
- gen_ai.usage.output_tokens — completion tokens
- gen_ai.usage.total_tokens — total tokens
- holmesgpt.iteration — iteration number (0-based)
Tool spans (holmesgpt.tool.<name>):
- holmesgpt.tool.name — tool name
- holmesgpt.tool.status — result status (success or error)
MCP Trace Propagation¶
When HolmesGPT calls MCP tools over HTTP, trace context is automatically propagated via W3C traceparent headers (using httpx auto-instrumentation). MCP servers that support OpenTelemetry will create child spans linked to the same trace.
Metrics¶
HolmesGPT exports the following OTel metrics via OTLP. All metrics use underscore-delimited attribute keys for maximum compatibility across backends.
Token Usage¶
| Metric | Type | Unit | Description |
|---|---|---|---|
gen_ai.client.token.usage |
Counter | {token} |
LLM token consumption |
Dimensions: gen_ai_request_model, gen_ai_system, gen_ai_token_type (input or output)
Investigation Metrics¶
| Metric | Type | Unit | Description |
|---|---|---|---|
holmesgpt.investigation.count |
Counter | {investigation} |
Number of investigations started |
holmesgpt.investigation.duration |
Histogram | s |
End-to-end investigation duration |
holmesgpt.investigation.iterations |
Histogram | {iteration} |
LLM iterations per investigation |
Dimensions: gen_ai_request_model
LLM Call Metrics¶
| Metric | Type | Unit | Description |
|---|---|---|---|
gen_ai.client.operation.duration |
Histogram | s |
Individual LLM call latency |
Dimensions: gen_ai_request_model, gen_ai_system
Tool / MCP Call Metrics¶
| Metric | Type | Unit | Description |
|---|---|---|---|
holmesgpt.tool.call.count |
Counter | {call} |
Number of tool calls |
holmesgpt.tool.call.duration |
Histogram | s |
Tool call latency |
holmesgpt.tool.call.errors |
Counter | {error} |
Failed tool calls |
Dimensions: holmesgpt_tool_name
Example Queries¶
Dynatrace DQL:
# Tool call duration by tool name
timeseries avg(holmesgpt.tool.call.duration, default:0), by: {holmesgpt_tool_name}
# Token usage by model
timeseries sum(gen_ai.client.token.usage, default:0), by: {gen_ai_request_model, gen_ai_token_type}
# Investigation count over time
timeseries sum(holmesgpt.investigation.count, default:0)
# Slowest tools (p95)
timeseries percentile(holmesgpt.tool.call.duration, 95), by: {holmesgpt_tool_name}
PromQL (Grafana / Prometheus):
# Tool call rate by tool name
rate(holmesgpt_tool_call_count_total[5m])
# Average LLM call duration
rate(gen_ai_client_operation_duration_sum[5m]) / rate(gen_ai_client_operation_duration_count[5m])
Backend Examples¶
Dynatrace (direct OTLP)¶
additionalEnvVars:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "https://YOUR_ENV.live.dynatrace.com/api/v2/otlp"
- name: OTEL_EXPORTER_OTLP_PROTOCOL
value: "grpc"
- name: OTEL_SERVICE_NAME
value: "holmesgpt"
- name: OTEL_EXPORTER_OTLP_HEADERS
value: "Authorization=Api-Token YOUR_DT_TOKEN"
OTel Collector (self-hosted)¶
additionalEnvVars:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "http://otel-collector.monitoring.svc:4317"
- name: OTEL_EXPORTER_OTLP_PROTOCOL
value: "grpc"
- name: OTEL_SERVICE_NAME
value: "holmesgpt"
Grafana Cloud¶
additionalEnvVars:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "https://otlp-gateway-prod-us-east-0.grafana.net/otlp"
- name: OTEL_EXPORTER_OTLP_PROTOCOL
value: "grpc"
- name: OTEL_SERVICE_NAME
value: "holmesgpt"
- name: OTEL_EXPORTER_OTLP_HEADERS
value: "Authorization=Basic BASE64_ENCODED_INSTANCE:TOKEN"
Architecture¶
┌──────────────────────────────────────────────────────┐
│ HolmesGPT │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Tracer │ │ Metrics │ │ httpx auto-instr │ │
│ │ (traces) │ │(counters, │ │ (W3C traceparent)│ │
│ │ │ │histograms)│ │ │ │
│ └────┬─────┘ └────┬─────┘ └────────┬─────────┘ │
│ │ │ │ │
└───────┼──────────────┼─────────────────┼─────────────┘
│ │ │
│ OTLP/gRPC │ │ W3C traceparent
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ OTel Collector │ │ MCP Servers │
│ or Direct OTLP │ │ (child spans) │
└────────┬────────┘ └─────────────────┘
│
▼
┌─────────────────┐
│ Backend │
│ (Dynatrace, │
│ Grafana, etc.) │
└─────────────────┘