OpenTelemetry Observability

HolmesGPT includes built-in OpenTelemetry (OTel) instrumentation that produces distributed traces and metrics for every investigation. This enables end-to-end observability from user prompt through LLM calls and MCP tool execution.

Enabling OpenTelemetry

OTel instrumentation activates automatically when the OTEL_EXPORTER_OTLP_ENDPOINT environment variable is set. No code changes or flags are needed.

CLI

# Run with OTel enabled
export OTEL_EXPORTER_OTLP_ENDPOINT="http://your-otel-collector:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="holmesgpt"
holmes ask "Why is my pod crashing?"

Helm / Kubernetes

Add the following to your Helm values.yaml:

additionalEnvVars:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.monitoring.svc:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_SERVICE_NAME
    value: "holmesgpt"
  # Optional: for backends requiring auth (e.g., Dynatrace, Grafana Cloud)
  - name: OTEL_EXPORTER_OTLP_HEADERS
    value: "Authorization=Api-Token YOUR_TOKEN"
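
The OTEL_EXPORTER_OTLP_HEADERS value above follows the format described in the OpenTelemetry specification: comma-separated key=value pairs. The source does not show how HolmesGPT's exporter parses it, so the following is only a minimal sketch of that format. The function name parse_otlp_headers is hypothetical, and lower-casing the keys is a choice made for this example.

```python
from urllib.parse import unquote


def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict (hypothetical helper, not HolmesGPT code)."""
    headers: dict[str, str] = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        # Split on the first '=' only, so base64 padding in the value survives.
        key, value = pair.split("=", 1)
        headers[key.strip().lower()] = unquote(value.strip())
    return headers


print(parse_otlp_headers("Authorization=Api-Token YOUR_TOKEN"))
```

Splitting on the first "=" only matters for tokens that themselves contain "=", such as base64-encoded credentials.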

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint (enables OTel when set) | (unset; OTel disabled) |
| OTEL_EXPORTER_OTLP_PROTOCOL | Export protocol (grpc or http/protobuf) | grpc |
| OTEL_SERVICE_NAME | Service name in traces/metrics | holmesgpt |
| OTEL_EXPORTER_OTLP_HEADERS | Headers for OTLP exporter (e.g., auth tokens) | (none) |
| OTEL_EXPORTER_OTLP_METRICS_ENDPOINT | Override metrics endpoint (if different from traces) | (uses OTEL_EXPORTER_OTLP_ENDPOINT) |

When OTEL_EXPORTER_OTLP_ENDPOINT is not set, HolmesGPT falls back to a no-op DummyTracer, so no spans are recorded or exported and the overhead is negligible.
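
The defaults and fallbacks described in this section can be summarized in a few lines of code. This is a sketch under stated assumptions, not HolmesGPT's actual startup logic: resolve_otel_config and the keys of the returned dictionary are hypothetical names chosen for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_otel_config(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return the effective OTLP settings, or None when OTel is disabled."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if not endpoint:
        # No endpoint configured: instrumentation stays off (no-op tracer).
        return None
    return {
        "endpoint": endpoint,
        "protocol": env.get("OTEL_EXPORTER_OTLP_PROTOCOL", "grpc"),
        "service_name": env.get("OTEL_SERVICE_NAME", "holmesgpt"),
        "headers": env.get("OTEL_EXPORTER_OTLP_HEADERS"),
        # Metrics use the shared endpoint unless explicitly overridden.
        "metrics_endpoint": env.get("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT", endpoint),
    }


print(resolve_otel_config({"OTEL_EXPORTER_OTLP_ENDPOINT": "http://your-otel-collector:4317"}))
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.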

Distributed Traces

Every investigation produces a trace hierarchy:

holmesgpt.investigation (root span)
├── gen_ai.chat (LLM iteration 0 — includes token counts)
│   └── POST (auto-instrumented httpx → LLM provider)
├── holmesgpt.tool.<name> (tool/MCP call)
│   └── POST (auto-instrumented httpx → MCP server)
│       └── MCP server spans (execute_tool, k8s.api/*, etc.)
├── gen_ai.chat (LLM iteration 1)
│   └── ...
└── gen_ai.chat (final iteration — produces answer)

Span Attributes

Investigation span (holmesgpt.investigation):

- holmesgpt.investigation.question: the user's question
- holmesgpt.investigation.stream: whether streaming was used

LLM spans (gen_ai.chat):

- gen_ai.system: LLM provider (litellm)
- gen_ai.request.model: model name
- gen_ai.usage.input_tokens: prompt tokens
- gen_ai.usage.output_tokens: completion tokens
- gen_ai.usage.total_tokens: total tokens
- holmesgpt.iteration: iteration number (0-based)

Tool spans (holmesgpt.tool.<name>):

- holmesgpt.tool.name: tool name
- holmesgpt.tool.status: result status (success or error)

MCP Trace Propagation

When HolmesGPT calls MCP tools over HTTP, trace context is automatically propagated via W3C traceparent headers (using httpx auto-instrumentation). MCP servers that support OpenTelemetry will create child spans linked to the same trace.
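
For readers unfamiliar with the header, the W3C Trace Context format is version-traceid-parentid-flags, with a 32-hex-digit trace ID and a 16-hex-digit parent span ID. The sketch below builds and parses such a header with the standard library only. In HolmesGPT this is handled by the httpx auto-instrumentation, so the function names here are hypothetical and purely illustrative.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> str:
    """What a downstream server would emit: same trace ID, fresh span ID."""
    parent = parse_traceparent(incoming)
    return make_traceparent(parent["trace_id"], secrets.token_hex(8), parent["sampled"])


header = make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(header)
print(child_traceparent(header))
```

Because the trace ID is carried unchanged, spans created by the MCP server join the same trace as the HolmesGPT tool span that issued the request.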

Metrics

HolmesGPT exports the following OTel metrics via OTLP. Metric attribute keys are underscore-delimited (for example, gen_ai_request_model rather than the dotted gen_ai.request.model used on spans) for maximum compatibility across backends.

Token Usage

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| gen_ai.client.token.usage | Counter | {token} | LLM token consumption |

Dimensions: gen_ai_request_model, gen_ai_system, gen_ai_token_type (input or output)

Investigation Metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| holmesgpt.investigation.count | Counter | {investigation} | Number of investigations started |
| holmesgpt.investigation.duration | Histogram | s | End-to-end investigation duration |
| holmesgpt.investigation.iterations | Histogram | {iteration} | LLM iterations per investigation |

Dimensions: gen_ai_request_model

LLM Call Metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| gen_ai.client.operation.duration | Histogram | s | Individual LLM call latency |

Dimensions: gen_ai_request_model, gen_ai_system

Tool / MCP Call Metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| holmesgpt.tool.call.count | Counter | {call} | Number of tool calls |
| holmesgpt.tool.call.duration | Histogram | s | Tool call latency |
| holmesgpt.tool.call.errors | Counter | {error} | Failed tool calls |

Dimensions: holmesgpt_tool_name
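
The three tool metrics relate to each other in a simple way, which the following sketch illustrates with in-memory counters instead of real OTel instruments. It assumes that the call counter includes failed calls and that duration is recorded for both outcomes. The source does not state either, and record_tool_call is a hypothetical wrapper, not HolmesGPT's implementation.

```python
import time
from collections import Counter, defaultdict

call_count = Counter()              # stands in for holmesgpt.tool.call.count
call_errors = Counter()             # stands in for holmesgpt.tool.call.errors
call_durations = defaultdict(list)  # stands in for holmesgpt.tool.call.duration (seconds)


def record_tool_call(tool_name, func, *args, **kwargs):
    """Run func and record count, duration, and errors keyed by tool name."""
    start = time.perf_counter()
    try:
        return func(*args, **kwargs)
    except Exception:
        call_errors[tool_name] += 1
        raise
    finally:
        # Assumption: every call is counted and timed, whether it succeeds or not.
        call_count[tool_name] += 1
        call_durations[tool_name].append(time.perf_counter() - start)


print(record_tool_call("kubectl_get", lambda: "ok"))
print(call_count["kubectl_get"], call_errors["kubectl_get"])
```

Under these assumptions, the error rate for a tool is its error counter divided by its call counter.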

Example Queries

Dynatrace DQL:

// Tool call duration by tool name
timeseries avg(holmesgpt.tool.call.duration, default:0), by: {holmesgpt_tool_name}

// Token usage by model
timeseries sum(gen_ai.client.token.usage, default:0), by: {gen_ai_request_model, gen_ai_token_type}

// Investigation count over time
timeseries sum(holmesgpt.investigation.count, default:0)

// Slowest tools (p95)
timeseries percentile(holmesgpt.tool.call.duration, 95), by: {holmesgpt_tool_name}

PromQL (Grafana / Prometheus):

# Tool call rate by tool name
sum by (holmesgpt_tool_name) (rate(holmesgpt_tool_call_count_total[5m]))

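# Average LLM call duration
# (the _seconds unit suffix is added by most OTLP-to-Prometheus pipelines; drop it if yours disables unit suffixes)
rate(gen_ai_client_operation_duration_seconds_sum[5m]) / rate(gen_ai_client_operation_duration_seconds_count[5m])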
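
The PromQL metric names differ from the OTel names listed earlier because Prometheus-compatible backends typically rewrite them: dots become underscores, known units become suffixes, and counters gain _total. The sketch below approximates that translation. The exact rules depend on the exporter or collector configuration, so treat prometheus_name and its small unit table as illustrative assumptions.

```python
import re

# Only a few well-known units are mapped here; annotation units such as
# {call} or {token} produce no suffix.
UNIT_SUFFIXES = {"s": "seconds", "ms": "milliseconds", "By": "bytes"}


def prometheus_name(otel_name: str, unit: str = "", is_counter: bool = False) -> str:
    """Approximate the Prometheus name for an OTel metric (illustrative only)."""
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    suffix = UNIT_SUFFIXES.get(unit)
    if suffix and not name.endswith("_" + suffix):
        name += "_" + suffix
    if is_counter and not name.endswith("_total"):
        name += "_total"
    return name


print(prometheus_name("holmesgpt.tool.call.count", "{call}", is_counter=True))
print(prometheus_name("gen_ai.client.operation.duration", "s"))
```

If a query returns no data, listing the metric names actually present in the backend is the quickest way to confirm which suffixes were applied.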

Backend Examples

Dynatrace (direct OTLP)

additionalEnvVars:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "https://YOUR_ENV.live.dynatrace.com/api/v2/otlp"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"  # the Dynatrace OTLP API accepts OTLP over HTTP, not gRPC
  - name: OTEL_SERVICE_NAME
    value: "holmesgpt"
  - name: OTEL_EXPORTER_OTLP_HEADERS
    value: "Authorization=Api-Token YOUR_DT_TOKEN"

OTel Collector (self-hosted)

additionalEnvVars:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.monitoring.svc:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_SERVICE_NAME
    value: "holmesgpt"

Grafana Cloud

additionalEnvVars:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "https://otlp-gateway-prod-us-east-0.grafana.net/otlp"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"  # the Grafana Cloud OTLP gateway is an HTTP endpoint
  - name: OTEL_SERVICE_NAME
    value: "holmesgpt"
  - name: OTEL_EXPORTER_OTLP_HEADERS
    value: "Authorization=Basic BASE64_ENCODED_INSTANCE:TOKEN"
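
The BASE64_ENCODED_INSTANCE:TOKEN placeholder above stands for the Base64 encoding of the whole string "instance ID, colon, token", as in standard HTTP Basic authentication. The helper below shows one way to produce the header value. The function name and the sample credentials are made up for illustration.

```python
import base64


def grafana_basic_auth_header(instance_id: str, token: str) -> str:
    """Build an OTEL_EXPORTER_OTLP_HEADERS value for HTTP Basic auth."""
    credentials = f"{instance_id}:{token}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return f"Authorization=Basic {encoded}"


# Placeholder credentials; substitute your own instance ID and API token.
print(grafana_basic_auth_header("123456", "example-token"))
```

Some OTLP exporters require the space after "Basic" to be percent-encoded as %20, so check your backend's setup instructions if authentication fails.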

Architecture

┌──────────────────────────────────────────────────────┐
│                      HolmesGPT                       │
│                                                      │
│  ┌───────────┐  ┌───────────┐  ┌──────────────────┐  │
│  │  Tracer   │  │  Metrics  │  │ httpx auto-instr │  │
│  │ (traces)  │  │(counters, │  │ (W3C traceparent)│  │
│  │           │  │histograms)│  │                  │  │
│  └─────┬─────┘  └─────┬─────┘  └────────┬─────────┘  │
│        │              │                 │            │
└────────┼──────────────┼─────────────────┼────────────┘
         │              │                 │
         │  OTLP/gRPC   │                 │ W3C traceparent
         ▼              ▼                 ▼
┌──────────────────────────┐     ┌─────────────────┐
│      OTel Collector      │     │   MCP Servers   │
│      or Direct OTLP      │     │  (child spans)  │
└────────────┬─────────────┘     └─────────────────┘
             │
             ▼
┌──────────────────────────┐
│         Backend          │
│ (Dynatrace, Grafana,     │
│  etc.)                   │
└──────────────────────────┘