Observability
The signal triad (logs, metrics, traces), the audit ledger, and the two health probers that keep a fleet honest — what each sink kind is, where the master switches live, and how to wire scraping and tracing.
MCPG emits the OpenTelemetry signal triad plus a compliance audit ledger, and runs two independent health probers. This page is the day-2 operator's map of all of it: the sink taxonomy, the master switches, and the probers that decide whether a backend or plugin is healthy. For the step-by-step wiring walkthrough, see the Observability guide; for every config key, the Configuration reference.
The signal model
Telemetry lives under the top-level observability block. Each signal carries a
sinks: [...] list; each sink's kind: dispatches to either a built-in factory or a plugin
id resolved against the registry at boot.
observability:
  enabled: true # master kill switch: false silences every child
  logs: { enabled: true, level: info, sinks: [ … ] }
  metrics: { enabled: true, sinks: [ … ] }
  traces: { enabled: false, sinks: [ … ] }
observability.enabled: false is the master switch — it disables every child regardless of
their own enabled: flags (useful for embedded hosts that own telemetry, or minimal test
runs). Each signal also has its own enabled:; the effective state is the AND of the two.
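The AND rule can be illustrated with a few lines of Python. This is only a sketch of the semantics described above, not gateway code; the function names `effective_enabled` and `effective_states` are hypothetical, and the flag values mirror the snippet above.

```python
from __future__ import annotations


def effective_enabled(master: bool, signal: bool) -> bool:
    """A signal emits only when the master switch AND its own flag are on."""
    return master and signal


def effective_states(master: bool, signals: dict[str, bool]) -> dict[str, bool]:
    """Apply the master switch to every per-signal enabled: flag."""
    return {name: effective_enabled(master, flag) for name, flag in signals.items()}


# Per-signal flags as shown in the snippet above: logs on, metrics on, traces off.
defaults = {"logs": True, "metrics": True, "traces": False}

print(effective_states(True, defaults))   # master on: children keep their own state
print(effective_states(False, defaults))  # master off: every child is silenced
```

With the master switch off, flipping `traces` to `True` changes nothing, which is the behavior an embedded host relies on.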
| Signal | Default state | Built-in kind: factories | Default sink |
|---|---|---|---|
| logs | on | stderr, stdout, file | one stderr JSON sink |
| metrics | on | (none; all kinds are plugin ids) | dev.mcpg.observability.prometheus |
| traces | off | (none; otlp ships as the dev.mcpg.observability.otlp plugin) | empty; you add one |
A sink kind: that isn't a built-in keyword is treated as a plugin id, so you can ship logs
to Loki / Splunk, metrics to Datadog, or traces through any exporter plugin by naming its id
in sinks[].
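The dispatch rule (built-in keyword first, otherwise plugin id) might be modeled as below. This is a minimal sketch: `resolve_kind` and the `registry` set are hypothetical stand-ins, and raising on an unknown id is an assumption about what "resolved against the registry at boot" implies, not confirmed behavior.

```python
from __future__ import annotations

# Built-in factory keywords per signal, as listed in the table above.
BUILTIN_KINDS = {
    "logs": {"stderr", "stdout", "file"},
    "metrics": set(),  # every metrics kind is a plugin id
    "traces": set(),   # otlp ships as a plugin, not a built-in
}


def resolve_kind(signal: str, kind: str, registry: set[str]) -> tuple[str, str]:
    """Classify a sink kind as a built-in factory or a registered plugin id."""
    if kind in BUILTIN_KINDS[signal]:
        return ("builtin", kind)
    if kind in registry:
        return ("plugin", kind)
    # Assumption: an id missing from the registry is rejected at boot.
    raise LookupError(f"{signal} sink kind {kind!r} is neither built-in nor a registered plugin")


registry = {"dev.mcpg.observability.prometheus", "dev.mcpg.observability.otlp"}
print(resolve_kind("logs", "stderr", registry))
print(resolve_kind("metrics", "dev.mcpg.observability.prometheus", registry))
```

Note that `stderr` is only a keyword for logs; under this model, naming it as a metrics sink would be looked up as a plugin id.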
Metrics
Metrics are on by default and a factory-fresh gateway already serves Prometheus — the default
metrics sink is the dev.mcpg.observability.prometheus plugin. There are no built-in factory
kinds for metrics; every metrics sink kind: is a plugin id.
observability:
  metrics:
    sinks:
      - kind: dev.mcpg.observability.prometheus
        config:
          path: "/metrics"
          bind: "0.0.0.0:9090"
Scrape :9090/metrics. Gateway internals (every counter / gauge / histogram) and
plugin-emitted metrics both flow through this sink.
# prometheus scrape config
- job_name: mcpg
  scrape_interval: 30s
  metrics_path: /metrics
  static_configs:
    - targets: ['mcpg.svc.cluster.local:9090']
A typical error-rate alert:
- alert: MCPGToolErrorRate
  expr: |
    sum(rate(mcpg_tool_errors_total[5m])) by (tool)
    / sum(rate(mcpg_tool_calls_total[5m])) by (tool)
    > 0.05
  for: 5m
  annotations:
    summary: "High error rate on {{ $labels.tool }}"
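To sanity-check the threshold, the arithmetic behind the alert expression above can be worked by hand. The sketch below is illustrative only: the helper names are hypothetical, and it uses raw counter increases over one window, which give the same ratio as dividing two `rate()` values taken over the same 5m range.

```python
from __future__ import annotations

from typing import Optional

THRESHOLD = 0.05  # same 5% threshold as the alert expression


def error_ratio(errors_delta: float, calls_delta: float) -> Optional[float]:
    """Ratio of error increase to call increase over one window, per tool."""
    if calls_delta == 0:
        # PromQL would produce NaN here, which never satisfies "> 0.05".
        return None
    return errors_delta / calls_delta


def breaches(errors_delta: float, calls_delta: float) -> bool:
    """True when the window's error ratio exceeds the alert threshold."""
    ratio = error_ratio(errors_delta, calls_delta)
    return ratio is not None and ratio > THRESHOLD


print(breaches(12, 200))  # 12 errors in 200 calls is 6%: above the threshold
print(breaches(5, 200))   # 5 errors in 200 calls is 2.5%: below the threshold
```

The alert only fires once the condition has held for the full `for: 5m` period, so a single breaching window is not enough on its own.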
Traces
Traces are off by default — span construction has non-trivial overhead, so operators opt
in. Enable the signal and add the dev.mcpg.observability.otlp sink (a plugin — the only
built-in sink factories are stderr/stdout/file; otlp and prometheus ship as plugins):
observability:
  traces:
    enabled: true
    service_name: "mcpg-gateway"
    propagate_context: true # W3C traceparent to outbound calls (default)
    sinks:
      - kind: dev.mcpg.observability.otlp
        config:
          url: "http://otel-collector.observability.svc.cluster.local:4318"
With propagate_context: true (the default), traceparent / tracestate propagate to
outbound binding calls so downstream services join the same trace; an inbound traceparent
is joined too. The OTLP sink's config.url should be your collector's HTTP/gRPC endpoint
(http://, https://, or grpc://).
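For readers unfamiliar with the header, here is a rough sketch of what joining and propagating a W3C `traceparent` involves. It is not the gateway's implementation: the function names are hypothetical, only the version `00` layout is handled, and starting a fresh trace with the sampled flag set is an assumption (a real sampler makes that decision).

```python
from __future__ import annotations

import re
import secrets
from typing import Optional

# version-traceid-parentid-flags, all lowercase hex (W3C Trace Context layout).
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[tuple[str, str, str]]:
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero ids and version "ff" are invalid per the spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def outbound_traceparent(inbound: Optional[str]) -> str:
    """Build the header for an outbound call, joining the inbound trace if valid."""
    parsed = parse_traceparent(inbound) if inbound else None
    if parsed:
        trace_id, _, flags = parsed
    else:
        # Assumption: no valid inbound context means a new, sampled trace.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # this hop's span becomes the new parent id
    return f"00-{trace_id}-{span_id}-{flags}"


inbound = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(outbound_traceparent(inbound))  # same trace id, fresh parent id
```

The trace id is preserved across hops while the parent id changes at each one, which is what lets downstream services appear in the same trace.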
Dampening plugin-call spans
The host wraps every plugin FFI call in a span for attribution. On hot paths (tool-gate chains, metrics emit) that is 5–50 spans per call at ~5–20 µs each. If you already sample traces at a low rate end-to-end and want to shed that host-side overhead without changing the global sampler, set a per-call sampling rate for plugin spans:
observability:
  plugin_call_sampling_rate: 0.01 # keep 1% of plugin-call spans; rest short-circuit
1.0 (or unset) is a no-op; the value must be in [0.0, 1.0].
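The effect of the rate can be sketched as a simple probabilistic gate. This is an illustration under assumptions, not the host's code: `validate_rate` and `keep_plugin_span` are hypothetical names, and the real sampler may decide deterministically (for example from the trace id) rather than by a random draw.

```python
from __future__ import annotations

import random
from typing import Callable, Optional


def validate_rate(rate: Optional[float]) -> float:
    """Unset means 1.0 (no dampening); anything outside [0.0, 1.0] is rejected."""
    if rate is None:
        return 1.0
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"plugin_call_sampling_rate must be in [0.0, 1.0], got {rate}")
    return float(rate)


def keep_plugin_span(rate: Optional[float], rng: Callable[[], float] = random.random) -> bool:
    """Decide whether one plugin FFI call gets a span or short-circuits."""
    rate = validate_rate(rate)
    if rate >= 1.0:
        return True  # no-op: every plugin call is wrapped in a span
    return rng() < rate  # rng() is in [0.0, 1.0), so rate 0.0 never keeps a span


kept = sum(keep_plugin_span(0.01) for _ in range(100_000))
print(f"kept {kept} of 100000 plugin-call spans")  # roughly 1%
```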
Logs
The default is one stderr JSON sink. Production deployments append a file or plugin sink:
observability:
  logs:
    level: info # signal-wide severity floor
    sinks:
      - kind: stderr
        config: { format: json }
      - kind: file
        config:
          path: "/var/log/mcpg/gateway.log"
          format: json
level: (trace / debug / info / warn / error) is the signal-wide severity floor. A per-sink
level: is parsed, but today only the signal-level floor is enforced. Both gateway internals
and plugin-emitted log events flow through the sink list.
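The floor semantics can be pictured as a severity comparison in which the per-sink value plays no part. The sketch below is hypothetical (`passes_floor` and `sink_accepts` are illustrative names) and only models the behavior described above.

```python
from __future__ import annotations

from typing import Optional

# Severity order, lowest to highest, as listed above.
LEVELS = ("trace", "debug", "info", "warn", "error")


def passes_floor(event_level: str, signal_level: str) -> bool:
    """An event is emitted when its severity is at or above the signal-wide floor."""
    return LEVELS.index(event_level) >= LEVELS.index(signal_level)


def sink_accepts(event_level: str, signal_level: str, sink_level: Optional[str] = None) -> bool:
    """Model of today's behavior: sink_level is accepted but does not filter."""
    if sink_level is not None and sink_level not in LEVELS:
        raise ValueError(f"unknown level {sink_level!r}")  # parsed, so typos still fail
    return passes_floor(event_level, signal_level)


print(sink_accepts("warn", "info"))                      # above the floor
print(sink_accepts("debug", "info"))                     # below the floor
print(sink_accepts("info", "info", sink_level="error"))  # sink level not enforced
```

In practice this means a file sink cannot currently be made quieter than stderr by setting its own level:; filter downstream instead.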
Audit ledger
Audit is a governance concern, not an observability one: it lives under governance.audit
and is the tamper-evident record of every policy decision, plugin lifecycle event, and operator
action. The full treatment is in the Observability guide and Security; the operator-critical
facts are:
governance:
  audit:
    enabled: true
    required: true # gateway refuses to boot if no sink serves traffic
    on_failure: fail_closed # refuse the request if the audit emit fails
    sinks:
      - kind: dev.mcpg.builtin.audit.local-file
        config:
          path: "/var/log/mcpg/audit.log"
The two built-in audit sinks are dev.mcpg.builtin.audit.local-file (hash-chained JSON
Lines, tamper-evident) and dev.mcpg.builtin.audit.tracing. required: true is the
compliance-safe posture; set it false only for dev / CI. In a multi-replica fleet each
replica writes its own ledger and a log-shipping agent (Fluent Bit, Vector, Filebeat) forwards
it to your SIEM so the trail is greppable across pods.
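To see why a hash-chained JSON Lines file is tamper-evident, consider the generic sketch below. It does not reproduce the local-file sink's actual record format: the field names `event`, `prev_hash`, and `hash`, the SHA-256 choice, and the all-zero genesis value are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed starting value for the first record's prev_hash


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous record's hash together with a canonical event encoding."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(lines: list[str], event: dict) -> None:
    """Append one JSON line whose hash covers the event and the prior hash."""
    prev_hash = json.loads(lines[-1])["hash"] if lines else GENESIS
    record = {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    lines.append(json.dumps(record, sort_keys=True))


def first_broken_line(lines: list[str]) -> Optional[int]:
    """Return the 1-based line number where the chain breaks, or None if intact."""
    prev_hash = GENESIS
    for number, line in enumerate(lines, start=1):
        record = json.loads(line)
        if record["prev_hash"] != prev_hash or record["hash"] != _digest(prev_hash, record["event"]):
            return number
        prev_hash = record["hash"]
    return None


ledger: list[str] = []
for decision in ("allow", "deny", "allow"):
    append_record(ledger, {"decision": decision})
print(first_broken_line(ledger))  # intact chain
```

Editing or deleting any earlier line changes the hash that later records depend on, so a verifier walking the file from the top finds the first inconsistent line.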
The two health probers
MCPG runs two distinct probers, one for bindings and one for plugins, alongside the gateway's own liveness endpoint. Conflating them is a common operational mistake.
The gateway liveness endpoint
gateway.server.health_path (default /health) is the gateway's own liveness endpoint for
load balancers and Kubernetes probes. It answers "is this gateway process up." That is all —
it does not reflect backend or plugin health.
The binding prober
gateway.server.health_check is a periodic prober that actively pings each binding's
underlying service (the SQL server, the gRPC endpoint, the REST upstream) and flips each
binding's PluginState between Active and Degraded.
gateway:
  server:
    health_check:
      enabled: true # default true
      interval_ms: 30000 # probe cadence (default 30000)
      timeout_ms: 2000 # per-probe deadline
      unhealthy_threshold: 3 # consecutive failures → Degraded
      degraded_latency_threshold_ms: 1000 # slow-but-up also marks Degraded
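How the two thresholds interact can be sketched as a small state machine. This is a simplified model, not the prober's source: the `BindingHealth` class is hypothetical, a timed-out probe is treated as a failure, and recovering to Active after a single fast success is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BindingHealth:
    """Tracks one binding's state from a stream of probe results."""

    unhealthy_threshold: int = 3
    degraded_latency_threshold_ms: float = 1000
    consecutive_failures: int = 0
    state: str = "Active"

    def record(self, ok: bool, latency_ms: float = 0.0) -> str:
        """Feed one probe result (ok=False covers errors and timeouts)."""
        if not ok:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.unhealthy_threshold:
                self.state = "Degraded"
            return self.state
        # Assumption: any success resets the failure streak.
        self.consecutive_failures = 0
        slow = latency_ms > self.degraded_latency_threshold_ms
        self.state = "Degraded" if slow else "Active"
        return self.state


health = BindingHealth()
for ok, latency in [(False, 0), (False, 0), (False, 0), (True, 20), (True, 1500)]:
    print(ok, latency, health.record(ok, latency))
```

Two failures in a row leave the binding Active; the third flips it. A probe that succeeds but takes longer than the latency threshold also marks it Degraded, which is the slow-but-up case.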
The plugin health probe
observability.plugin_health_probe is a separate prober that watches plugin liveness (the
loaded cdylibs) and is the only writer of PluginState::Degraded for plugins. Without it, a
wedged plugin stays perpetually Active.
observability:
  plugin_health_probe:
    enabled: true # default true
    interval_ms: 30000 # probe cadence (default 30000)
    probe_timeout_ms: 5000 # per-probe FFI deadline
    failure_threshold: 3 # consecutive failures → Degraded
It lives under observability (not server) because it is observability-shaped: it watches
liveness and writes the Degraded state that monitoring consumes.
Operator endpoints at a glance
| Endpoint | Source | Purpose |
|---|---|---|
| gateway.server.health_path (default /health) | gateway listener | LB / k8s liveness: is the process up? |
| :9090/metrics (Prometheus sink bind / path) | metrics sink | Prometheus scrape target. |
| OTLP collector URL | traces sink | Span export. |
| Operator :8443 (/metrics, /healthz, /readyz) | k8s operator | Operator's own metrics + probes; see Kubernetes operator. |
Validating
Every snippet on this page is part of the live AppConfig schema, so a config assembled from them can be checked with:
mcpg-config-check ./config.yaml
What's next
- Observability guide — step-by-step wiring walkthrough
- Deployment topologies — where these settings live per shape
- Kubernetes operator — operator metrics, ServiceMonitor, Grafana
- Configuration reference — every config key