Observability

The signal triad (logs, metrics, traces), the audit ledger, and the two health probers that keep a fleet honest — what each sink kind is, where the master switches live, and how to wire scraping and tracing.

MCPG emits the OpenTelemetry signal triad plus a compliance audit ledger, and runs two independent health probers. This page is the day-2 operator's map of all of it: the sink taxonomy, the master switches, and the probers that decide whether a backend or plugin is healthy. For every config key, see the Configuration reference.

The signal model

Telemetry lives under the top-level observability block. Each signal carries a sinks: [...] list; each sink's kind: dispatches to either a built-in factory or a plugin id resolved against the registry at boot.

yaml

observability:
  enabled: true          # master kill switch — false silences every child
  logs:    { enabled: true,  level: info, sinks: [ … ] }
  metrics: { enabled: true,  sinks: [ … ] }
  traces:  { enabled: false, sinks: [ … ] }

observability.enabled: false is the master switch — it disables every child regardless of their own enabled: flags (useful for embedded hosts that own telemetry, or minimal test runs). Each signal also has its own enabled:; the effective state is the AND of the two.
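The AND rule above can be sketched in a few lines. This is an illustrative model, not MCPG's code: the dictionary is a hypothetical in-memory mirror of the observability block, and the function names are made up for the example.

```python
# Illustrative model of the master-switch rule; not MCPG's implementation.
SIGNALS = ("logs", "metrics", "traces")

# Hypothetical in-memory mirror of the `observability` block shown above.
config = {
    "enabled": True,
    "logs": {"enabled": True},
    "metrics": {"enabled": True},
    "traces": {"enabled": False},
}


def effective_enabled(master: bool, signal: bool) -> bool:
    """A signal is live only when the master switch AND its own flag are true."""
    return master and signal


def effective_states(cfg: dict) -> dict:
    """Resolve the effective on/off state of every signal."""
    return {name: effective_enabled(cfg["enabled"], cfg[name]["enabled"])
            for name in SIGNALS}
```

With the master switch on, each signal follows its own flag; flipping the master switch to false turns all three off without touching the per-signal flags.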

Signal	Default state	Built-in `kind:` factories	Default sink
`logs`	on	`stderr`, `stdout`, `file`	one `stderr` JSON sink
`metrics`	on	(none — all kinds are plugin ids)	`dev.mcpg.observability.prometheus`
`traces`	off	(none — `otlp` ships as the `dev.mcpg.observability.otlp` plugin)	empty — you add one

A sink kind: that isn't a built-in keyword is treated as a plugin id, so you can ship logs to Loki / Splunk, metrics to Datadog, or traces through any exporter plugin by naming its id in sinks[].
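A minimal sketch of that dispatch rule, using the built-in keywords from the table above. The registry contents and the error raised for an unknown id are assumptions for illustration; this page does not say how MCPG reports an unresolved plugin id at boot.

```python
# Built-in factory keywords per signal, taken from the table above.
BUILTIN_KINDS = {
    "logs": {"stderr", "stdout", "file"},
    "metrics": set(),  # every metrics kind is a plugin id
    "traces": set(),   # otlp ships as a plugin
}


def resolve_kind(signal: str, kind: str, registry: set) -> str:
    """Classify a sink `kind:` as a built-in factory or a registered plugin id."""
    if kind in BUILTIN_KINDS[signal]:
        return "builtin"
    if kind in registry:
        return "plugin"
    # Assumption: an id missing from the registry is a boot-time error.
    raise ValueError(f"unknown sink kind for {signal}: {kind}")


# Hypothetical registry holding the two plugin ids named on this page.
registry = {
    "dev.mcpg.observability.prometheus",
    "dev.mcpg.observability.otlp",
}
```

Note that under this rule the same string can resolve differently per signal: `stderr` is a built-in for logs but would be looked up as a plugin id for metrics.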

Metrics

Metrics are on by default and a factory-fresh gateway already serves Prometheus — the default metrics sink is the dev.mcpg.observability.prometheus plugin. There are no built-in factory kinds for metrics; every metrics sink kind: is a plugin id.

yaml

observability:
  metrics:
    sinks:
      - kind: dev.mcpg.observability.prometheus
        config:
          path: "/metrics"
          bind: "0.0.0.0:9090"

Scrape :9090/metrics. Gateway internals (every counter / gauge / histogram) and plugin-emitted metrics both flow through this sink.

yaml

# prometheus scrape config
- job_name: mcpg
  scrape_interval: 30s
  metrics_path: /metrics
  static_configs:
    - targets: ['mcpg.svc.cluster.local:9090']

A typical error-rate alert:

yaml

- alert: MCPGToolErrorRate
  expr: |
    sum(rate(mcpg_tool_errors_total[5m])) by (tool)
    / sum(rate(mcpg_tool_calls_total[5m])) by (tool)
    > 0.05
  for: 5m
  annotations:
    summary: "High error rate on {{ $labels.tool }}"
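To make the alert expression concrete, here is the same ratio worked in Python on made-up numbers. The tool names and rates are hypothetical; the sketch also skips tools with a zero call rate, where PromQL would produce NaN or Inf instead.

```python
def firing_tools(errors_per_sec: dict, calls_per_sec: dict,
                 threshold: float = 0.05) -> list:
    """Mimic the alert expression: per-tool error rate / call rate > threshold."""
    firing = []
    for tool, calls in calls_per_sec.items():
        # Like PromQL one-to-one matching, only tools present on both sides count.
        if tool not in errors_per_sec or calls <= 0:
            continue
        if errors_per_sec[tool] / calls > threshold:
            firing.append(tool)
    return sorted(firing)


# Hypothetical 5-minute rates, in events per second.
errors = {"search": 0.5, "sql_query": 0.01}
calls = {"search": 5.0, "sql_query": 2.0}
```

Here `search` has a ratio of 0.1 and exceeds the 5% threshold, while `sql_query` at 0.005 does not. The rule's `for: 5m` clause additionally requires the condition to hold for five minutes before the alert fires.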

Traces

Traces are off by default: span construction has non-trivial overhead, so operators opt in. The OTLP sink is a plugin (the only built-in sink factories are stderr, stdout, and file; otlp and prometheus ship as plugins), so turning tracing on means enabling the signal and adding the dev.mcpg.observability.otlp sink:

yaml

observability:
  traces:
    enabled: true
    service_name: "mcpg-gateway"
    propagate_context: true        # W3C traceparent to outbound calls (default)
    sinks:
      - kind: dev.mcpg.observability.otlp
        config:
          url: "http://otel-collector.observability.svc.cluster.local:4318"

With propagate_context: true (the default), traceparent / tracestate propagate to outbound binding calls so downstream services join the same trace; an inbound traceparent is joined too. The OTLP sink's config.url should be your collector's HTTP/gRPC endpoint (http://, https://, or grpc://).
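For readers unfamiliar with the header, the sketch below shows what joining and propagating a W3C traceparent involves: keep the inbound trace id, mint a new span id, and pass the flags through. It is a simplified illustration of the header format, not MCPG's propagation code; starting a fresh sampled trace on a missing or invalid header is an assumption of this sketch.

```python
import re
import secrets
from typing import Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def child_traceparent(inbound: Optional[str]) -> str:
    """Build the traceparent for an outbound call.

    A valid inbound header is joined (same trace id, same flags);
    otherwise a new trace is started. Every call gets a new span id.
    """
    match = TRACEPARENT_RE.match(inbound) if inbound else None
    valid = (
        match is not None
        and match.group(1) != "ff"       # version ff is invalid
        and match.group(2) != "0" * 32   # all-zero trace id is invalid
        and match.group(3) != "0" * 16   # all-zero parent id is invalid
    )
    if valid:
        trace_id, flags = match.group(2), match.group(4)
    else:
        # Assumption: start a new, sampled trace when there is nothing to join.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{flags}"
```

Because the trace id survives the hop, the collector can stitch the gateway's spans and the downstream service's spans into one trace.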

Dampening plugin-call spans

The host wraps every plugin FFI call in a span for attribution. On hot paths (tool-gate chains, metrics emit) that is 5–50 spans per call at ~5–20 µs each. If you sample traces low end-to-end and want to shed that host-side overhead without changing the global sampler, set a per-call sampling rate for plugin spans:

yaml

observability:
  plugin_call_sampling_rate: 0.01   # keep 1% of plugin-call spans; rest short-circuit

1.0 (or unset) is a no-op; the value must be in [0.0, 1.0].
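One way to picture the setting is a per-call coin flip, sketched below. The function names are hypothetical and MCPG's actual sampling strategy is not described on this page; the sketch only mirrors the documented contract (unset means 1.0, values outside [0.0, 1.0] are rejected).

```python
import random
from typing import Callable, Optional


def validate_rate(rate: Optional[float]) -> float:
    """Apply the documented contract: unset means 1.0, range is [0.0, 1.0]."""
    if rate is None:
        return 1.0
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"plugin_call_sampling_rate out of range: {rate}")
    return float(rate)


def keep_span(rate: float, rng: Callable[[], float] = random.random) -> bool:
    """Decide whether to build a span for this plugin call.

    random.random() returns a value in [0.0, 1.0), so a rate of 1.0
    always keeps the span and a rate of 0.0 never does.
    """
    return rng() < rate
```

At the upper bound quoted above, 50 spans at 20 µs each is about 1 ms of host-side work per call, which is the cost this setting lets you shed.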

Logs

The default is one stderr JSON sink. Production deployments append a file or plugin sink:

yaml

observability:
  logs:
    level: info                     # signal-wide severity floor
    sinks:
      - kind: stderr
        config: { format: json }
      - kind: file
        config:
          path: "/var/log/mcpg/gateway.log"
          format: json

level: (trace / debug / info / warn / error) is the signal-wide severity floor. A per-sink level: is parsed but not yet enforced; today only the signal-wide floor filters events. Both gateway internals and plugin-emitted log events flow through the sink list.
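The floor semantics can be sketched as follows. The severity ordering is taken from the list above; the `deliver` helper and the sink dictionaries are hypothetical and only model the behavior described here, in which a per-sink level has no effect.

```python
# Severities in ascending order, as listed above.
LEVELS = ("trace", "debug", "info", "warn", "error")


def passes_floor(event_level: str, floor: str = "info") -> bool:
    """An event is emitted when its severity is at or above the floor."""
    return LEVELS.index(event_level) >= LEVELS.index(floor)


def deliver(event_level: str, floor: str, sinks: list) -> list:
    """Return the kinds of the sinks that receive the event.

    Any per-sink `level` key is deliberately ignored, matching the
    behavior described above: only the signal-wide floor filters.
    """
    if not passes_floor(event_level, floor):
        return []
    return [sink["kind"] for sink in sinks]


# Hypothetical sink list; the per-sink level on `file` has no effect.
sinks = [{"kind": "stderr"}, {"kind": "file", "level": "error"}]
```

With a floor of info, a warn event reaches both sinks even though the file sink declares error, and a debug event reaches neither.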

Audit ledger

Audit is a governance concern, not an observability one: it lives under governance.audit and is the tamper-evident record of every policy decision, plugin lifecycle event, and operator action. The Audit trail page and the Security section cover it in full; the operator-critical facts are:

yaml

governance:
  audit:
    enabled: true
    required: true          # gateway refuses to boot if no sink serves traffic
    on_failure: fail_closed # refuse the request if the audit emit fails
    sinks:
      - kind: dev.mcpg.builtin.audit.local-file
        config:
          path: "/var/log/mcpg/audit.log"

The two built-in audit sinks are dev.mcpg.builtin.audit.local-file (hash-chained JSON Lines, tamper-evident) and dev.mcpg.builtin.audit.tracing. required: true is the compliance-safe posture; set it false only for dev / CI. In a multi-replica fleet each replica writes its own ledger and a log-shipping agent (Fluent Bit, Vector, Filebeat) forwards it to your SIEM so the trail is greppable across pods.
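To show why a hash-chained JSON Lines file is tamper-evident, here is a generic sketch of the technique. The record layout (`prev_hash`, `payload`, `hash`), the SHA-256 choice, and the all-zero genesis value are assumptions for illustration; the on-disk format of the local-file sink is documented in the Audit trail page, not here.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed starting value for the first record


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous record's hash together with a canonical payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(lines: list, payload: dict) -> None:
    """Append one record, chaining it to the last line."""
    prev = json.loads(lines[-1])["hash"] if lines else GENESIS
    record = {"prev_hash": prev, "payload": payload,
              "hash": entry_hash(prev, payload)}
    lines.append(json.dumps(record, sort_keys=True))


def verify(lines: list) -> Optional[int]:
    """Return the index of the first broken record, or None if the chain holds."""
    prev = GENESIS
    for index, line in enumerate(lines):
        record = json.loads(line)
        if (record["prev_hash"] != prev
                or record["hash"] != entry_hash(prev, record["payload"])):
            return index
        prev = record["hash"]
    return None
```

Editing or deleting any line changes a hash that the next line depends on, so verification fails at the first altered record rather than passing silently.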

The two health probers

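MCPG runs two distinct probers alongside the gateway's own liveness endpoint, and conflating them is a common operational mistake. The liveness endpoint reports only on the gateway process; the binding prober and the plugin health probe are what decide whether a backend or a plugin is healthy.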

The gateway liveness endpoint

gateway.server.health_path (default /health) is the gateway's own liveness endpoint for load balancers and Kubernetes probes. It answers one question: is this gateway process up? It does not reflect backend or plugin health.

The binding prober

gateway.server.health_check is a periodic prober that actively pings each binding's underlying service (the SQL server, the gRPC endpoint, the REST upstream) and flips each binding's PluginState between Active and Degraded.

yaml

gateway:
  server:
    health_check:
      enabled: true                       # default true
      interval_ms: 30000                   # probe cadence (default 30000)
      timeout_ms: 2000                     # per-probe deadline
      unhealthy_threshold: 3               # consecutive failures → Degraded
      degraded_latency_threshold_ms: 1000  # slow-but-up also marks Degraded
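The two thresholds in that block combine as sketched below. The class is a hypothetical model: treating a probe that exceeds timeout_ms as a failure, and letting a single fast success restore Active, are assumptions this page does not confirm.

```python
from dataclasses import dataclass


@dataclass
class BindingHealth:
    """Toy model of one binding's state under the binding prober."""
    unhealthy_threshold: int = 3
    degraded_latency_threshold_ms: float = 1000.0
    consecutive_failures: int = 0
    state: str = "Active"

    def record(self, ok: bool, latency_ms: float = 0.0) -> str:
        """Fold one probe result into the state and return the new state."""
        if not ok:
            # Assumption: a probe past its deadline is reported as ok=False.
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.unhealthy_threshold:
                self.state = "Degraded"
            return self.state
        # Assumption: any success resets the failure streak.
        self.consecutive_failures = 0
        slow = latency_ms > self.degraded_latency_threshold_ms
        self.state = "Degraded" if slow else "Active"
        return self.state
```

Two failed probes in a row leave the binding Active; the third flips it to Degraded. A probe that succeeds but takes longer than the latency threshold also marks it Degraded, which is the slow-but-up case.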

The plugin health probe

observability.plugin_health_probe is a separate prober that watches plugin liveness (the loaded cdylibs) and is the only writer of PluginState::Degraded for plugins. Without it, a wedged plugin stays perpetually Active.

yaml

observability:
  plugin_health_probe:
    enabled: true            # default true
    interval_ms: 30000        # probe cadence (default 30000)
    probe_timeout_ms: 5000    # per-probe FFI deadline
    failure_threshold: 3      # consecutive failures → Degraded

It lives under observability (not server) because it is observability-shaped: it watches liveness and writes the Degraded state that monitoring consumes.

Operator endpoints at a glance

Endpoint	Source	Purpose
`gateway.server.health_path` (default `/health`)	gateway listener	LB / k8s liveness — process up?
`:9090/metrics` (Prometheus sink `bind` / `path`)	metrics sink	Prometheus scrape target.
OTLP collector URL	traces sink	Span export.
Operator `:8443` (`/metrics`, `/healthz`, `/readyz`)	k8s operator	Operator's own metrics + probes — see Kubernetes operator.

Validating

Every snippet on this page is part of the live AppConfig schema, so a config assembled from them can be checked before rollout:

bash

mcpg config check ./config.yaml

What's next

Day-2 operations & upgrades — rolling upgrades, config reloads, backups
Deployment topologies — where these settings live per shape
Kubernetes operator — operator metrics, ServiceMonitor, Grafana
Audit trail — the tamper-evident ledger in depth
Configuration reference — every config key