
Observability

The signal triad (logs, metrics, traces), the audit ledger, and the two health probers that keep a fleet honest — what each sink kind is, where the master switches live, and how to wire scraping and tracing.

MCPG emits the OpenTelemetry signal triad plus a compliance audit ledger, and runs two independent health probers. This page is the day-2 operator's map of all of it: the sink taxonomy, the master switches, and the probers that decide whether a backend or plugin is healthy. For the step-by-step wiring walkthrough, see the Observability guide; for every config key, the Configuration reference.

The signal model

Telemetry lives under the top-level observability block. Each signal carries a sinks: [...] list; each sink's kind: dispatches to either a built-in factory or a plugin id resolved against the registry at boot.

yaml
observability:
  enabled: true          # master kill switch — false silences every child
  logs:    { enabled: true,  level: info, sinks: [  ] }
  metrics: { enabled: true,  sinks: [  ] }
  traces:  { enabled: false, sinks: [  ] }

observability.enabled: false is the master switch — it disables every child regardless of their own enabled: flags (useful for embedded hosts that own telemetry, or minimal test runs). Each signal also has its own enabled:; the effective state is the AND of the two.
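
The AND rule can be sketched in a few lines of Python. This is an illustration, not the gateway's implementation: the function name is hypothetical, the per-signal defaults follow the table in this section, and treating a missing master flag as on is an assumption.

```python
def effective_states(observability: dict) -> dict:
    """Effective on/off state per signal: master switch AND per-signal flag."""
    # Assumption: a missing master flag is treated as enabled.
    master = observability.get("enabled", True)
    # Per-signal defaults as documented: logs on, metrics on, traces off.
    defaults = {"logs": True, "metrics": True, "traces": False}
    return {
        signal: master and observability.get(signal, {}).get("enabled", default)
        for signal, default in defaults.items()
    }


# The master switch is off, so every child is off regardless of its own flag.
config = {"enabled": False, "logs": {"enabled": True}, "traces": {"enabled": True}}
print(effective_states(config))
```

With the master switch on, each signal simply reports its own flag or its documented default.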

| Signal | Default state | Built-in kind: factories | Default sink |
| --- | --- | --- | --- |
| logs | on | stderr, stdout, file | one stderr JSON sink |
| metrics | on | (none; all kinds are plugin ids) | dev.mcpg.observability.prometheus |
| traces | off | (none; otlp ships as the dev.mcpg.observability.otlp plugin) | empty; you add one |

A sink kind: that isn't a built-in keyword is treated as a plugin id, so you can ship logs to Loki / Splunk, metrics to Datadog, or traces through any exporter plugin by naming its id in sinks[].
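
A minimal sketch of that dispatch rule, assuming the built-in keywords are exactly the ones listed for each signal above. The function name and the "builtin" / "plugin" return labels are hypothetical; the real gateway resolves plugin ids against its registry at boot, which this sketch does not model.

```python
# Built-in factory keywords per signal, as listed on this page.
BUILTIN_KINDS = {
    "logs": {"stderr", "stdout", "file"},
    "metrics": set(),  # every metrics kind is a plugin id
    "traces": set(),   # otlp ships as a plugin
}


def classify_sink(signal: str, kind: str) -> str:
    """Return 'builtin' for a built-in keyword, otherwise 'plugin'."""
    if signal not in BUILTIN_KINDS:
        raise ValueError(f"unknown signal: {signal!r}")
    return "builtin" if kind in BUILTIN_KINDS[signal] else "plugin"


print(classify_sink("logs", "file"))
print(classify_sink("metrics", "dev.mcpg.observability.prometheus"))
```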

Metrics

Metrics are on by default and a factory-fresh gateway already serves Prometheus — the default metrics sink is the dev.mcpg.observability.prometheus plugin. There are no built-in factory kinds for metrics; every metrics sink kind: is a plugin id.

yaml
observability:
  metrics:
    sinks:
      - kind: dev.mcpg.observability.prometheus
        config:
          path: "/metrics"
          bind: "0.0.0.0:9090"

Scrape :9090/metrics. Gateway internals (every counter / gauge / histogram) and plugin-emitted metrics both flow through this sink.

yaml
# prometheus scrape config
- job_name: mcpg
  scrape_interval: 30s
  metrics_path: /metrics
  static_configs:
    - targets: ['mcpg.svc.cluster.local:9090']

A typical error-rate alert:

yaml
- alert: MCPGToolErrorRate
  expr: |
    sum(rate(mcpg_tool_errors_total[5m])) by (tool)
    / sum(rate(mcpg_tool_calls_total[5m])) by (tool)
    > 0.05
  for: 5m
  annotations:
    summary: "High error rate on {{ $labels.tool }}"
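
To see what the alert expression measures: both rates are taken over the same 5-minute window, so the window length cancels and the ratio reduces to error increase divided by call increase. The Python sketch below shows that arithmetic on made-up counter samples. It ignores counter resets and Prometheus's extrapolation, and returning 0.0 when there were no calls is a simplification (PromQL would produce no usable sample in that case).

```python
def error_ratio(errors_start: float, errors_end: float,
                calls_start: float, calls_end: float) -> float:
    """Ratio of counter increases over one window (illustrative only)."""
    calls = calls_end - calls_start
    if calls <= 0:
        return 0.0  # simplification: treat "no traffic" as "no errors"
    return (errors_end - errors_start) / calls


# Hypothetical samples for one tool at the start and end of a 5-minute window.
ratio = error_ratio(errors_start=12, errors_end=30, calls_start=1000, calls_end=1300)
print(ratio > 0.05)  # 18 errors over 300 calls crosses the 5% threshold
```

The alert only fires once the condition has held for the full `for: 5m` duration.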

Traces

Traces are off by default because span construction has non-trivial overhead, so operators opt in. The only built-in sink factories are stderr, stdout, and file; otlp, like prometheus, ships as a plugin. To turn tracing on, enable the signal and add the dev.mcpg.observability.otlp sink:

yaml
observability:
  traces:
    enabled: true
    service_name: "mcpg-gateway"
    propagate_context: true        # W3C traceparent to outbound calls (default)
    sinks:
      - kind: dev.mcpg.observability.otlp
        config:
          url: "http://otel-collector.observability.svc.cluster.local:4318"

With propagate_context: true (the default), traceparent / tracestate propagate to outbound binding calls so downstream services join the same trace; an inbound traceparent is joined too. The OTLP sink's config.url should be your collector's HTTP/gRPC endpoint (http://, https://, or grpc://).
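
For readers unfamiliar with the header being propagated, here is a small Python sketch of the W3C traceparent layout (version, 32-hex trace id, 16-hex parent span id, 2-hex flags) and of what "joining" a trace means: the outbound header keeps the trace id and carries a fresh span id. The function names are hypothetical and the validation is simplified; this is not the gateway's propagation code.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid per the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace or span id")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(inbound: str) -> str:
    """Build an outbound header: same trace id, new 8-byte span id."""
    ctx = parse_traceparent(inbound)
    new_span_id = secrets.token_hex(8)  # 16 lowercase hex characters
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"


inbound = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(inbound))
```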

Dampening plugin-call spans

The host wraps every plugin FFI call in a span for attribution. On hot paths (tool-gate chains, metrics emit) that amounts to 5–50 spans per call at roughly 5–20 µs each. If your end-to-end trace sampling is already low and you want to shed that host-side overhead without changing the global sampler, set a per-call sampling rate for plugin spans:

yaml
observability:
  plugin_call_sampling_rate: 0.01   # keep 1% of plugin-call spans; rest short-circuit

1.0 (or unset) is a no-op; the value must be in [0.0, 1.0].
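
One way to picture the setting is as an independent coin flip per plugin call. The sketch below assumes uniform random sampling, which this page does not confirm; the function names are hypothetical. It also shows the documented range check and why 1.0 is a no-op.

```python
import random


def validate_rate(rate: float) -> float:
    """Reject values outside the documented [0.0, 1.0] range."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"plugin_call_sampling_rate out of range: {rate}")
    return rate


def should_record_span(rate: float, rng: random.Random) -> bool:
    """Assumed model: keep a plugin-call span with probability `rate`."""
    # rng.random() is in [0.0, 1.0), so rate=1.0 always keeps the span
    # and rate=0.0 never does.
    return rng.random() < validate_rate(rate)


rng = random.Random(0)
kept = sum(should_record_span(0.01, rng) for _ in range(100_000))
print(f"kept {kept} of 100000 plugin-call spans")
```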

Logs

The default is one stderr JSON sink. Production deployments append a file or plugin sink:

yaml
observability:
  logs:
    level: info                     # signal-wide severity floor
    sinks:
      - kind: stderr
        config: { format: json }
      - kind: file
        config:
          path: "/var/log/mcpg/gateway.log"
          format: json

level: (trace / debug / info / warn / error) is the signal-wide severity floor. A per-sink level: is parsed but not yet enforced; today only the signal-level floor filters events. Both gateway internals and plugin-emitted log events flow through the sink list.
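
The severity floor behaves like a threshold on an ordered list of levels. A minimal Python sketch, assuming the levels are ordered as listed above from least to most severe (the helper name is hypothetical):

```python
# Severity order, least to most severe, as listed on this page.
LEVELS = ["trace", "debug", "info", "warn", "error"]


def passes_floor(event_level: str, floor: str) -> bool:
    """True when the event is at or above the signal-wide floor."""
    return LEVELS.index(event_level) >= LEVELS.index(floor)


events = [("debug", "cache miss"), ("info", "listener up"), ("error", "sink failed")]
for level, message in events:
    if passes_floor(level, floor="info"):
        print(level, message)
```

With the floor at info, the debug event is dropped and the other two are emitted.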

Audit ledger

Audit is a governance concern, not an observability one — it lives under governance.audit and is the tamper-evident record of every policy decision, plugin lifecycle event, and operator action. Full treatment is in the Observability guide and Security; the operator-critical facts:

yaml
governance:
  audit:
    enabled: true
    required: true          # gateway refuses to boot if no sink serves traffic
    on_failure: fail_closed # refuse the request if the audit emit fails
    sinks:
      - kind: dev.mcpg.builtin.audit.local-file
        config:
          path: "/var/log/mcpg/audit.log"

The two built-in audit sinks are dev.mcpg.builtin.audit.local-file (hash-chained JSON Lines, tamper-evident) and dev.mcpg.builtin.audit.tracing. required: true is the compliance-safe posture; set it false only for dev / CI. In a multi-replica fleet each replica writes its own ledger and a log-shipping agent (Fluent Bit, Vector, Filebeat) forwards it to your SIEM so the trail is greppable across pods.
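
To illustrate why a hash-chained JSON Lines file is tamper-evident, here is a self-contained Python sketch. The record layout (`prev_hash`, `event`, `hash`), the SHA-256 choice, and the all-zero genesis value are assumptions made for this example; the actual on-disk format of dev.mcpg.builtin.audit.local-file is not described on this page.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # assumed starting value for the chain


def record_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous record's hash together with this event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(lines: List[str], event: dict) -> None:
    """Append one JSON line that links back to the previous line."""
    prev_hash = json.loads(lines[-1])["hash"] if lines else GENESIS
    record = {"prev_hash": prev_hash, "event": event,
              "hash": record_hash(prev_hash, event)}
    lines.append(json.dumps(record))


def first_broken_line(lines: List[str]) -> Optional[int]:
    """Return the index of the first line that breaks the chain, or None."""
    prev_hash = GENESIS
    for index, line in enumerate(lines):
        record = json.loads(line)
        expected = record_hash(prev_hash, record["event"])
        if record["prev_hash"] != prev_hash or record["hash"] != expected:
            return index
        prev_hash = record["hash"]
    return None


ledger: List[str] = []
append_record(ledger, {"action": "policy.allow", "tool": "sql.query"})
append_record(ledger, {"action": "plugin.load", "id": "example"})
print(first_broken_line(ledger))
```

Editing or deleting any earlier line changes the hash that later lines were built on, so verification fails at the first altered record.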

The two health probers

MCPG runs two distinct probers, the binding prober and the plugin health probe, alongside the gateway's own liveness endpoint. Conflating them is a common operational mistake.

The gateway liveness endpoint

gateway.server.health_path (default /health) is the gateway's own liveness endpoint for load balancers and Kubernetes probes. It answers "is this gateway process up." That is all — it does not reflect backend or plugin health.

The binding prober

gateway.server.health_check is a periodic prober that actively pings each binding's underlying service (the SQL server, the gRPC endpoint, the REST upstream) and flips each binding's PluginState between Active and Degraded.

yaml
gateway:
  server:
    health_check:
      enabled: true                       # default true
      interval_ms: 30000                   # probe cadence (default 30000)
      timeout_ms: 2000                     # per-probe deadline
      unhealthy_threshold: 3               # consecutive failures → Degraded
      degraded_latency_threshold_ms: 1000  # slow-but-up also marks Degraded
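
How the two thresholds interact can be sketched as a small state machine. This is one plausible reading, not the gateway's code: the class name is hypothetical, and it assumes that a fast successful probe resets the failure count and restores Active, that a single slow success marks Degraded, and that failures below the threshold leave the state unchanged.

```python
from dataclasses import dataclass


@dataclass
class BindingHealth:
    unhealthy_threshold: int = 3
    degraded_latency_threshold_ms: int = 1000
    consecutive_failures: int = 0
    state: str = "Active"

    def record_probe(self, ok: bool, latency_ms: float = 0.0) -> str:
        """Fold one probe result into the binding's state."""
        if not ok:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.unhealthy_threshold:
                self.state = "Degraded"
        else:
            # Assumption: any success resets the failure streak.
            self.consecutive_failures = 0
            slow = latency_ms > self.degraded_latency_threshold_ms
            self.state = "Degraded" if slow else "Active"
        return self.state


health = BindingHealth()
for _ in range(3):
    health.record_probe(ok=False)
print(health.state)  # three consecutive failures reach the threshold
```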

The plugin health probe

observability.plugin_health_probe is a separate prober that watches plugin liveness (the loaded cdylibs) and is the only writer of PluginState::Degraded for plugins. Without it, a wedged plugin stays perpetually Active.

yaml
observability:
  plugin_health_probe:
    enabled: true            # default true
    interval_ms: 30000        # probe cadence (default 30000)
    probe_timeout_ms: 5000    # per-probe FFI deadline
    failure_threshold: 3      # consecutive failures → Degraded

It lives under observability (not server) because it is observability-shaped: it watches liveness and writes the Degraded state that monitoring consumes.

Operator endpoints at a glance

| Endpoint | Source | Purpose |
| --- | --- | --- |
| gateway.server.health_path (default /health) | gateway listener | LB / k8s liveness: process up? |
| :9090/metrics (Prometheus sink bind / path) | metrics sink | Prometheus scrape target. |
| OTLP collector URL | traces sink | Span export. |
| Operator :8443 (/metrics, /healthz, /readyz) | k8s operator | Operator's own metrics + probes; see Kubernetes operator. |

Validating

Every snippet on this page is part of the live AppConfig schema, so a config assembled from them can be checked with mcpg-config-check:

bash
mcpg-config-check ./config.yaml
