Observability — Prometheus, OTLP, audit

Wire MCPG to your existing telemetry pipeline. Prometheus exposition, OpenTelemetry traces, and a tamper-evident audit ledger.

MCPG emits three streams:

Prometheus metrics — counters and histograms, fleet-aggregated
OpenTelemetry traces — per-request spans with W3C trace context
Audit ledger — every policy decision, with Ed25519 chain signatures

All three are on by default and can be disabled or rerouted independently.

Prometheus

The control plane exposes a Prometheus exposition endpoint at:

bash

GET /v1/orgs/{org}/metrics?granularity=hour

The endpoint returns rolled-up metrics for the entire fleet, labeled by org, workspace, plugin, and tool:

ini

# HELP mcpg_tool_calls_total Total tool invocations
# TYPE mcpg_tool_calls_total counter
mcpg_tool_calls_total{org="default",workspace="prod",plugin="github",tool="list_repos"} 4827

# HELP mcpg_tool_errors_total Total tool errors
# TYPE mcpg_tool_errors_total counter
mcpg_tool_errors_total{org="default",workspace="prod",plugin="github",tool="list_repos"} 12

# HELP mcpg_tool_latency_ms Tool latency quantiles
# TYPE mcpg_tool_latency_ms gauge
mcpg_tool_latency_ms{quantile="0.5",org="default",plugin="github",tool="list_repos"} 84
mcpg_tool_latency_ms{quantile="0.95",org="default",plugin="github",tool="list_repos"} 312
mcpg_tool_latency_ms{quantile="0.99",org="default",plugin="github",tool="list_repos"} 891
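To sanity-check the endpoint output without a running Prometheus server, the exposition text can be parsed directly. The following is a minimal sketch, not part of MCPG: the helper names `parse_samples` and `error_ratios` are hypothetical, and the regex assumes label values contain no escaped quotes.

```python
import re
from collections import defaultdict

# Matches `name{label="value",...} number`.
# Assumption: label values contain no escaped quotes or braces.
SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(raw_labels)), float(value)


def error_ratios(text):
    """Return {(plugin, tool): errors / calls} for tools with at least one call."""
    calls, errors = defaultdict(float), defaultdict(float)
    for name, labels, value in parse_samples(text):
        key = (labels.get("plugin"), labels.get("tool"))
        if name == "mcpg_tool_calls_total":
            calls[key] += value
        elif name == "mcpg_tool_errors_total":
            errors[key] += value
    return {key: errors[key] / total for key, total in calls.items() if total > 0}


SAMPLE = '''
# TYPE mcpg_tool_calls_total counter
mcpg_tool_calls_total{org="default",workspace="prod",plugin="github",tool="list_repos"} 4827
# TYPE mcpg_tool_errors_total counter
mcpg_tool_errors_total{org="default",workspace="prod",plugin="github",tool="list_repos"} 12
'''

ratios = error_ratios(SAMPLE)
```

With the sample values above, `ratios[("github", "list_repos")]` is 12 / 4827, roughly 0.25%.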

Scrape config:

yaml

- job_name: mcpg-fleet
  scrape_interval: 30s
  metrics_path: /v1/orgs/default/metrics
  params:
    granularity: [hour]
  static_configs:
    - targets: ['mcpg-cp.svc.cluster.local:7843']

The gateway itself also exposes Prometheus metrics directly at :9090/metrics for local-only scraping (legacy compatibility). For fleet observability, prefer the control plane (CP) endpoint.

OpenTelemetry

Every request creates a span. Spans are exported via OTLP to the collector you configure:

yaml

observability:
  otlp:
    endpoint: https://otel-collector.example.com:4317
    headers:
      authorization: "Bearer ${OTEL_TOKEN}"
    sample_rate: 0.1     # 10% of requests

W3C trace context propagates inbound and outbound — if your client sends traceparent, MCPG joins the trace. If your upstream tool accepts traceparent, MCPG sends it.
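For clients that do not already use an OpenTelemetry SDK, a traceparent value can be built by hand. The sketch below follows the W3C Trace Context version 00 layout (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags). The function names are hypothetical, and how MCPG itself validates the header is not specified here.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Build a fresh traceparent value with random trace and span IDs."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # The spec treats all-zero trace or parent IDs as invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


# Attach this dict to an outbound request so the gateway can join the trace.
headers = {"traceparent": new_traceparent()}
```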

Audit ledger

Every policy decision (permit, deny, with-obligations), every plugin lifecycle event, and every operator action writes an audit row. The ledger is:

Per-org chained — each row references the previous row's hash, forming a per-org hash chain. Tampering with any row breaks the chain.
Ed25519-signed — the chain head is signed periodically. Verifiers can check the signature without trusting the database.
Retention-bounded — Community: 30 days, Pro: 90 days, Team: 180 days, Enterprise: 7 years.
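The chaining property described above can be illustrated with a small sketch. This is not MCPG's ledger format: SHA-256, the JSON canonicalization, the `GENESIS` placeholder, and the row fields and action names are all assumptions chosen for illustration, and the periodic Ed25519 signature over the chain head is omitted.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical prev_hash for the first row in a chain


def row_hash(prev_hash, payload):
    """Hash a row's payload together with the previous row's hash."""
    # Canonical JSON so the same payload always hashes identically.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_row(chain, payload):
    """Append a row that links back to the current chain head."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": row_hash(prev_hash, payload),
    })


def verify_chain(chain):
    """Return the index of the first broken row, or None if the chain is intact."""
    prev_hash = GENESIS
    for index, row in enumerate(chain):
        expected = row_hash(prev_hash, row["payload"])
        if row["prev_hash"] != prev_hash or row["hash"] != expected:
            return index
        prev_hash = row["hash"]
    return None


chain = []
append_row(chain, {"action": "tool.call", "decision": "permit"})
append_row(chain, {"action": "tool.call", "decision": "deny"})
append_row(chain, {"action": "plugin.install", "decision": "permit"})
intact = verify_chain(chain)

chain[1]["payload"]["decision"] = "permit"  # tamper with the middle row
broken_at = verify_chain(chain)
```

Here `intact` is `None` and `broken_at` is `1`. Rewriting the tampered row's hash would only move the break to the next row, which is why a signed chain head lets a verifier detect changes without trusting the database.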

Query via the HTTP API:

bash

curl 'https://mcpg-cp.example.com/v1/orgs/default/audit?action_prefix=tool.&limit=100'

Filter by action, actor, time range, or instance.
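The same query can be issued from a script. The sketch below uses only the `action_prefix` and `limit` parameters shown in the curl example; the helper names, the bearer-token header, and the JSON response format are assumptions, since the parameter names for the other filters and the authentication scheme are not given here.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://mcpg-cp.example.com"


def audit_url(org, action_prefix, limit=100):
    """Build the audit query URL for one org."""
    query = urllib.parse.urlencode({"action_prefix": action_prefix, "limit": limit})
    return f"{BASE_URL}/v1/orgs/{urllib.parse.quote(org, safe='')}/audit?{query}"


def fetch_audit(org, action_prefix, limit=100, token=None):
    """Fetch audit rows. Performs a network call when invoked."""
    request = urllib.request.Request(audit_url(org, action_prefix, limit))
    if token:
        # Assumption: bearer-token auth. Check your deployment's auth scheme.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)  # assumption: the endpoint returns JSON


url = audit_url("default", "tool.")
```

`url` matches the address used in the curl example above.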

Per-call telemetry

The control plane stores per-call samples in the tool_invocations table. Error messages are stored only as BLAKE3 hashes, so no plaintext leaks. Hourly and daily rollups serve the metrics endpoint and the dashboard.
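As a rough illustration of what an hourly rollup involves, the sketch below groups per-call samples by UTC hour, plugin, and tool, then computes counts and nearest-rank latency quantiles. The sample fields (`ts`, `plugin`, `tool`, `latency_ms`, `error`) are hypothetical and do not reflect the actual tool_invocations schema or the quantile method MCPG uses.

```python
import math
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(ts):
    """Truncate a POSIX timestamp to the start of its UTC hour."""
    moment = datetime.fromtimestamp(ts, tz=timezone.utc)
    return moment.replace(minute=0, second=0, microsecond=0)


def nearest_rank(sorted_values, q):
    """Nearest-rank quantile of a non-empty, ascending list."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


def rollup_hourly(samples):
    """Aggregate samples into one row per (hour, plugin, tool)."""
    groups = defaultdict(list)
    for sample in samples:
        key = (hour_bucket(sample["ts"]), sample["plugin"], sample["tool"])
        groups[key].append(sample)
    rows = {}
    for key, group in groups.items():
        latencies = sorted(s["latency_ms"] for s in group)
        rows[key] = {
            "calls": len(group),
            "errors": sum(1 for s in group if s["error"]),
            "p50_ms": nearest_rank(latencies, 0.5),
            "p95_ms": nearest_rank(latencies, 0.95),
        }
    return rows


base = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc).timestamp()
samples = [
    {"ts": base + 10, "plugin": "github", "tool": "list_repos", "latency_ms": 80, "error": False},
    {"ts": base + 70, "plugin": "github", "tool": "list_repos", "latency_ms": 90, "error": False},
    {"ts": base + 130, "plugin": "github", "tool": "list_repos", "latency_ms": 100, "error": False},
    {"ts": base + 190, "plugin": "github", "tool": "list_repos", "latency_ms": 300, "error": True},
    {"ts": base + 3605, "plugin": "github", "tool": "list_repos", "latency_ms": 50, "error": False},
]
rows = rollup_hourly(samples)
```

The five samples produce two rows: one for the 12:00 hour with four calls and one error, and one for the 13:00 hour with a single call.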

For Enterprise customers with the payload_capture entitlement, request and response payloads are encrypted per-tenant using AES-256-GCM and stored in tool_invocation_payloads. The dashboard surfaces them with tenant-bound decrypt-on-view.

Drilling down

The dashboard's Tool activity view lets you:

Filter by plugin, tool, outcome, or time range
Drill from a metric anomaly to the specific samples that contributed
Open an audit trace for any sample showing identity → policy → dispatch
Compare arrival times to processing times (lag detection)

Alerting

Wire your Prometheus → Alertmanager pipeline as usual. Common alerts:

yaml

- alert: MCPGToolErrorRate
  expr: |
    sum(rate(mcpg_tool_errors_total[5m])) by (plugin, tool)
    / sum(rate(mcpg_tool_calls_total[5m])) by (plugin, tool)
    > 0.05
  for: 5m
  annotations:
    summary: "High error rate on {{ $labels.plugin }}.{{ $labels.tool }}"

See the observability deep-dive for the architecture behind the metrics pipeline.