Observability deep-dive — from per-call sample to fleet metric

How MCPG turns 100k tool calls per minute into a Prometheus dashboard you can actually read.

The hardest thing about observability isn't getting the data — it's aggregating the data without losing the ability to drill back down to specific calls. This article walks through MCPG's pipeline from gateway dispatch to Prometheus metric, with a stop at every layer where decisions had to be made.

The pipeline at a glance

scss

[gateway dispatch]
  → ToolCallRecorder hook
  → MetricsBuffer (in-process bounded queue)
  → MetricsReport batch
  → CP gRPC channel
  → tool_invocations table (raw)
  → obs_rollup task (hourly + daily)
  → tool_rollups_hour / tool_rollups_day
  → /v1/orgs/{org}/metrics endpoint
  → Prometheus scrape

Eight stages. Each stage has a job. Let's walk them.

Stage 1: ToolCallRecorder hook

When the gateway finishes dispatching a tool call (success, policy denial, or error), it calls recorder.record_terminal_outcome(sample). The ToolCallRecorder trait is deliberately minimal:

rust

trait ToolCallRecorder: Send + Sync {
    fn record_terminal_outcome(&self, sample: ToolCallSample);
    fn payload_capture_enabled(&self) -> bool;
}

The default recorder is NoopRecorder — observability costs nothing if you're not running a control plane. Wire a real recorder via GatewayRuntime::set_tool_call_recorder(...).

The hook fires from all three terminal paths: direct dispatch, task-augmented (MCP Tasks), and pre-dispatch policy denial. This was a multi-month invariant to get right — early versions only recorded the happy path, which meant our error rate metrics looked artificially low.

Stage 2: MetricsBuffer

MetricsBuffer is a bounded parking_lot::Mutex<VecDeque<ToolCallSample>> with a flush interval (default 5s) and a soft cap (default 1000 samples). When samples arrive faster than the flush rate, they're dropped after the cap with a dropped counter incremented.

Why an in-process buffer:

The gateway must not block on CP availability — if the CP is down, the gateway keeps serving tool calls
5s batching reduces gRPC overhead by ~20× vs per-call streaming
Bounded queue keeps memory predictable under bursts

Why bound rather than expand: a runaway upstream tool generating 1M calls/sec shouldn't OOM the gateway.

Stage 3: MetricsReport batching

Every flush interval, the buffer drains into a MetricsReport proto message:

protobuf

message MetricsReport {
  google.protobuf.Timestamp at = 1;
  uint64 seq = 2;
  repeated ToolInvocationSample invocations = 3;
  uint64 dropped_overflow = 4;
}

seq is monotonic per-instance — the CP detects gaps and can request retransmit (currently optional; not yet wired). dropped_overflow is the count from stage 2, so the operator can see when their buffer is undersized.

Stage 4: gRPC channel

The MetricsReport rides the existing bidirectional Channel stream that the cp-client plugin maintains for config delivery. This means:

One TLS connection per gateway, no separate metrics pipe
Auth is the instance JWT — no separate metrics auth
If the channel is down, samples queue locally and ship on reconnect

Stage 5: tool_invocations (raw)

The CP receives MetricsReport, validates the JWT-claimed instance ID, and writes each sample to tool_invocations:

sql

CREATE TABLE tool_invocations (
  id INTEGER PRIMARY KEY,
  org_id TEXT NOT NULL,
  workspace_id TEXT NOT NULL,
  instance_id TEXT NOT NULL,
  plugin_id TEXT NOT NULL,
  tool_name TEXT NOT NULL,
  binding_id TEXT,
  ingested_at TIMESTAMP NOT NULL,
  started_at TIMESTAMP NOT NULL,
  duration_ms INTEGER NOT NULL,
  outcome TEXT NOT NULL,    -- ok | client_error | server_error | policy_denied | quota_exceeded
  error_code TEXT,
  error_hash TEXT,           -- BLAKE3 of error message; no plaintext
  request_id TEXT,
  caller_subject TEXT
);

A few decisions worth noting:

ingested_at is the authoritative clock — samples might arrive late but rollups bucket by ingest time, so they always land in a sensible bucket.
error_hash is BLAKE3 of the error message, not the message itself. Operators see "5 errors with hash abc123" without leaking what the errors said. The dashboard joins to a separate error_corpus table only if the operator opts in.
Per-row indexes on (org_id, plugin_id, tool_name, ingested_at) keep the drill-down view fast.

Stage 6: obs_rollup task

A leader-leased task runs every 5 minutes (configurable). It looks for the oldest hour bucket without a rollup and aggregates:

sql

INSERT INTO tool_rollups_hour
SELECT
  org_id, workspace_id, plugin_id, tool_name,
  date_trunc('hour', ingested_at) AS hour,
  COUNT(*) AS calls,
  COUNT(*) FILTER (WHERE outcome IN ('client_error','server_error')) AS errors,
  approx_quantile(duration_ms, 0.5) AS p50,
  approx_quantile(duration_ms, 0.95) AS p95,
  approx_quantile(duration_ms, 0.99) AS p99,
  MAX(duration_ms) AS max_ms
FROM tool_invocations
WHERE ingested_at >= ? AND ingested_at < ?
GROUP BY 1,2,3,4,5;

(SQLite uses t-digest rather than Postgres's native approx_quantile; the cp-core abstraction handles the difference.)

Daily rollups happen at midnight UTC, summing the hours.

The leader lease ensures only one replica runs the rollup at any moment, so multi-instance CPs don't double-count. See Clustering for the lease and coordinator model.

Stage 7: tool_rollups_hour / day

Once rolled up, raw samples can be retention-pruned (24h-7d depending on tier). Rollups themselves are kept much longer (90d-7yr).

This two-tier retention is on purpose: operators want long-term trend data without paying long-term detail storage costs.

Stage 8: Prometheus exposition

/v1/orgs/{org}/metrics reads from the rollup tables and emits Prometheus exposition format:

ini

# HELP mcpg_tool_calls_total Total tool invocations
# TYPE mcpg_tool_calls_total counter
mcpg_tool_calls_total{org="default",plugin="github",tool="list_repos"} 4827

# HELP mcpg_tool_latency_ms Tool latency quantiles
# TYPE mcpg_tool_latency_ms gauge
mcpg_tool_latency_ms{quantile="0.5",org="default",plugin="github",tool="list_repos"} 84

The endpoint is auth-bound (license JWT) — Prometheus configs include the bearer.

What we got wrong (and fixed)

Mistake 1: Bucket by started_at. Original implementation bucketed by the gateway-side started_at timestamp. Late-arriving samples (from a paused agent) landed in already-rolled-up buckets and were silently dropped. Switched to ingested_at and the problem vanished.

Mistake 2: Per-call gRPC. The first iteration streamed each sample as it happened. Latency on the gateway was fine, but the CP got hammered. Switching to 5s-batched MetricsReport cut CP CPU by ~95%.

Mistake 3: Storing error messages. The first iteration stored the full error text. PII risk. Switched to BLAKE3 hashes. Operators correlate by hash; the corpus table is opt-in for richer drill-down.

What you can do with the data

The dashboard's Tool activity view is built on the same APIs Prometheus uses, just rendered live. You can:

Filter by plugin, tool, outcome, time range
Sort by call count, error rate, p95 latency
Drill from a metric anomaly to the specific samples that contributed
Open an audit trace for any sample showing identity → policy → dispatch
Export anything as CSV or NDJSON

For Enterprise customers with payload_capture entitled, requests and responses are encrypted per-tenant and surfaced with tenant-bound decrypt-on-view. See the security model article for the encryption details.

That's the pipeline. Eight stages, ~3000 lines of Rust, designed so a single CP replica can ingest 100k calls/sec without breaking.