The hardest thing about observability isn't getting the data — it's aggregating the data without losing the ability to drill back down to specific calls. This article walks through MCPG's pipeline from gateway dispatch to Prometheus metric, with a stop at every layer where decisions had to be made.
The pipeline at a glance
[gateway dispatch]
→ ToolCallRecorder hook
→ MetricsBuffer (in-process bounded queue)
→ MetricsReport batch
→ CP gRPC channel
→ tool_invocations table (raw)
→ obs_rollup task (hourly + daily)
→ tool_rollups_hour / tool_rollups_day
→ /v1/orgs/{org}/metrics endpoint
→ Prometheus scrape
Eight stages. Each stage has a job. Let's walk them.
Stage 1: ToolCallRecorder hook
When the gateway finishes dispatching a tool call (success, policy denial, or error),
it calls recorder.record_terminal_outcome(sample). The ToolCallRecorder trait is
deliberately minimal:
trait ToolCallRecorder: Send + Sync {
    fn record_terminal_outcome(&self, sample: ToolCallSample);
    fn payload_capture_enabled(&self) -> bool;
}
The default recorder is NoopRecorder — observability costs nothing if you're not
running a control plane. Wire a real recorder via
GatewayRuntime::set_tool_call_recorder(...).
The hook fires from all three terminal paths: direct dispatch, task-augmented (MCP Tasks), and pre-dispatch policy denial. This invariant took months to get right: early versions recorded only the happy path, which made our error-rate metrics look artificially low.
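To show how little a real recorder needs, here is a minimal sketch of one that counts terminal outcomes in memory. The trait is restated so the snippet compiles on its own. The fields of ToolCallSample, the Outcome enum, and CountingRecorder are assumptions made for this example, not MCPG's actual types.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical outcome set, modelled on the outcome values listed for tool_invocations.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Ok,
    ClientError,
    ServerError,
    PolicyDenied,
    QuotaExceeded,
}

// Hypothetical sample shape; the real ToolCallSample is not shown in this article.
pub struct ToolCallSample {
    pub tool_name: String,
    pub duration_ms: u64,
    pub outcome: Outcome,
}

// The trait from the article, restated so this snippet is self-contained.
pub trait ToolCallRecorder: Send + Sync {
    fn record_terminal_outcome(&self, sample: ToolCallSample);
    fn payload_capture_enabled(&self) -> bool;
}

// Counts successes and non-successes with atomics, so recording never takes a lock.
#[derive(Default)]
pub struct CountingRecorder {
    ok: AtomicU64,
    not_ok: AtomicU64,
}

impl CountingRecorder {
    // Returns (ok, not_ok) as observed at the time of the call.
    pub fn counts(&self) -> (u64, u64) {
        (self.ok.load(Ordering::Relaxed), self.not_ok.load(Ordering::Relaxed))
    }
}

impl ToolCallRecorder for CountingRecorder {
    fn record_terminal_outcome(&self, sample: ToolCallSample) {
        // Every terminal path, including policy denials, ends up in one of the two counters.
        let counter = if sample.outcome == Outcome::Ok { &self.ok } else { &self.not_ok };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    fn payload_capture_enabled(&self) -> bool {
        false
    }
}
```

A recorder like this would be handed to GatewayRuntime::set_tool_call_recorder(...); the exact argument type that method expects is not covered here.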
Stage 2: MetricsBuffer
MetricsBuffer is a bounded parking_lot::Mutex<VecDeque<ToolCallSample>> with a
flush interval (default 5s) and a soft cap (default 1000 samples). When samples
arrive faster than the buffer drains, any sample that arrives once the cap is
reached is dropped and a dropped counter is incremented.
Why an in-process buffer:
- The gateway must not block on CP availability — if the CP is down, the gateway keeps serving tool calls
- 5s batching reduces gRPC overhead by ~20× vs per-call streaming
- Bounded queue keeps memory predictable under bursts
Why bound rather than expand: a runaway upstream tool generating 1M calls/sec shouldn't OOM the gateway.
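The drop-on-overflow behaviour can be sketched in a few lines. This is an illustration under assumptions, not the real MetricsBuffer: it uses std::sync::Mutex instead of parking_lot so it builds without dependencies, the ToolCallSample fields are made up, and resetting the dropped counter on every drain is a guess at how the count reaches each report.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Hypothetical sample type for the sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct ToolCallSample {
    pub tool_name: String,
    pub duration_ms: u64,
}

struct Inner {
    queue: VecDeque<ToolCallSample>,
    dropped: u64,
}

// Bounded buffer: pushes past `cap` are counted and discarded rather than queued.
pub struct MetricsBuffer {
    cap: usize,
    inner: Mutex<Inner>,
}

impl MetricsBuffer {
    pub fn new(cap: usize) -> Self {
        Self {
            cap,
            inner: Mutex::new(Inner { queue: VecDeque::with_capacity(cap), dropped: 0 }),
        }
    }

    // Returns false when the sample was dropped because the buffer is full.
    pub fn push(&self, sample: ToolCallSample) -> bool {
        let mut inner = self.inner.lock().unwrap();
        if inner.queue.len() >= self.cap {
            inner.dropped += 1;
            return false;
        }
        inner.queue.push_back(sample);
        true
    }

    // Called on each flush tick: takes everything queued plus the drop count, resetting both.
    pub fn drain(&self) -> (Vec<ToolCallSample>, u64) {
        let mut inner = self.inner.lock().unwrap();
        let samples: Vec<ToolCallSample> = inner.queue.drain(..).collect();
        let dropped = std::mem::take(&mut inner.dropped);
        (samples, dropped)
    }
}
```

Because push never waits for the CP, the cost of a CP outage in this sketch is lost samples and a rising drop count, never a stalled tool call.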
Stage 3: MetricsReport batching
Every flush interval, the buffer drains into a MetricsReport proto message:
message MetricsReport {
    google.protobuf.Timestamp at = 1;
    uint64 seq = 2;
    repeated ToolInvocationSample invocations = 3;
    uint64 dropped_overflow = 4;
}
seq is monotonic per-instance — the CP detects gaps and can request retransmit
(currently optional; not yet wired). dropped_overflow is the count from stage 2,
so the operator can see when their buffer is undersized.
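Gap detection on a monotonic seq comes down to comparing each incoming value with the last one seen for that instance. The sketch below shows one way the CP side could classify a report; SeqCheck and check_seq are hypothetical names, and how MCPG treats gateway restarts or duplicates is not covered here.

```rust
// Result of comparing an incoming report's seq against the last one seen for that instance.
#[derive(Debug, PartialEq, Eq)]
pub enum SeqCheck {
    // First report from this instance, or exactly last + 1.
    InOrder,
    // One or more reports are missing; the range is inclusive.
    Gap { missing_from: u64, missing_to: u64 },
    // seq is not newer than what was already seen (duplicate or replay).
    Stale,
}

pub fn check_seq(last_seen: Option<u64>, incoming: u64) -> SeqCheck {
    match last_seen {
        None => SeqCheck::InOrder,
        // Checked first, so `last + 1` below cannot overflow: it only runs when incoming > last.
        Some(last) if incoming <= last => SeqCheck::Stale,
        Some(last) if incoming == last + 1 => SeqCheck::InOrder,
        Some(last) => SeqCheck::Gap { missing_from: last + 1, missing_to: incoming - 1 },
    }
}
```

The inclusive missing range is what a retransmit request would need to carry once that path is wired.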
Stage 4: gRPC channel
The MetricsReport rides the existing bidirectional Channel stream that the cp-client
plugin maintains for config delivery. This means:
- One TLS connection per gateway, no separate metrics pipe
- Auth is the instance JWT — no separate metrics auth
- If the channel is down, samples queue locally and ship on reconnect
Stage 5: tool_invocations (raw)
The CP receives MetricsReport, validates the JWT-claimed instance ID, and writes
each sample to tool_invocations:
CREATE TABLE tool_invocations (
    id             INTEGER PRIMARY KEY,
    org_id         TEXT NOT NULL,
    workspace_id   TEXT NOT NULL,
    instance_id    TEXT NOT NULL,
    plugin_id      TEXT NOT NULL,
    tool_name      TEXT NOT NULL,
    binding_id     TEXT,
    ingested_at    TIMESTAMP NOT NULL,
    started_at     TIMESTAMP NOT NULL,
    duration_ms    INTEGER NOT NULL,
    outcome        TEXT NOT NULL, -- ok | client_error | server_error | policy_denied | quota_exceeded
    error_code     TEXT,
    error_hash     TEXT, -- BLAKE3 of error message; no plaintext
    request_id     TEXT,
    caller_subject TEXT
);
A few decisions worth noting:
- ingested_at is the authoritative clock: samples might arrive late, but rollups bucket by ingest time, so they always land in a sensible bucket.
- error_hash is BLAKE3 of the error message, not the message itself. Operators see "5 errors with hash abc123" without leaking what the errors said. The dashboard joins to a separate error_corpus table only if the operator opts in.
- An index on (org_id, plugin_id, tool_name, ingested_at) keeps the drill-down view fast.
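The effect of bucketing by ingest time is easiest to see with numbers. The sketch below floors Unix timestamps to the hour and keys the rollup on ingested_at; SampleTimes, hour_bucket, and rollup_bucket are hypothetical names for illustration, and timestamps are plain seconds rather than whatever type the CP actually stores.

```rust
const SECS_PER_HOUR: i64 = 3_600;

// Floors a Unix timestamp (seconds) to the start of its UTC hour.
// div_euclid keeps the floor correct for negative timestamps too.
pub fn hour_bucket(epoch_secs: i64) -> i64 {
    epoch_secs.div_euclid(SECS_PER_HOUR) * SECS_PER_HOUR
}

// Hypothetical slice of a tool_invocations row: only the two clocks matter here.
pub struct SampleTimes {
    pub started_at: i64,
    pub ingested_at: i64,
}

// Rollups key on ingest time, so a late sample is bucketed by when the CP received it,
// not by when the call started.
pub fn rollup_bucket(sample: &SampleTimes) -> i64 {
    hour_bucket(sample.ingested_at)
}
```

A sample that started at 1_700_000_000 but was ingested about two hours later, at 1_700_007_500, is bucketed under 1_700_006_400 (the hour it arrived in) rather than 1_699_999_200 (the hour it started in).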
Stage 6: obs_rollup task
A leader-leased task runs every 5 minutes (configurable). It looks for the oldest hour bucket without a rollup and aggregates:
INSERT INTO tool_rollups_hour
SELECT
    org_id, workspace_id, plugin_id, tool_name,
    date_trunc('hour', ingested_at) AS hour,
    COUNT(*) AS calls,
    COUNT(*) FILTER (WHERE outcome IN ('client_error','server_error')) AS errors,
    approx_quantile(duration_ms, 0.5) AS p50,
    approx_quantile(duration_ms, 0.95) AS p95,
    approx_quantile(duration_ms, 0.99) AS p99,
    MAX(duration_ms) AS max_ms
FROM tool_invocations
WHERE ingested_at >= ? AND ingested_at < ?
GROUP BY 1,2,3,4,5;
(SQLite uses t-digest rather than Postgres's native approx_quantile; the cp-core
abstraction handles the difference.)
Daily rollups happen at midnight UTC, summing the hours.
The leader lease ensures only one replica runs the rollup at any moment, so multi-instance CPs don't double-count. See the multi-instance correctness article for the lease implementation.
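For readers who think in code rather than SQL, the per-group aggregation in the query above can be sketched in memory. Note the swap: this uses an exact nearest-rank quantile over sorted durations, not the approximate t-digest or approx_quantile the pipeline relies on. HourRollup and rollup_hour are hypothetical names, and each input row is reduced to a (duration_ms, outcome) pair.

```rust
#[derive(Debug, PartialEq)]
pub struct HourRollup {
    pub calls: u64,
    pub errors: u64,
    pub p50: u64,
    pub p95: u64,
    pub p99: u64,
    pub max_ms: u64,
}

// Exact nearest-rank quantile over an ascending, non-empty slice.
fn nearest_rank(sorted: &[u64], q: f64) -> u64 {
    let rank = (q * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

// Each input is (duration_ms, outcome) for one row in the hour; returns None for an empty hour.
pub fn rollup_hour(rows: &[(u64, &str)]) -> Option<HourRollup> {
    if rows.is_empty() {
        return None;
    }
    let mut durations: Vec<u64> = rows.iter().map(|(d, _)| *d).collect();
    durations.sort_unstable();
    // Mirrors the FILTER clause: only client_error and server_error count as errors.
    let errors = rows
        .iter()
        .filter(|(_, outcome)| matches!(*outcome, "client_error" | "server_error"))
        .count() as u64;
    Some(HourRollup {
        calls: rows.len() as u64,
        errors,
        p50: nearest_rank(&durations, 0.5),
        p95: nearest_rank(&durations, 0.95),
        p99: nearest_rank(&durations, 0.99),
        max_ms: *durations.last()?,
    })
}
```

As in the SQL, policy_denied and quota_exceeded rows add to calls but not to errors.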
Stage 7: tool_rollups_hour / day
Once rolled up, raw samples can be retention-pruned (24h-7d depending on tier). Rollups themselves are kept much longer (90d-7yr).
This two-tier retention is on purpose: operators want long-term trend data without paying long-term detail storage costs.
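The pruning rule implied here has two conditions: the row's hour has been rolled up, and the row is older than the tier's raw retention window. A minimal sketch, assuming Unix-second timestamps and a hypothetical can_prune helper:

```rust
use std::time::Duration;

// A raw row may be pruned only when its hour is already rolled up
// and the row is at least as old as the raw retention window.
pub fn can_prune(ingested_at: u64, now: u64, raw_retention: Duration, hour_rolled_up: bool) -> bool {
    // saturating_sub guards against clock skew where ingested_at is ahead of now.
    hour_rolled_up && now.saturating_sub(ingested_at) >= raw_retention.as_secs()
}
```

Checking the rollup flag first means a stalled rollup task delays pruning instead of letting raw rows disappear before they have been aggregated.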
Stage 8: Prometheus exposition
/v1/orgs/{org}/metrics reads from the rollup tables and emits Prometheus
exposition format:
# HELP mcpg_tool_calls_total Total tool invocations
# TYPE mcpg_tool_calls_total counter
mcpg_tool_calls_total{org="default",plugin="github",tool="list_repos"} 4827
# HELP mcpg_tool_latency_ms Tool latency quantiles
# TYPE mcpg_tool_latency_ms gauge
mcpg_tool_latency_ms{quantile="0.5",org="default",plugin="github",tool="list_repos"} 84
The endpoint requires authentication (the license JWT), so the Prometheus scrape config must supply it as a bearer token.
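Rendering the counter lines is mostly string formatting plus label escaping. The sketch below produces the mcpg_tool_calls_total block shown above from in-memory rows. CallsRow and render_calls_total are hypothetical names, and the real endpoint's row type and query are not shown in this article.

```rust
use std::fmt::Write;

// Hypothetical row shape read from a rollup table.
pub struct CallsRow {
    pub org: String,
    pub plugin: String,
    pub tool: String,
    pub calls: u64,
}

// Escapes a label value for the text exposition format: backslash, double quote, newline.
// Backslashes go first so the escapes added afterwards are not escaped again.
pub fn escape_label(value: &str) -> String {
    value.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n")
}

pub fn render_calls_total(rows: &[CallsRow]) -> String {
    let mut out = String::new();
    out.push_str("# HELP mcpg_tool_calls_total Total tool invocations\n");
    out.push_str("# TYPE mcpg_tool_calls_total counter\n");
    for row in rows {
        // Writing to a String cannot fail, so the Result is safe to discard.
        let _ = writeln!(
            out,
            "mcpg_tool_calls_total{{org=\"{}\",plugin=\"{}\",tool=\"{}\"}} {}",
            escape_label(&row.org),
            escape_label(&row.plugin),
            escape_label(&row.tool),
            row.calls
        );
    }
    out
}
```

Escaping matters because tool and plugin names come from upstream servers; an unescaped quote in a tool name would otherwise break the scrape.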
What we got wrong (and fixed)
Mistake 1: Bucket by started_at. Original implementation bucketed by the
gateway-side started_at timestamp. Late-arriving samples (from a paused agent)
landed in already-rolled-up buckets and were silently dropped. Switched to
ingested_at and the problem vanished.
Mistake 2: Per-call gRPC. The first iteration streamed each sample as it
happened. Latency on the gateway was fine, but the CP got hammered. Switching to
5s-batched MetricsReport cut CP CPU by ~95%.
Mistake 3: Storing error messages. The first iteration stored the full error text. PII risk. Switched to BLAKE3 hashes. Operators correlate by hash; the corpus table is opt-in for richer drill-down.
What you can do with the data
The dashboard's Tool activity view is built on the same APIs Prometheus uses, just rendered live. You can:
- Filter by plugin, tool, outcome, time range
- Sort by call count, error rate, p95 latency
- Drill from a metric anomaly to the specific samples that contributed
- Open an audit trace for any sample showing identity → policy → dispatch
- Export anything as CSV or NDJSON
For Enterprise customers entitled to payload_capture, requests and responses
are encrypted per-tenant and surfaced with tenant-bound decrypt-on-view. See the
security model article for the encryption details.
That's the pipeline. Eight stages, ~3000 lines of Rust, designed so a single CP replica can ingest 100k calls/sec without breaking.