Kubernetes operator
Deploy and run MCPG on Kubernetes with the operator and Helm chart — the eight mcpg.dev CRDs, the validating admission webhook, cert-manager TLS, metrics and probe endpoints, and the air-gap path.
The MCPG operator reconciles declarative custom resources into the Kubernetes objects a
gateway fleet needs — Deployment, Service, ConfigMap, ServiceAccount — and validates
every CRD write through an admission webhook. This page covers installing it with Helm and the
day-2 surfaces an SRE manages. For the field-by-field CRD schemas, see
Operator CRDs reference; for the end-to-end install
walkthrough, the Kubernetes install guide.
What the operator is
- API group / version: mcpg.dev/v1alpha2.
- Eight CRDs (below): gateways, plugins, plugin sets, revocation lists, clusters, routes, tenants, and plugin mirrors.
- A validating admission webhook (validating only; it never mutates your resources).
- Two endpoints: metrics + health on :8443, the webhook server on :9443.
The webhook fails closed by default (failurePolicy: Fail): if the operator is
unreachable, CRD edits are blocked. That is the right trade-off for correctness, but it means
the webhook's TLS must be properly provisioned: use cert-manager (recommended) or pre-provision the Secret.
Install with Helm
helm install mcpg-operator ./helm/charts/mcpg-operator \
  --namespace mcpg-system \
  --create-namespace \
  --set certManager.enabled=true
Verify the operator is up:
kubectl -n mcpg-system get pods
# mcpg-operator-7d8c4b9b8-xj2pq 1/1 Running
The chart is not yet published to a public registry; install from a local checkout for the beta. The admission webhook fails closed, so enable cert-manager or pre-provision the Secret named by tls.secretName.
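If cert-manager is not available, the pre-provisioned path can be sketched as a standard kubernetes.io/tls Secret plus a values override. This is a minimal sketch rather than the chart's documented procedure: the Secret name mcpg-operator-webhook-tls is hypothetical, the certificate data is a placeholder, and it assumes the certificate covers the webhook Service's in-cluster DNS name.

```yaml
# File 1 (applied with kubectl): a standard TLS Secret in the operator namespace.
# Assumption: the name is hypothetical and the data values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: mcpg-operator-webhook-tls
  namespace: mcpg-system
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded PEM certificate>
  tls.key: <base64-encoded PEM private key>
---
# File 2 (Helm values override): disable cert-manager and point at the Secret.
certManager:
  enabled: false
tls:
  secretName: mcpg-operator-webhook-tls
```

The Secret has to exist before the operator pod starts, since a fail-closed webhook without a usable certificate blocks CRD writes.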
Prerequisites
- Kubernetes 1.28+.
- cert-manager 1.13+ for webhook TLS (recommended).
- prometheus-operator if you set serviceMonitor.enabled=true.
Key values
| Group | Purpose |
|---|---|
| image.* | Operator container image (defaults to ghcr.io/mcpg-dev/source-code/mcpg-operator:<appVersion>). |
| webhook.* | Validating webhook: failurePolicy (default Fail), bindAddress (0.0.0.0:9443), servicePort. |
| certManager.* | Auto TLS via cert-manager. Recommended for production. |
| tls.secretName | Pre-provisioned TLS Secret (when cert-manager is disabled). |
| operator.* | Runtime config: watchNamespace, logFilter, logFormat, resyncIntervalSecs, reconcileConcurrency. |
| metrics.port | Operator metrics + /healthz + /readyz (default 8443). |
| serviceMonitor.* | prometheus-operator integration. |
| prometheusRule.* | Bundled alert pack (reconcile staleness / error rate / p95 latency / OCI pull failures). |
| grafanaDashboard.* | ConfigMap-delivered Grafana dashboard JSON. |
| networkPolicy.* | Default-deny posture; set egress CIDRs for OCI registries + Sigstore. |
| resources.* | Operator pod caps (default 100m/128Mi → 1/512Mi). |
Pre-tuned values-medium.yaml and values-large.yaml ship alongside the defaults.
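As an illustration of how these groups combine, a small override file might look like the following. Only the key names come from the table above; every value shown, including the formats accepted by watchNamespace and logFormat, is an assumption to check against the chart's own values.yaml.

```yaml
# my-values.yaml (hypothetical file name)
webhook:
  failurePolicy: Fail        # chart default; keeps the webhook fail-closed
certManager:
  enabled: true
operator:
  watchNamespace: mcpg       # assumed value: restrict the operator to one namespace
  logFormat: json            # assumed value
  resyncIntervalSecs: 300    # assumed value
  reconcileConcurrency: 4    # assumed value
metrics:
  port: 8443                 # chart default
```

A file like this would be passed with -f my-values.yaml on helm install or helm upgrade.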
Replica count
The operator runs at replicaCount: 1 during the beta — leader-election scaffolding is in
place but multi-replica HA is not yet wired. A PodDisruptionBudget is only rendered when
replicaCount > 1 (a PDB on a single-replica Deployment blocks every node drain).
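Conditional rendering of this kind is a common Helm pattern. The template below is a sketch of that pattern, not the chart's actual template: the file path, resource name, label selector, and minAvailable value are all assumptions.

```yaml
{{- if gt (int .Values.replicaCount) 1 }}
# templates/pdb.yaml (hypothetical path): rendered only for multi-replica installs.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mcpg-operator                          # assumed name
spec:
  minAvailable: 1                              # assumed: keep one replica up during drains
  selector:
    matchLabels:
      app.kubernetes.io/name: mcpg-operator    # assumed label
{{- end }}
```

With a single replica and minAvailable: 1, no eviction can ever be satisfied, which is why the guard matters.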
The eight CRDs
The chart ships all eight under crds/. Three are namespaced (the user-facing resources a
tenant manages); five are cluster-scoped (fleet-wide artifacts and policy).
| CRD | Scope | Purpose |
|---|---|---|
| MCPGGateway | Namespaced | A gateway deployment. spec.config is the gateway's own AppConfig verbatim. |
| MCPGPluginSet | Namespaced | An ordered list of plugin entries bound to a gateway. |
| MCPGRoute | Namespaced | Per-tenant routing into a shared gateway (soft multi-tenancy). |
| MCPGPlugin | Cluster | A distributable plugin artifact (OCI ref + signature). |
| MCPGCluster | Cluster | A shared cluster-coordination backend referenced by gateways. |
| MCPGTenant | Cluster | A declarative tenant boundary: owned namespaces, plugin allowlist, quotas. |
| MCPGRevocationList | Cluster | Revoked plugin digests / signers, enforced fail-closed. |
| MCPGPluginMirror | Cluster | In-cluster OCI mirror for air-gapped plugin distribution. |
Full schemas, status conditions, and examples are in the Operator CRDs reference.
CRD lifecycle
Per Helm's rules, files under crds/ are installed on the first helm install (existing
CRDs are left untouched), but Helm never upgrades or deletes them on helm upgrade /
helm uninstall. To pick up CRD schema changes, re-apply them out-of-band:
kubectl apply -f helm/charts/mcpg-operator/crds/
helm upgrade mcpg-operator ./helm/charts/mcpg-operator -n mcpg-system --reuse-values
There is no automatic CRD pre-upgrade hook. For an independent CRD lifecycle (GitOps / OLM), apply the CRDs before the chart; Helm then skips the pre-existing ones.
Managing a gateway via CRD
MCPGGateway.spec.config is the gateway's own AppConfig schema verbatim (snake_case,
deny_unknown_fields) — the same shape the standalone gateway boots from. The operator is
schema-blind here: it passes spec.config straight through, so validate it with
mcpg-config-check before committing.
apiVersion: mcpg.dev/v1alpha2
kind: MCPGGateway
metadata:
  name: gw
  namespace: mcpg
spec:
  replicas: 3
  pluginSetRef:
    name: production
  config:
    gateway:
      server:
        bind_address: "0.0.0.0:8787"
        allowed_origins:
          - "https://gateway.example.com"
      cluster:
        kind: nats
        servers: ["nats://nats.mcpg.svc:4222"]
        node:
          id: "${env.HOSTNAME}"
    mcp:
      capabilities:
        tools:
          - name: api.account.lookup
            description: Look up an account by id.
            backend:
              kind: http
              url: "https://accounts.internal.example.com/v1/accounts/${$arguments.account_id}"
              method: get
              expected_status_codes: [200]
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
Tools live under spec.config.mcp.capabilities.tools[] — see
Deployment topologies for the per-shape config and
Clustering for the cluster-backend keys.
Monitoring the operator
The operator exposes its own metrics on :8443/metrics, with liveness at /healthz and
readiness at /readyz (both HTTPS). Wire them up:
# values.yaml
serviceMonitor:
  enabled: true       # requires prometheus-operator CRDs
prometheusRule:
  enabled: true       # ships the default operator alert pack
grafanaDashboard:
  enabled: true       # ConfigMap auto-discovered by a Grafana sidecar
The bundled PrometheusRule alerts on reconcile staleness, transient-error rate, p95
reconcile latency, and OCI-pull failure rate — tune the thresholds under
prometheusRule.thresholds. This is distinct from the gateway's own observability —
gateway metrics / traces / logs are covered in Observability.
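For reference, probes against the operator's health endpoints would take roughly the following shape in a pod spec. This is a sketch built from the standard Kubernetes probe fields; the chart presumably renders its own probes, and the timing values here are assumptions.

```yaml
# Container-level fragment (sits under spec.template.spec.containers[]).
livenessProbe:
  httpGet:
    path: /healthz
    port: 8443
    scheme: HTTPS            # the endpoints are served over HTTPS
  initialDelaySeconds: 10    # assumed timing
  periodSeconds: 10          # assumed timing
readinessProbe:
  httpGet:
    path: /readyz
    port: 8443
    scheme: HTTPS
  periodSeconds: 5           # assumed timing
```

The kubelet does not verify the serving certificate on HTTPS probes, so these checks work even with a privately issued certificate.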
Hardening and air-gap
The chart ships a hardened pod posture out of the box: non-root (65534), read-only root
filesystem, all capabilities dropped, seccompProfile: RuntimeDefault.
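In standard Kubernetes terms, that posture corresponds to securityContext settings along these lines. This is a sketch of the equivalent fields, not a copy of the chart's template; the split between pod-level and container-level fields and the container name are assumptions.

```yaml
# Fragment of a pod spec (spec.template.spec in a Deployment).
securityContext:               # pod-level
  runAsNonRoot: true
  runAsUser: 65534
  seccompProfile:
    type: RuntimeDefault
containers:
  - name: operator             # hypothetical container name
    securityContext:           # container-level
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```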
- NetworkPolicy (networkPolicy.enabled) applies a default-deny posture: ingress only from kube-apiserver (webhook) and Prometheus (scrape), egress to apiserver + DNS plus the CIDRs you list. Note: OCI plugin pulls and cosign verification will fail unless you set networkPolicy.egress.cidrs to your registry / Sigstore reachability ranges.
- Air-gap: mount a pre-mirrored Sigstore trust root via operator.sigstoreTrustRoot.enabled so cosign keyless verification works with no network, and use MCPGPluginMirror for an in-cluster OCI mirror with fail-closed pull rewriting.
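Taken together, the hardening and air-gap switches might be set as follows. The key names come from the text above; the CIDRs are documentation-range placeholders, the list-of-strings shape for cidrs is an assumption, and any further sub-keys the trust-root mount requires are left out because they are not covered here.

```yaml
# Helm values override (sketch)
networkPolicy:
  enabled: true
  egress:
    cidrs:
      - 203.0.113.0/24     # placeholder: your OCI registry range
      - 198.51.100.0/24    # placeholder: your Sigstore or mirror range
operator:
  sigstoreTrustRoot:
    enabled: true          # mount the pre-mirrored Sigstore trust root
```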
What's next
- Operator CRDs reference: every CRD's schema + status conditions
- Kubernetes install guide: end-to-end walkthrough
- Clustering: cluster.kind keys the operator wires
- Configuration reference: the spec.config schema