Kubernetes operator

Deploy and run MCPG on Kubernetes with the operator and Helm chart — the eight mcpg.dev CRDs, the validating admission webhook, cert-manager TLS, metrics and probe endpoints, and the air-gap path.

The MCPG operator reconciles declarative custom resources into the Kubernetes objects a gateway fleet needs (Deployment, Service, ConfigMap, ServiceAccount) and validates every CRD write through an admission webhook. This page covers installing it with Helm and the day-2 surfaces an SRE manages. For the field-by-field CRD schemas, see the Operator CRDs reference; for the end-to-end install walkthrough, see the Kubernetes install guide.

What the operator is

API group / version: mcpg.dev/v1alpha2.
Eight CRDs (below) — gateways, plugins, plugin sets, revocation lists, clusters, routes, tenants, and plugin mirrors.
A validating admission webhook (validating only — it never mutates your resources).
Two endpoints: metrics + health on :8443, the webhook server on :9443.

The webhook fails closed by default (failurePolicy: Fail): if the operator is unreachable, CRD edits are blocked. Failing closed is the safe posture, because no unvalidated resource can be admitted while the webhook is down. It also means the webhook must serve a TLS certificate the API server trusts: use cert-manager (recommended) or pre-provision the Secret.

Install with Helm

bash

helm install mcpg-operator ./helm/charts/mcpg-operator \
  --namespace mcpg-system \
  --create-namespace \
  --set certManager.enabled=true

Verify the operator is up:

bash

kubectl -n mcpg-system get pods
# mcpg-operator-7d8c4b9b8-xj2pq   1/1   Running

The chart is not yet published to a public registry — install from a local checkout for the beta. The admission webhook fails closed, so enable cert-manager or pre-provision tls.secretName.
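Once the operator is running, the fail-closed posture can be confirmed on the webhook registration itself. The following is a minimal sketch, not something the chart ships: it lists every ValidatingWebhookConfiguration with its failurePolicy and filters on the string mcpg. That filter is an assumption, since the name of the webhook configuration object is not documented on this page.

```shell
# List validating webhook configurations together with their failurePolicy.
# Assumption: the operator's configuration has "mcpg" somewhere in its name.
mcpg_webhook_policy() {
  kubectl get validatingwebhookconfigurations \
    -o 'custom-columns=NAME:.metadata.name,POLICY:.webhooks[*].failurePolicy' \
    --no-headers | grep -i mcpg
}

# Succeeds only when a matching configuration exists and none is set to Ignore.
webhook_fails_closed() {
  local policies
  policies="$(mcpg_webhook_policy)" || return 1
  ! printf '%s\n' "$policies" | grep -qw Ignore
}
```

Run `webhook_fails_closed && echo "fail-closed"` after the install. A non-zero exit means either no matching configuration was found or at least one webhook is set to Ignore.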

Prerequisites

Kubernetes 1.28+.
cert-manager 1.13+ for webhook TLS (recommended).
prometheus-operator if you set serviceMonitor.enabled=true.
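The prerequisites above can be checked before installing. This is a hedged preflight sketch: the function names are illustrative, the CRD names are the standard ones registered by cert-manager and prometheus-operator, and it does not verify the cert-manager version (1.13+), only that cert-manager is present.

```shell
# version_at_least MIN ACTUAL: succeeds when ACTUAL >= MIN.
# Relies on version sort (sort -V), available in GNU coreutils and BusyBox.
version_at_least() {
  [ -n "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# server_version: prints the API server version, e.g. "1.29.3".
server_version() {
  kubectl get --raw /version | sed -n 's/.*"gitVersion": *"v\([0-9][0-9.]*\).*/\1/p'
}

# crd_exists NAME: succeeds when the named CRD is registered in the cluster.
crd_exists() {
  kubectl get crd "$1" >/dev/null 2>&1
}

# preflight: fails on an old cluster, warns on missing optional components.
preflight() {
  local v
  v="$(server_version)"
  if ! version_at_least 1.28 "$v"; then
    echo "FAIL: Kubernetes 1.28+ required, found '${v:-unknown}'"
    return 1
  fi
  echo "ok: Kubernetes $v"
  crd_exists certificates.cert-manager.io \
    || echo "warn: cert-manager not found; pre-provision tls.secretName instead"
  crd_exists servicemonitors.monitoring.coreos.com \
    || echo "warn: prometheus-operator not found; keep serviceMonitor.enabled=false"
  return 0
}
```

Only the Kubernetes version is treated as a hard failure here, because cert-manager is recommended rather than required and prometheus-operator matters only when serviceMonitor.enabled=true.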

Key values

Group	Purpose
`image.*`	Operator container image (defaults to `ghcr.io/mcpg-dev/source-code/mcpg-operator:<appVersion>`).
`webhook.*`	Validating webhook — `failurePolicy` (default `Fail`), `bindAddress` (`0.0.0.0:9443`), `servicePort`.
`certManager.*`	Auto TLS via cert-manager. Recommended for production.
`tls.secretName`	Pre-provisioned TLS Secret (when cert-manager is disabled).
`operator.*`	Runtime config — `watchNamespace`, `logFilter`, `logFormat`, `resyncIntervalSecs`, `reconcileConcurrency`.
`metrics.port`	Operator metrics + `/healthz` + `/readyz` (default `8443`).
`serviceMonitor.*`	prometheus-operator integration.
`prometheusRule.*`	Bundled alert pack (reconcile staleness / error rate / p95 latency / OCI pull failures).
`grafanaDashboard.*`	ConfigMap-delivered Grafana dashboard JSON.
`networkPolicy.*`	Default-deny posture; set egress CIDRs for OCI registries + Sigstore.
`resources.*`	Operator pod resources (default CPU/memory requests `100m`/`128Mi`, limits `1`/`512Mi`).

Pre-tuned values-medium.yaml and values-large.yaml ship alongside the defaults.

Replica count

The operator runs at replicaCount: 1 during the beta — leader-election scaffolding is in place but multi-replica HA is not yet wired. A PodDisruptionBudget is only rendered when replicaCount > 1 (a PDB on a single-replica Deployment blocks every node drain).

The eight CRDs

The chart ships all eight under crds/. Three are namespaced (the user-facing resources a tenant manages); five are cluster-scoped (fleet-wide artifacts and policy).

CRD	Scope	Purpose
`MCPGGateway`	Namespaced	A gateway deployment. `spec.config` is the gateway's own `AppConfig` verbatim.
`MCPGPluginSet`	Namespaced	An ordered list of plugin entries bound to a gateway.
`MCPGRoute`	Namespaced	Per-tenant routing into a shared gateway (soft multi-tenancy).
`MCPGPlugin`	Cluster	A distributable plugin artifact (OCI ref + signature).
`MCPGCluster`	Cluster	A shared cluster-coordination backend referenced by gateways.
`MCPGTenant`	Cluster	A declarative tenant boundary — owned namespaces, plugin allowlist, quotas.
`MCPGRevocationList`	Cluster	Revoked plugin digests / signers, enforced fail-closed.
`MCPGPluginMirror`	Cluster	In-cluster OCI mirror for air-gapped plugin distribution.

Full schemas, status conditions, and examples are in the Operator CRDs reference.
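To confirm that all eight CRDs landed with the expected scope, something like the following sketch can be run after the install. The function names are illustrative; the only assumption about the chart is that every CRD name ends in .mcpg.dev, which follows from the API group.

```shell
# count_mcpg_crds: prints the number of registered CRDs in the mcpg.dev group.
count_mcpg_crds() {
  kubectl get crds -o name | grep -c '\.mcpg\.dev$'
}

# verify_mcpg_crds: checks the count, then lists the kinds with their scope.
verify_mcpg_crds() {
  local n
  n="$(count_mcpg_crds)"
  if [ "$n" -ne 8 ]; then
    echo "expected 8 mcpg.dev CRDs, found $n" >&2
    return 1
  fi
  # The NAMESPACED column should read true for three kinds and false for five.
  kubectl api-resources --api-group=mcpg.dev
}
```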

CRD lifecycle

Per Helm's rules, files under crds/ are installed on the first helm install (existing CRDs are left untouched), but Helm never upgrades or deletes them on helm upgrade / helm uninstall. To pick up CRD schema changes, re-apply them out-of-band:

bash

kubectl apply -f helm/charts/mcpg-operator/crds/
helm upgrade mcpg-operator ./helm/charts/mcpg-operator -n mcpg-system --reuse-values

There is no automatic CRD pre-upgrade hook. For an independent CRD lifecycle (GitOps / OLM), apply the CRDs before the chart; Helm then skips the pre-existing ones.
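The CRD-first path can be sketched as a small wrapper: apply the CRDs, wait until the API server reports them Established, then install or upgrade the chart with Helm's --skip-crds flag so Helm does not touch them. The function name and the CHART_DIR variable are illustrative, and server-side apply is a choice made for this sketch rather than something the chart requires.

```shell
CHART_DIR="${CHART_DIR:-./helm/charts/mcpg-operator}"

# install_with_external_crds: manage CRDs out-of-band, then hand the rest to Helm.
# Each step runs only if the previous one succeeded.
install_with_external_crds() {
  kubectl apply --server-side -f "$CHART_DIR/crds/" &&
  kubectl wait --for=condition=Established --timeout=60s -f "$CHART_DIR/crds/" &&
  helm upgrade --install mcpg-operator "$CHART_DIR" \
    --namespace mcpg-system --create-namespace \
    --skip-crds \
    --set certManager.enabled=true
}
```

Waiting for the Established condition avoids a race in which the operator starts before the API server serves the new types.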

Managing a gateway via CRD

MCPGGateway.spec.config is the gateway's own AppConfig schema verbatim (snake_case, deny_unknown_fields) — the same shape the standalone gateway boots from. The operator is schema-blind here: it passes spec.config straight through, so validate it with mcpg-config-check before committing.

yaml

apiVersion: mcpg.dev/v1alpha2
kind: MCPGGateway
metadata:
  name: gw
  namespace: mcpg
spec:
  replicas: 3
  pluginSetRef:
    name: production
  config:
    gateway:
      server:
        bind_address: "0.0.0.0:8787"
        allowed_origins:
          - "https://gateway.example.com"
    cluster:
      kind: nats
      servers: ["nats://nats.mcpg.svc:4222"]
      node:
        id: "${env.HOSTNAME}"
    mcp:
      capabilities:
        tools:
          - name: api.account.lookup
            description: Look up an account by id.
            backend:
              kind: http
              url: "https://accounts.internal.example.com/v1/accounts/${$arguments.account_id}"
              method: get
              expected_status_codes: [200]
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10

Tools live under spec.config.mcp.capabilities.tools[] — see Deployment topologies for the per-shape config and Clustering for the cluster-backend keys.
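Because every CRD write passes through the validating webhook, a server-side dry run is a cheap way to see whether a manifest would be admitted before it is persisted. The sketch below assumes the manifest above is saved as gateway.yaml (a hypothetical filename) and that the operator's webhook is called for dry-run requests. It does not replace mcpg-config-check, since the operator passes spec.config through without validating it.

```shell
# submit_gateway FILE: dry-run against the API server first, apply only if admitted.
submit_gateway() {
  local manifest="$1"
  kubectl apply --dry-run=server -f "$manifest" || {
    echo "rejected at admission; not applying $manifest" >&2
    return 1
  }
  # Apply for real, then print the resulting object.
  kubectl apply -f "$manifest" && kubectl get -f "$manifest"
}
```

Usage: `submit_gateway gateway.yaml`.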

Monitoring the operator

The operator exposes its own metrics on :8443/metrics, with liveness at /healthz and readiness at /readyz (both HTTPS). Wire them up:

yaml

# values.yaml
serviceMonitor:
  enabled: true       # requires prometheus-operator CRDs
prometheusRule:
  enabled: true       # ships the default operator alert pack
grafanaDashboard:
  enabled: true       # ConfigMap auto-discovered by a Grafana sidecar

The bundled PrometheusRule alerts on reconcile staleness, transient-error rate, p95 reconcile latency, and OCI-pull failure rate — tune the thresholds under prometheusRule.thresholds. This is distinct from the gateway's own observability — gateway metrics / traces / logs are covered in Observability.
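Before wiring Prometheus, the probe endpoints can be smoke-tested by hand through a port-forward. This is a sketch under stated assumptions: the Deployment is named mcpg-operator (inferred from the pod name shown earlier), the probes answer without client authentication, and PF_WAIT is a hypothetical knob for the tunnel start-up delay.

```shell
# probe_operator [NAMESPACE]: port-forwards to the operator and checks both probes.
# -k skips certificate verification, which is only reasonable for a local smoke test.
probe_operator() {
  local ns="${1:-mcpg-system}" pf_pid path code rc=0
  kubectl -n "$ns" port-forward deploy/mcpg-operator 8443:8443 >/dev/null 2>&1 &
  pf_pid=$!
  sleep "${PF_WAIT:-2}"   # give the tunnel a moment to come up
  for path in /healthz /readyz; do
    code="$(curl -sk -o /dev/null -w '%{http_code}' "https://127.0.0.1:8443$path")"
    echo "$path -> $code"
    [ "$code" = "200" ] || rc=1
  done
  # Tear the tunnel down and reap the background job.
  { kill "$pf_pid"; wait "$pf_pid"; } 2>/dev/null || true
  return "$rc"
}
```

The function prints one line per probe and returns non-zero if either endpoint answers with anything other than 200.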

Hardening and air-gap

The chart ships a hardened pod posture out of the box: runs as non-root (UID 65534), read-only root filesystem, all capabilities dropped, seccompProfile: RuntimeDefault.

NetworkPolicy (networkPolicy.enabled) applies a default-deny posture: ingress only from kube-apiserver (webhook) and Prometheus (scrape), egress to apiserver + DNS plus the CIDRs you list. Note: OCI plugin pulls and cosign verification will fail unless networkPolicy.egress.cidrs covers the address ranges of your OCI registries and Sigstore endpoints.
Air-gap: mount a pre-mirrored Sigstore trust root via operator.sigstoreTrustRoot.enabled so cosign keyless verification works with no network, and use MCPGPluginMirror for an in-cluster OCI mirror with fail-closed pull rewriting.
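For the NetworkPolicy egress setting, a hedged sketch of turning the policy on together with an allowlist follows. It assumes networkPolicy.egress.cidrs is a list of CIDR strings (the key comes from this page, its exact shape does not) and uses Helm's {a,b} list syntax for --set. The function name is illustrative, and the CIDRs in the usage line are documentation ranges, not real registry addresses.

```shell
# enable_network_policy CIDR...: enables the default-deny policy and allows
# egress to the given ranges (OCI registries, Sigstore).
enable_network_policy() {
  [ "$#" -gt 0 ] || { echo "usage: enable_network_policy CIDR..." >&2; return 1; }
  local cidrs
  # Join the arguments with commas for Helm's {a,b} list syntax.
  cidrs="$(IFS=,; echo "$*")"
  helm upgrade mcpg-operator ./helm/charts/mcpg-operator -n mcpg-system --reuse-values \
    --set networkPolicy.enabled=true \
    --set "networkPolicy.egress.cidrs={$cidrs}"
}
```

Usage: `enable_network_policy 203.0.113.0/24 198.51.100.0/24`. Substitute the ranges your registry and Sigstore endpoints actually resolve to.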

What's next

Operator CRDs reference — every CRD's schema + status conditions
Kubernetes install guide — end-to-end walkthrough
Clustering — cluster.kind keys the operator wires
Configuration reference — the spec.config schema