
Kubernetes operator

Deploy and run MCPG on Kubernetes with the operator and Helm chart — the eight mcpg.dev CRDs, the validating admission webhook, cert-manager TLS, metrics and probe endpoints, and the air-gap path.

The MCPG operator reconciles declarative custom resources into the Kubernetes objects a gateway fleet needs (Deployment, Service, ConfigMap, ServiceAccount) and validates every CRD write through an admission webhook. This page covers installing it with Helm and the day-2 surfaces an SRE manages. For the field-by-field CRD schemas, see the Operator CRDs reference; for the end-to-end install walkthrough, see the Kubernetes install guide.

What the operator is

  • API group / version: mcpg.dev/v1alpha2.
  • Eight CRDs (below) — gateways, plugins, plugin sets, revocation lists, clusters, routes, tenants, and plugin mirrors.
  • A validating admission webhook (validating only — it never mutates your resources).
  • Two endpoints: metrics + health on :8443, the webhook server on :9443.

The webhook fails closed by default (failurePolicy: Fail): if the operator is unreachable, writes to the mcpg.dev CRDs are rejected. That is the right trade-off for correctness, but it means the webhook's TLS must be valid: use cert-manager (recommended) or pre-provision the Secret.

Install with Helm

bash
helm install mcpg-operator ./helm/charts/mcpg-operator \
  --namespace mcpg-system \
  --create-namespace \
  --set certManager.enabled=true

Verify the operator is up:

bash
kubectl -n mcpg-system get pods
# mcpg-operator-7d8c4b9b8-xj2pq   1/1   Running

The chart is not yet published to a public registry — install from a local checkout for the beta. The admission webhook fails closed, so enable cert-manager or pre-provision tls.secretName.
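
If cert-manager is not available, the pre-provisioned path needs a certificate whose subject alternative names match the DNS name the API server dials for the webhook Service, which is `<service>.<namespace>.svc`. The following is a minimal sketch of that path, not the chart's documented procedure. The Service name `mcpg-operator-webhook` and the Secret name `mcpg-operator-webhook-tls` are hypothetical placeholders, so check the rendered chart for the real ones. It also assumes the default `cluster.local` cluster domain and OpenSSL 1.1.1 or later (for `-addext`).

```shell
# Build the subjectAltName list for a webhook Service.
# The API server reaches a webhook at <service>.<namespace>.svc.
webhook_sans() {
  svc="$1"; ns="$2"
  printf 'DNS:%s.%s.svc,DNS:%s.%s.svc.cluster.local' "$svc" "$ns" "$svc" "$ns"
}

# Not invoked here: generates a self-signed key pair and stores it as a TLS Secret.
# Service and Secret names are hypothetical; the namespace must already exist.
provision_webhook_tls() {
  svc="${1:-mcpg-operator-webhook}"
  ns="${2:-mcpg-system}"
  secret="${3:-mcpg-operator-webhook-tls}"
  openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
    -keyout tls.key -out tls.crt \
    -subj "/CN=${svc}.${ns}.svc" \
    -addext "subjectAltName=$(webhook_sans "$svc" "$ns")"
  kubectl -n "$ns" create secret tls "$secret" --cert=tls.crt --key=tls.key
}
```

After calling `provision_webhook_tls`, the install would pass `--set certManager.enabled=false --set tls.secretName=<secret>`. The webhook configuration must also trust the issuing certificate through its CA bundle; how the chart wires that is not covered on this page, so verify it before relying on a self-signed pair.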

Prerequisites

  • Kubernetes 1.28+.
  • cert-manager 1.13+ for webhook TLS (recommended).
  • prometheus-operator if you set serviceMonitor.enabled=true.
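
The prerequisites above can be checked before installing. This is a rough preflight sketch, assuming `kubectl`, `jq`, and GNU `sort -V` are available on the workstation. It only checks that the cert-manager and prometheus-operator CRDs exist; it does not verify the cert-manager version.

```shell
# True when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Not invoked here: compares the API server version against 1.28 and looks
# for the CRDs that cert-manager and prometheus-operator install.
preflight() {
  server="$(kubectl version -o json \
    | jq -r '.serverVersion | "\(.major).\(.minor)"' | tr -d '+')"
  version_ge "$server" 1.28 \
    || { echo "Kubernetes ${server} is older than 1.28" >&2; return 1; }
  kubectl get crd certificates.cert-manager.io >/dev/null 2>&1 \
    || echo "cert-manager CRDs not found; pre-provision tls.secretName instead" >&2
  kubectl get crd servicemonitors.monitoring.coreos.com >/dev/null 2>&1 \
    || echo "prometheus-operator CRDs not found; keep serviceMonitor.enabled=false" >&2
}
```

Run `preflight` against the target cluster context. The `tr -d '+'` strips the suffix some distributions append to the minor version.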

Key values

Group               Purpose
image.*             Operator container image (defaults to ghcr.io/mcpg-dev/source-code/mcpg-operator:<appVersion>).
webhook.*           Validating webhook: failurePolicy (default Fail), bindAddress (0.0.0.0:9443), servicePort.
certManager.*       Auto TLS via cert-manager. Recommended for production.
tls.secretName      Pre-provisioned TLS Secret (when cert-manager is disabled).
operator.*          Runtime config: watchNamespace, logFilter, logFormat, resyncIntervalSecs, reconcileConcurrency.
metrics.port        Operator metrics + /healthz + /readyz (default 8443).
serviceMonitor.*    prometheus-operator integration.
prometheusRule.*    Bundled alert pack (reconcile staleness / error rate / p95 latency / OCI pull failures).
grafanaDashboard.*  ConfigMap-delivered Grafana dashboard JSON.
networkPolicy.*     Default-deny posture; set egress CIDRs for OCI registries + Sigstore.
resources.*         Operator pod caps (default requests 100m / 128Mi, limits 1 / 512Mi).

Pre-tuned values-medium.yaml and values-large.yaml ship alongside the defaults.

Replica count

The operator runs at replicaCount: 1 during the beta. Leader-election scaffolding is in place, but multi-replica HA is not yet wired. A PodDisruptionBudget is only rendered when replicaCount > 1, because a PDB on a single-replica Deployment blocks every node drain: evicting the only pod would always violate the budget.
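
One way to confirm the PodDisruptionBudget behaviour without touching a cluster is to render the chart locally and count the PDB documents. This is a sketch that assumes the local checkout path used in the install command and that the chart renders with default values otherwise.

```shell
# Count documents of a given kind in a manifest stream read from stdin.
count_kind() {
  grep -c "^kind: $1\$" || true
}

# Not invoked here: render the chart at a given replica count and count PDBs.
pdb_count_at() {
  helm template mcpg-operator ./helm/charts/mcpg-operator \
    --namespace mcpg-system --set replicaCount="$1" \
    | count_kind PodDisruptionBudget
}
```

If the chart behaves as described, `pdb_count_at 1` should report no PDB and `pdb_count_at 2` should report one.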

The eight CRDs

The chart ships all eight under crds/. Three are namespaced (the user-facing resources a tenant manages); five are cluster-scoped (fleet-wide artifacts and policy).

CRD                 Scope       Purpose
MCPGGateway         Namespaced  A gateway deployment. spec.config is the gateway's own AppConfig verbatim.
MCPGPluginSet       Namespaced  An ordered list of plugin entries bound to a gateway.
MCPGRoute           Namespaced  Per-tenant routing into a shared gateway (soft multi-tenancy).
MCPGPlugin          Cluster     A distributable plugin artifact (OCI ref + signature).
MCPGCluster         Cluster     A shared cluster-coordination backend referenced by gateways.
MCPGTenant          Cluster     A declarative tenant boundary: owned namespaces, plugin allowlist, quotas.
MCPGRevocationList  Cluster     Revoked plugin digests / signers, enforced fail-closed.
MCPGPluginMirror    Cluster     In-cluster OCI mirror for air-gapped plugin distribution.

Full schemas, status conditions, and examples are in the Operator CRDs reference.
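
After an install, the three-namespaced / five-cluster-scoped split can be checked against API discovery. This is a small sketch that assumes `kubectl` points at the cluster where the chart was installed.

```shell
# Compare observed counts against the 3 namespaced / 5 cluster-scoped split.
scope_split_ok() {
  [ "$1" -eq 3 ] && [ "$2" -eq 5 ]
}

# Not invoked here: ask API discovery for the mcpg.dev resources by scope.
check_crd_scopes() {
  namespaced="$(kubectl api-resources --api-group=mcpg.dev --namespaced=true -o name | wc -l | tr -d ' ')"
  cluster="$(kubectl api-resources --api-group=mcpg.dev --namespaced=false -o name | wc -l | tr -d ' ')"
  echo "namespaced=${namespaced} cluster-scoped=${cluster}"
  scope_split_ok "$namespaced" "$cluster"
}
```

`check_crd_scopes` returns non-zero when the counts differ, which usually means one or more CRDs were not installed or a stale copy predates the chart.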

CRD lifecycle

Per Helm's rules, files under crds/ are installed on the first helm install (existing CRDs are left untouched), but Helm never upgrades or deletes them on helm upgrade / helm uninstall. To pick up CRD schema changes, re-apply them out-of-band:

bash
kubectl apply -f helm/charts/mcpg-operator/crds/
helm upgrade mcpg-operator ./helm/charts/mcpg-operator -n mcpg-system --reuse-values

There is no automatic CRD pre-upgrade hook. For an independent CRD lifecycle (GitOps / OLM), apply the CRDs before the chart; Helm then skips the pre-existing ones.
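
Because CRD changes are applied out-of-band, it can help to preview them first. The sketch below uses `kubectl diff`, which exits 0 when nothing would change, 1 when differences exist, and a higher code on error. The helper names are illustrative only.

```shell
# Translate a kubectl diff exit status into a readable verdict.
diff_verdict() {
  case "$1" in
    0) echo "no changes" ;;
    1) echo "changes pending" ;;
    *) echo "error" ;;
  esac
}

# Not invoked here: preview CRD schema changes before re-applying them.
preview_crd_changes() {
  kubectl diff -f helm/charts/mcpg-operator/crds/
  rc=$?
  diff_verdict "$rc"
  return "$rc"
}
```

In a pipeline, a verdict of "changes pending" is the signal to run the `kubectl apply` step before `helm upgrade`.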

Managing a gateway via CRD

MCPGGateway.spec.config is the gateway's own AppConfig schema verbatim (snake_case, deny_unknown_fields) — the same shape the standalone gateway boots from. The operator is schema-blind here: it passes spec.config straight through, so validate it with mcpg-config-check before committing.

yaml
apiVersion: mcpg.dev/v1alpha2
kind: MCPGGateway
metadata:
  name: gw
  namespace: mcpg
spec:
  replicas: 3
  pluginSetRef:
    name: production
  config:
    gateway:
      server:
        bind_address: "0.0.0.0:8787"
        allowed_origins:
          - "https://gateway.example.com"
    cluster:
      kind: nats
      servers: ["nats://nats.mcpg.svc:4222"]
      node:
        id: "${env.HOSTNAME}"
    mcp:
      capabilities:
        tools:
          - name: api.account.lookup
            description: Look up an account by id.
            backend:
              kind: http
              url: "https://accounts.internal.example.com/v1/accounts/${$arguments.account_id}"
              method: get
              expected_status_codes: [200]
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10

Tools live under spec.config.mcp.capabilities.tools[] — see Deployment topologies for the per-shape config and Clustering for the cluster-backend keys.
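
A server-side dry run is one way to exercise the admission webhook before committing a gateway manifest. The sketch below assumes the manifest above is saved as `gateway.yaml` (a hypothetical file name) and that the CRD's singular resource name is the lowercased kind, `mcpggateway`. A server-side dry run only reaches a webhook that declares its side effects as None or NoneOnDryRun; otherwise the API server rejects the request.

```shell
# Not invoked here: send the manifest through admission (including validating
# webhooks) without persisting it.
dry_run_gateway() {
  kubectl apply --dry-run=server -f "${1:-gateway.yaml}"
}

# Not invoked here: apply for real, then print the resource's status block.
# "mcpggateway" is an assumed resource name; "gw" and "mcpg" come from the
# example manifest.
apply_gateway() {
  kubectl apply -f "${1:-gateway.yaml}" &&
    kubectl -n mcpg get mcpggateway gw -o jsonpath='{.status}'
}
```

Since the operator passes spec.config through without checking it, a clean dry run says nothing about the AppConfig contents. That part still needs mcpg-config-check.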

Monitoring the operator

The operator exposes its own metrics on :8443/metrics, with liveness at /healthz and readiness at /readyz (both HTTPS). Wire them up:

yaml
# values.yaml
serviceMonitor:
  enabled: true       # requires prometheus-operator CRDs
prometheusRule:
  enabled: true       # ships the default operator alert pack
grafanaDashboard:
  enabled: true       # ConfigMap auto-discovered by a Grafana sidecar

The bundled PrometheusRule alerts on reconcile staleness, transient-error rate, p95 reconcile latency, and OCI-pull failure rate — tune the thresholds under prometheusRule.thresholds. This is distinct from the gateway's own observability — gateway metrics / traces / logs are covered in Observability.
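
For a quick manual check of the operator endpoints, a port-forward plus `curl` is usually enough. This sketch assumes the Deployment is named `mcpg-operator` (inferred from the pod name shown earlier) and uses `-k` because the serving certificate will not match `localhost`. Whether `/metrics` requires authentication is not stated here, so a 401 or 403 on that path may be expected.

```shell
# Build the local URL for an operator endpoint path.
probe_url() {
  printf 'https://localhost:%s%s' "${2:-8443}" "$1"
}

# Not invoked here: forward the metrics port and print the HTTP status of
# each endpoint.
check_operator_endpoints() {
  kubectl -n mcpg-system port-forward deploy/mcpg-operator 8443:8443 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2
  for path in /healthz /readyz /metrics; do
    code="$(curl -sk -o /dev/null -w '%{http_code}' "$(probe_url "$path")")"
    echo "${path} -> ${code}"
  done
  kill "$pf_pid"
}
```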

Hardening and air-gap

The chart ships a hardened pod posture out of the box: non-root (65534), read-only root filesystem, all capabilities dropped, seccompProfile: RuntimeDefault.

  • NetworkPolicy (networkPolicy.enabled) applies a default-deny posture: ingress only from kube-apiserver (webhook) and Prometheus (scrape), egress to apiserver + DNS plus the CIDRs you list. Note: OCI plugin pulls and cosign verification will fail unless you set networkPolicy.egress.cidrs to your registry / Sigstore reachability ranges.
  • Air-gap: mount a pre-mirrored Sigstore trust root via operator.sigstoreTrustRoot.enabled so cosign keyless verification works with no network, and use MCPGPluginMirror for an in-cluster OCI mirror with fail-closed pull rewriting.
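
The two hardening settings above can be captured in a values overlay. This is a sketch only: the key names come from this page, but the shape of `networkPolicy.egress.cidrs` is assumed to be a list of CIDR strings, the addresses are documentation placeholders, and `operator.sigstoreTrustRoot` may need further fields (for example, a reference to the mounted trust root) that are not described here.

```shell
# Write a values overlay for the NetworkPolicy and air-gap settings.
cat > values-hardening.yaml <<'EOF'
networkPolicy:
  enabled: true
  egress:
    cidrs:                  # assumed to be a list of CIDR strings
      - 203.0.113.0/24      # placeholder: your OCI registry range
      - 198.51.100.0/24     # placeholder: your Sigstore mirror range
operator:
  sigstoreTrustRoot:
    enabled: true           # air-gap only; the trust root must be mounted
EOF

# Not invoked here: roll the overlay into the existing release.
apply_hardening() {
  helm upgrade mcpg-operator ./helm/charts/mcpg-operator \
    -n mcpg-system --reuse-values -f values-hardening.yaml
}
```

Keeping the overlay in its own file makes the egress ranges reviewable separately from the rest of the release values.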

What's next