·2 min read·← All posts
GKE Kubernetes Multi-Agent AI Production

What’s stateful

A multi-agent stack typically has:

Three of those (model files, agent memory, sometimes vector index) want StatefulSets and PVCs.

The patterns

StatefulSet for ordered identity. Pod names are predictable (my-agent-0, my-agent-1); each gets its own PVC. Restarting my-agent-0 mounts the same volume — the model cache survives.

apiVersion: apps/v1
kind: StatefulSet
metadata: { name: ollama }
spec:
  replicas: 3
  serviceName: ollama
  template:
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        volumeMounts:
        - { name: model-cache, mountPath: /root/.ollama }
  volumeClaimTemplates:
  - metadata: { name: model-cache }
    spec:
      accessModes: [ReadWriteOnce]
      resources: { requests: { storage: 100Gi } }
      storageClassName: pd-ssd

Headless service for direct pod addressing. Each pod is reachable at my-agent-0.svc.cluster.local. Useful for sticky routing to a specific instance.

Gateway API for traffic management. Replaces Ingress with a typed, role-separated API. The platform team manages the Gateway; application teams manage HTTPRoutes against it.

GPU node pools, taint + tolerate. AI workloads need GPUs; not every pod needs them. Dedicate GPU nodes to GPU pods via taints; everything else lands on CPU pools.

What broke

PodDisruptionBudgets too strict. Setting maxUnavailable: 0 on a 3-replica deployment means the cluster autoscaler can’t drain a node. Result: stuck cluster scaling. Set maxUnavailable: 1; tolerate the brief degradation.

Persistent volume cleanup. Deleting a StatefulSet doesn’t delete the PVCs. After enough iterations, your project’s PD-SSD quota fills up with orphaned volumes. Add a Helm post-uninstall job that cleans up, or use whenDeleted: Delete policy.

Cluster autoscaler choosing the wrong node pool. A pod requesting 8 GB of memory might fit on a CPU pool node; the autoscaler scales the CPU pool. But the pod has a GPU requirement → never schedules. Pre-provision a small GPU pool warm; let the autoscaler scale within it.

Cost containment

GKE for AI workloads costs more than Cloud Run for the same compute. Justification:

For each workload, run the cost analysis: is the state worth the GKE tax? For some workloads (high-volume inference), Cloud Run scaling is dramatically cheaper. For others (long-running agents with state), GKE pays back.

Hybrid is common: Cloud Run for stateless inference; GKE for stateful coordination.

What I’d carry forward

For multi-agent AI workloads on GKE:

The patterns aren’t novel; the cost of getting them wrong is. Test on staging; budget for surprises.

← Back to all posts