GKE for stateful AI workloads — the patterns that survived production

What’s stateful

A multi-agent stack typically has:

Vector index (pgvector / AlloyDB / Pinecone client). Stateful at the DB layer; stateless at the app layer (most of the time).
Chat history (Postgres). Stateful at the DB layer.
Agent memory (Redis or Postgres). State.
Model files (Ollama / vLLM cache). Stateful on disk per node.
Tool registries (in-memory). No persistent state; conventionally rebuilt at startup on each instance.

Three of those (model files, agent memory, sometimes vector index) want StatefulSets and PVCs.

The patterns

StatefulSet for ordered identity. Pod names are predictable (my-agent-0, my-agent-1); each gets its own PVC. Restarting my-agent-0 mounts the same volume — the model cache survives.

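apiVersion: apps/v1
kind: StatefulSet
metadata: { name: ollama }
spec:
  replicas: 3
  serviceName: ollama              # name of the headless Service that governs pod DNS
  selector:                        # required in apps/v1; must match the template labels
    matchLabels: { app: ollama }
  template:
    metadata:
      labels: { app: ollama }
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest   # pin a specific tag in production
        volumeMounts:
        - { name: model-cache, mountPath: /root/.ollama }
  volumeClaimTemplates:            # one PVC per pod: model-cache-ollama-0, -1, -2
  - metadata: { name: model-cache }
    spec:
      accessModes: [ReadWriteOnce]
      resources: { requests: { storage: 100Gi } }
      # pd-ssd is a disk type; this assumes a StorageClass with that name exists.
      # GKE's built-in SSD-backed class is premium-rwo.
      storageClassName: pd-ssd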

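Headless service for direct pod addressing. With a headless Service (clusterIP: None) named my-agent, each pod is reachable at a stable DNS name such as my-agent-0.my-agent.<namespace>.svc.cluster.local. Useful for sticky routing to a specific instance.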
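
A minimal sketch of the headless Service that would sit in front of the ollama StatefulSet above. It assumes the pods carry an app: ollama label and listen on Ollama's default port 11434; both are assumptions, so adjust them to your manifest.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama            # must equal the StatefulSet's serviceName
spec:
  clusterIP: None         # headless: DNS returns pod IPs instead of a virtual IP
  selector:
    app: ollama           # assumed pod label
  ports:
  - name: http
    port: 11434           # assumed: Ollama's default listen port
```

With this in place, ollama-0 in the default namespace resolves at ollama-0.ollama.default.svc.cluster.local.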

Gateway API for traffic management. Replaces Ingress with a typed, role-separated API. The platform team manages the Gateway; application teams manage HTTPRoutes against it.
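
To make the role split concrete, here is a sketch of a Gateway owned by a platform namespace and an HTTPRoute owned by an application namespace. The names (ai-gateway, agent-route, agent-api), the namespaces, the hostname, and the port are all hypothetical; gke-l7-global-external-managed is one of the GatewayClasses GKE ships, and another class may fit your setup better.

```yaml
# Platform team: owns the load balancer and decides who may attach routes.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ai-gateway
  namespace: platform
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: All          # tighten to a label selector in real clusters
---
# Application team: owns routing rules for its own Service only.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: agent-route
  namespace: agents
spec:
  parentRefs:
  - name: ai-gateway
    namespace: platform
  hostnames: ["agents.example.com"]
  rules:
  - matches:
    - path: { type: PathPrefix, value: /chat }
    backendRefs:
    - name: agent-api      # hypothetical Service in the agents namespace
      port: 8080
```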

GPU node pools, taint + tolerate. AI workloads need GPUs; not every pod needs them. Dedicate GPU nodes to GPU pods via taints; everything else lands on CPU pools.
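
A sketch of the pod side of that arrangement, assuming a GPU node pool tainted with the key nvidia.com/gpu (GKE typically applies nvidia.com/gpu=present:NoSchedule to GPU nodes) and using nvidia-l4 as an example accelerator type. The pod name and accelerator choice are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference       # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4   # assumed accelerator type
  tolerations:
  - key: nvidia.com/gpu     # tolerate the GPU pool's taint
    operator: Exists
    effect: NoSchedule
  containers:
  - name: server
    image: ollama/ollama:latest
    resources:
      limits:
        nvidia.com/gpu: 1   # explicit GPU request; the scheduler and autoscaler both read this
```

Pods without the toleration are repelled by the taint and land on CPU pools, which is the point of the split.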

What broke

PodDisruptionBudgets too strict. Setting maxUnavailable: 0 on a 3-replica workload blocks every voluntary eviction, so the cluster autoscaler can’t drain a node. Result: scale-down stalls, and node drains during upgrades wait on the same budget. Set maxUnavailable: 1; tolerate the brief degradation.
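
A minimal PodDisruptionBudget that allows one pod to be evicted at a time. The name ollama-pdb and the app: ollama label are assumptions; the selector must match whatever labels your pods actually carry.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ollama-pdb          # hypothetical name
spec:
  maxUnavailable: 1         # lets a node drain proceed one pod at a time
  selector:
    matchLabels:
      app: ollama           # assumed pod label
```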

Persistent volume cleanup. Deleting a StatefulSet doesn’t delete its PVCs. After enough iterations, your project’s SSD persistent disk quota fills up with orphaned volumes. Add a Helm post-delete hook job that cleans them up, or set persistentVolumeClaimRetentionPolicy.whenDeleted: Delete on the StatefulSet.
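
The retention policy is a field on the StatefulSet spec. The fragment below is a sketch against the ollama StatefulSet from earlier; confirm that your cluster's Kubernetes version supports the field before relying on it.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: ollama }
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete     # remove the PVCs when the StatefulSet is deleted
    whenScaled: Retain      # keep PVCs on scale-down (the default behavior)
  # ...replicas, serviceName, selector, template, volumeClaimTemplates as before
```

Whether the underlying disk disappears along with the PVC still depends on the StorageClass reclaimPolicy; with Retain, the disks stay and keep counting against quota.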

Cluster autoscaler choosing the wrong node pool. A pod requesting 8 GB of memory might fit on a CPU pool node, so the autoscaler scales the CPU pool. But the pod also needs a GPU, so it never schedules. Declare the GPU requirement explicitly in the pod spec (an nvidia.com/gpu limit plus a toleration for the GPU taint), pre-provision a small warm GPU pool, and let the autoscaler scale within it.

Cost containment

GKE for AI workloads costs more than Cloud Run for the same compute. Justification:

Persistent volumes for model caches (Cloud Run is stateless).
Custom networking (Cloud Run is fronted by Google’s edge).
GPU support (Cloud Run GPU is newer and more limited).
Predictable per-instance state (Cloud Run scales to zero).

For each workload, run the cost analysis: is the state worth the GKE tax? For some workloads (high-volume inference), Cloud Run scaling is dramatically cheaper. For others (long-running agents with state), GKE pays back.

Hybrid is common: Cloud Run for stateless inference; GKE for stateful coordination.

What I’d carry forward

For multi-agent AI workloads on GKE:

StatefulSets for anything with disk-resident state.
Gateway API over Ingress for new deployments.
GPU node pools dedicated with taints.
Cluster autoscaler with PDBs that allow at least one disruption.
Cost analysis vs Cloud Run per workload; hybrid is normal.

The patterns aren’t novel; the cost of getting them wrong is. Test on staging; budget for surprises.