Brownlow — vote integrity at broadcast scale

The AFL Brownlow Medal vote count is a televised event in Australia. The Brownlow platform let voters submit during the live broadcast: 100K+ votes, 10K+ concurrent users at peak, all inside a 2-hour window, with vote integrity that had to survive a regulator audit. Here is the architecture that shipped.

The load shape

A typical high-traffic system sees load rise and fall smoothly. Brownlow’s didn’t. The shape was:

Pre-broadcast: trickle of test traffic, ~10 RPS.
Broadcast start: a ramp to ~5,000 RPS within 30 seconds. People who remembered to vote did it now.
Round-by-round spikes: ~2,000 RPS for the 60 seconds after each round’s votes were revealed.
Broadcast end: ~8,000 RPS rush at the final reveal.

The service had to scale from idle to 5K RPS in 30 seconds. Cloud Run’s per-instance concurrency and revision-level autoscaling handled this shape better than GKE would have: no node-scaling lag, just new container instances coming up.
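As a rough sanity check on that shape, Little’s law gives the number of in-flight requests as arrival rate times time in service, and dividing by per-instance concurrency gives a floor on instance count. The sketch below assumes 80 concurrent requests per instance (Cloud Run’s default limit) and about 150ms in service per vote; both figures are illustrative assumptions, not settings reported for the platform.

```go
package main

import (
	"fmt"
	"math"
)

// phase is one segment of the broadcast load shape described above.
type phase struct {
	name string
	rps  float64
}

// instancesNeeded applies Little's law (in-flight = arrival rate x time in
// service), then divides by per-instance concurrency and rounds up.
func instancesNeeded(rps, serviceSeconds float64, concurrency int) int {
	inFlight := rps * serviceSeconds
	return int(math.Ceil(inFlight / float64(concurrency)))
}

func main() {
	const (
		serviceSeconds = 0.150 // assumed in-service time per vote
		concurrency    = 80    // assumed; Cloud Run's default per-instance limit
	)
	phases := []phase{
		{"pre-broadcast", 10},
		{"broadcast start", 5000},
		{"round reveal", 2000},
		{"final reveal", 8000},
	}
	for _, p := range phases {
		n := instancesNeeded(p.rps, serviceSeconds, concurrency)
		fmt.Printf("%-16s %5.0f RPS -> at least %d instance(s)\n", p.name, p.rps, n)
	}
}
```

The result is a lower bound only. It ignores cold starts, tail latency, and uneven regional traffic, so the useful signal is the size of the jump between phases rather than the absolute counts.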

Why Cloud Run, not GKE

The team had a GKE deployment available; Cloud Run was the explicit choice. Reasons:

Cold-to-hot speed. Cloud Run can scale to thousands of instances in tens of seconds. GKE needs node provisioning for the same shape and that takes minutes.
No pod or node management for the on-call team during the broadcast. The team’s attention was on vote integrity, not on Kubernetes resource pressure.
Per-request billing matched the load shape exactly. We paid for the broadcast window, not for a permanent cluster.

The cost was lock-in to Cloud Run’s runtime model (stateless HTTP/gRPC, no long-lived background processes). For this workload the lock-in was acceptable; the platform was rebuilt every season.

The vote-submission path

The hot path was tight:

voter → CDN → Cloud Run (Go) → KMS sign → Spanner write → CDN cache invalidate
                  │
                  └─► (async) Pub/Sub → analytics
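The ordering in the diagram can be sketched as a Go function that runs sign, write, and invalidate in sequence and hands the analytics event to a buffered channel so the voter never waits on it. Every name here (Signer, Store, Cache, submitVote, recorder) is hypothetical and stands in for the KMS, Spanner, CDN, and Pub/Sub clients. Dropping the analytics event when the buffer is full is an assumption of this sketch, not a described behaviour of the shipped handler.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the managed services on the hot path.
type Signer interface {
	Sign(payload []byte) ([]byte, error)
}
type Store interface {
	Write(payload, sig []byte) error
}
type Cache interface {
	Invalidate(key string) error
}

// submitVote runs the synchronous steps in order and stops at the first
// failure. Analytics is best-effort: if the buffer is full the event is
// dropped rather than blocking the voter.
func submitVote(payload []byte, s Signer, st Store, c Cache, analytics chan<- []byte) error {
	sig, err := s.Sign(payload)
	if err != nil {
		return fmt.Errorf("sign: %w", err)
	}
	if err := st.Write(payload, sig); err != nil {
		return fmt.Errorf("write: %w", err)
	}
	if err := c.Invalidate("tally"); err != nil {
		return fmt.Errorf("invalidate: %w", err)
	}
	select {
	case analytics <- payload:
	default: // buffer full: skip analytics, keep the vote path fast
	}
	return nil
}

// recorder is an in-memory fake that logs the order of calls.
type recorder struct {
	steps  []string
	failAt string // name of the step that should fail, if any
}

func (r *recorder) record(step string) error {
	r.steps = append(r.steps, step)
	if step == r.failAt {
		return errors.New(step + " failed")
	}
	return nil
}

func (r *recorder) Sign(p []byte) ([]byte, error) { return []byte("sig"), r.record("sign") }
func (r *recorder) Write(p, sig []byte) error     { return r.record("write") }
func (r *recorder) Invalidate(key string) error   { return r.record("invalidate") }

func main() {
	r := &recorder{}
	analytics := make(chan []byte, 1)
	err := submitVote([]byte("voter=v1|player=p7|votes=3"), r, r, r, analytics)
	fmt.Println(r.steps, err, "queued for analytics:", len(analytics))
}
```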

Every step had a budget. Total p95: 280ms. The breakdown:

CDN: 12ms
Cloud Run handler: 90ms (most of it the KMS sign)
Spanner write: 60ms
CDN invalidate: 20ms
Wire time + headers: 98ms
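The budget arithmetic is easy to check mechanically. The sketch below sums the per-step figures above and flags any step whose measured latency exceeds its allowance. The `step` type, the `overBudget` helper, and the sample measurements are illustrative, and summing per-step p95 values only approximates an end-to-end p95.

```go
package main

import "fmt"

// step is one stage of the hot path with its latency allowance.
type step struct {
	name     string
	budgetMs int
}

// budget mirrors the breakdown listed above.
var budget = []step{
	{"CDN", 12},
	{"Cloud Run handler", 90},
	{"Spanner write", 60},
	{"CDN invalidate", 20},
	{"Wire time + headers", 98},
}

// totalMs adds up the per-step allowances.
func totalMs(steps []step) int {
	sum := 0
	for _, s := range steps {
		sum += s.budgetMs
	}
	return sum
}

// overBudget returns, in path order, the steps whose measured latency
// exceeds their allowance. Steps without a measurement are skipped.
func overBudget(steps []step, measuredMs map[string]int) []string {
	var over []string
	for _, s := range steps {
		if m, ok := measuredMs[s.name]; ok && m > s.budgetMs {
			over = append(over, s.name)
		}
	}
	return over
}

func main() {
	fmt.Println("total budget (ms):", totalMs(budget))
	// Made-up sample measurements, for illustration only.
	measured := map[string]int{"Cloud Run handler": 85, "Spanner write": 75}
	fmt.Println("over budget:", overBudget(budget, measured))
}
```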

The KMS sign was the largest server-side component; only wire time was bigger. Each vote was signed with a per-voter ephemeral key derived from a master key in Cloud KMS. The signature was the integrity artefact: proof that the vote was submitted from a verified session, not replayed or forged.

Cloud KMS for vote integrity

The signing key never left KMS. The Go service sent the vote payload to the KMS signing API; the API returned the signature bytes; the bytes went into the Spanner row alongside the vote.
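A minimal sketch of that sign-then-store step, under loud assumptions: the `Signer` interface, the `VoteRow` fields, and the `canonicalPayload` encoding are all hypothetical, and `localSigner` is an in-process Ed25519 stand-in so the example runs without cloud credentials. In a real service the implementation behind `Signer` would call Cloud KMS and hold no key material.

```go
package main

import (
	"crypto/ed25519"
	"fmt"
)

// Signer abstracts the remote signing call; the caller never sees the key.
type Signer interface {
	Sign(payload []byte) ([]byte, error)
}

// VoteRow is a hypothetical shape for the stored row.
type VoteRow struct {
	VoterID   string
	PlayerID  string
	Round     int
	Votes     int
	Signature []byte
}

// canonicalPayload fixes field order and separators so the signer and any
// later verifier see identical bytes. Assumes IDs never contain '|'.
func canonicalPayload(r VoteRow) []byte {
	return []byte(fmt.Sprintf("voter=%s|round=%d|player=%s|votes=%d",
		r.VoterID, r.Round, r.PlayerID, r.Votes))
}

// signVote returns a copy of the row with the signature filled in.
func signVote(s Signer, r VoteRow) (VoteRow, error) {
	sig, err := s.Sign(canonicalPayload(r))
	if err != nil {
		return VoteRow{}, fmt.Errorf("sign vote: %w", err)
	}
	r.Signature = sig
	return r, nil
}

// localSigner is a local stand-in for KMS (assumption, not the real client).
type localSigner struct{ key ed25519.PrivateKey }

func (s localSigner) Sign(p []byte) ([]byte, error) {
	return ed25519.Sign(s.key, p), nil
}

// newDemoSigner uses a fixed all-zero seed: acceptable for a demo only.
func newDemoSigner() Signer {
	seed := make([]byte, ed25519.SeedSize)
	return localSigner{key: ed25519.NewKeyFromSeed(seed)}
}

func main() {
	row, err := signVote(newDemoSigner(), VoteRow{VoterID: "v1", Round: 3, PlayerID: "p7", Votes: 3})
	if err != nil {
		panic(err)
	}
	fmt.Printf("payload=%s signature=%d bytes\n", canonicalPayload(row), len(row.Signature))
}
```

Keeping the signer behind an interface means the handler code is the same whether the signature comes from a remote service or a test double.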

For audit, the verification flow was:

Read a vote row from Spanner.
Extract the signature.
Call KMS Verify with the original vote payload + signature.
Verify returns valid / invalid.
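The four steps above can be sketched as an audit loop. `Verifier` is a hypothetical interface for the remote verify call, `StoredVote` is an assumed row shape, and `publicKeyVerifier` is a local Ed25519 stand-in used only so the sketch runs offline; it is not the Cloud KMS client.

```go
package main

import (
	"crypto/ed25519"
	"fmt"
)

// Verifier abstracts the verify call made by the auditor.
type Verifier interface {
	Verify(payload, sig []byte) bool
}

// StoredVote is an assumed shape for a row read back from the database.
type StoredVote struct {
	ID        string
	Payload   []byte
	Signature []byte
}

// audit returns the IDs of rows whose signature does not match the payload.
func audit(v Verifier, rows []StoredVote) []string {
	var invalid []string
	for _, r := range rows {
		if !v.Verify(r.Payload, r.Signature) {
			invalid = append(invalid, r.ID)
		}
	}
	return invalid
}

// publicKeyVerifier is a local stand-in for the remote verify call.
type publicKeyVerifier struct{ pub ed25519.PublicKey }

func (p publicKeyVerifier) Verify(payload, sig []byte) bool {
	return ed25519.Verify(p.pub, payload, sig)
}

// demoRows signs two rows with a throwaway key, then alters the second
// payload after signing to simulate tampering.
func demoRows() (Verifier, []StoredVote) {
	priv := ed25519.NewKeyFromSeed(make([]byte, ed25519.SeedSize)) // demo key only
	pub := priv.Public().(ed25519.PublicKey)
	sign := func(id, payload string) StoredVote {
		return StoredVote{ID: id, Payload: []byte(payload), Signature: ed25519.Sign(priv, []byte(payload))}
	}
	rows := []StoredVote{
		sign("vote-1", "voter=v1|player=p7|votes=3"),
		sign("vote-2", "voter=v2|player=p9|votes=1"),
	}
	rows[1].Payload = []byte("voter=v2|player=p9|votes=3") // tampered after signing
	return publicKeyVerifier{pub: pub}, rows
}

func main() {
	v, rows := demoRows()
	fmt.Println("invalid rows:", audit(v, rows))
}
```

The auditor’s code path only ever needs the verify capability, which is what lets it run under separate credentials from the signing service.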

The auditor (the AFL’s integrity team) had read access to Spanner and verify-only access to the KMS key. They could independently verify any vote without trusting the platform team.

That separation — the platform team can’t fake votes because they can’t fake signatures; the integrity team can verify without trusting the platform — was the entire point of using KMS.

Security Command Center

SCC ran continuously during the broadcast. The monitored controls:

API security findings: anything anomalous in the API responses (unexpected error rates, unusual headers).
IAM anomalies: any IAM change during the broadcast window (there shouldn’t be any).
Network anomalies: unusual traffic patterns at the load balancer.
Workload identity drift: Cloud Run service accounts shouldn’t change permissions.

SCC findings during the broadcast went to a dedicated Slack channel staffed by a 2-person security team. Most findings were noise (a Looker dashboard rebuilding, a backup job running). The non-noise findings — there were maybe 3 across the broadcast — were investigated within minutes.
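One way to picture the triage described above is a small rule function: an IAM change inside the broadcast window always escalates, findings from known-noisy sources are set aside, and everything else gets a human look. The `Finding` and `Window` types, the category strings, and the source names are hypothetical and do not reflect the SCC finding schema or the team’s actual rules.

```go
package main

import (
	"fmt"
	"time"
)

// Finding is a simplified, hypothetical view of a security finding.
type Finding struct {
	Category string // e.g. "iam", "network", "api"
	Source   string // resource or job that triggered it
	At       time.Time
}

// Window is the broadcast window, start inclusive and end exclusive.
type Window struct{ Start, End time.Time }

func (w Window) Contains(t time.Time) bool {
	return !t.Before(w.Start) && t.Before(w.End)
}

// triage returns "escalate", "noise", or "review". The IAM rule is checked
// first so that a noisy source cannot mask an in-window IAM change.
func triage(f Finding, w Window, knownNoisy map[string]bool) string {
	switch {
	case f.Category == "iam" && w.Contains(f.At):
		return "escalate"
	case knownNoisy[f.Source]:
		return "noise"
	default:
		return "review"
	}
}

// demoWindow is a 2-hour window on an arbitrary date.
func demoWindow() Window {
	start := time.Date(2024, time.January, 1, 19, 0, 0, 0, time.UTC)
	return Window{Start: start, End: start.Add(2 * time.Hour)}
}

// minutesIn returns the instant m minutes after the window opens.
func minutesIn(w Window, m int) time.Time {
	return w.Start.Add(time.Duration(m) * time.Minute)
}

func main() {
	w := demoWindow()
	noisy := map[string]bool{"backup-job": true, "dashboard-rebuild": true}
	findings := []Finding{
		{"network", "backup-job", minutesIn(w, 10)},
		{"iam", "service-account", minutesIn(w, 45)},
		{"api", "vote-endpoint", minutesIn(w, 90)},
	}
	for _, f := range findings {
		fmt.Printf("%-8s %-16s -> %s\n", f.Category, f.Source, triage(f, w, noisy))
	}
}
```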

What broke (and what didn’t)

Across the broadcast season, the platform had no integrity incidents. The things that went sideways:

CDN cache invalidation lag. A revealed-round vote tally was occasionally cached longer than intended; users saw stale tallies for ~5 seconds. Fix: a shorter TTL on the tally endpoint, accepting the higher origin load.
A regional Cloud Run cold start spike. One region had a slow cold-start window during the broadcast start; latency spiked to 800ms for ~30 seconds. The CDN absorbed most user impact; the in-flight votes succeeded with retries. Fix: keep minimum instances warm in each region during the broadcast window.
A partner analytics consumer fell behind. The Pub/Sub subscription backlog grew; the analytics dashboards lagged real time by ~10 minutes during peak. No vote impact; the analytics team accepted the lag and added consumers for next season.
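The cold-start item mentions in-flight votes succeeding on retry. Below is a sketch of what a client-side retry with exponential backoff might look like. The attempt count, base delay, and the `submitWithRetry` and `flakySubmit` names are assumptions; the write-up does not say how its retries were implemented. Retrying a write safely also assumes the server deduplicates repeated submissions, which is not described here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// submitWithRetry calls submit up to maxAttempts times, doubling the delay
// after each failure. sleep is injected so tests can run without waiting.
// It returns the number of attempts made and the final error, if any.
func submitWithRetry(submit func() error, maxAttempts int, base time.Duration, sleep func(time.Duration)) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	delay := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = submit(); err == nil {
			return attempt, nil
		}
		if attempt < maxAttempts {
			sleep(delay)
			delay *= 2
		}
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

// flakySubmit simulates a backend that fails the first n calls.
func flakySubmit(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("temporary failure from a cold instance")
		}
		return nil
	}
}

// noSleep skips the wait entirely; useful in tests.
func noSleep(time.Duration) {}

func main() {
	var waits []time.Duration
	record := func(d time.Duration) { waits = append(waits, d) }
	attempts, err := submitWithRetry(flakySubmit(2), 5, 100*time.Millisecond, record)
	fmt.Println("attempts:", attempts, "err:", err, "waits:", waits)
}
```

A production client would normally add random jitter to the delay so that thousands of voters retrying at once do not arrive in synchronized waves.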

The off-season

Between seasons, the platform ran at idle. Cloud Run scaled to zero; Spanner stayed warm with minimal nodes; the KMS keys stayed in place.

Off-season cost was trivial; the broadcast windows made up the bulk of the annual bill. The shape matched the business: we didn’t pay for permanent infrastructure when the load was a six-week window per year.

What transfers

Three lessons from running a live-event workload:

Cold-to-hot scale speed beats sustained capacity for short-window high-throughput workloads. Cloud Run was the right shape; GKE wasn’t.
Move integrity-critical operations into managed services you can’t bypass. KMS signing made vote forgery impossible for us, not just impolite. The auditor could verify independently. That’s the strongest integrity story.
Run continuous security monitoring with a small dedicated team during the window. SCC’s findings during steady-state are noisy; during a high-stakes window, the signal-to-noise shifts and the findings matter.

The Brownlow platform was the most-watched piece of software I’ve shipped. The architecture wasn’t novel; the discipline around integrity and scale was the differentiator.