Brownlow — vote integrity at broadcast scale
The AFL Brownlow Medal vote count is a televised event in Australia. The Brownlow platform let voters submit during the live broadcast. 100K+ votes, 10K+ concurrent users at peak, all inside a 2-hour window, with vote integrity that has to survive a regulator audit. Here is the architecture that shipped.
The load shape
A normal high-traffic system has a smooth distribution. Brownlow didn’t. The shape was:
- Pre-broadcast: trickle of test traffic, ~10 RPS.
- Broadcast start: ~5,000 RPS over 30 seconds. People who remembered to vote did it now.
- Round-by-round spikes: ~2,000 RPS for the 60 seconds after each round’s votes were revealed.
- Broadcast end: ~8,000 RPS rush at the final reveal.
The service had to scale from idle to 5K RPS in 30 seconds. Cloud Run’s per-instance concurrency and revision-level autoscaling handled this shape better than GKE would have: there is no node-scaling lag, just new container instances coming up.
Why Cloud Run, not GKE
The team had a GKE deployment available; Cloud Run was the explicit choice. Reasons:
- Cold-to-hot speed. Cloud Run can scale to thousands of instances in tens of seconds. GKE needs node provisioning for the same shape and that takes minutes.
- No pod / node management for the on-call team during the broadcast. The team’s attention was on vote integrity, not on Kubernetes resource pressures.
- Per-request billing matched the load shape exactly. We paid for the broadcast window, not for a permanent cluster.
The cost was lock-in to Cloud Run’s runtime model (stateless HTTP/gRPC, no long-lived background processes). For this workload the lock-in was acceptable; the platform was rebuilt every season.
The vote-submission path
The hot path was tight:
voter → CDN → Cloud Run (Go) → KMS sign → Spanner write → CDN cache invalidate
│
└─► (async) Pub/Sub → analytics
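The async branch in the diagram keeps analytics off the vote's critical path. One way to get that property in Go is a bounded in-memory queue that the vote handler writes to without ever blocking. This is a minimal sketch under stated assumptions: `asyncPublisher`, `TryEnqueue`, and the drop-when-full policy are hypothetical, and the `publish` callback stands in for a real Pub/Sub client call.

```go
package main

import (
	"context"
	"fmt"
)

// asyncPublisher buffers analytics events so the vote path never waits on
// the analytics pipeline. (Hypothetical type, for illustration only.)
type asyncPublisher struct {
	queue chan []byte
}

func newAsyncPublisher(buffer int) *asyncPublisher {
	return &asyncPublisher{queue: make(chan []byte, buffer)}
}

// TryEnqueue never blocks. It returns false when the buffer is full, which
// means the analytics event is dropped; the vote itself is unaffected.
func (p *asyncPublisher) TryEnqueue(event []byte) bool {
	select {
	case p.queue <- event:
		return true
	default:
		return false
	}
}

// Run drains the queue until ctx is cancelled, handing each event to publish.
// In a real service publish would wrap the Pub/Sub client.
func (p *asyncPublisher) Run(ctx context.Context, publish func([]byte) error) {
	for {
		select {
		case <-ctx.Done():
			return
		case ev := <-p.queue:
			if err := publish(ev); err != nil {
				fmt.Println("analytics publish failed:", err)
			}
		}
	}
}

func main() {
	p := newAsyncPublisher(1)
	fmt.Println(p.TryEnqueue([]byte("vote-1"))) // true: fits in the buffer
	fmt.Println(p.TryEnqueue([]byte("vote-2"))) // false: buffer full, no consumer running
}
```

Dropping on overflow is a deliberate trade in this sketch: analytics can be rebuilt from the stored vote rows, while a blocked vote handler cannot be recovered during a live broadcast.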
Every step had a budget. Total p95: 280ms. The breakdown:
- CDN: 12ms
- Cloud Run handler: 90ms (most of it the KMS sign)
- Spanner write: 60ms
- CDN invalidate: 20ms
- Wire time + headers: 98ms
The KMS sign was the biggest single component. Each vote was signed with a per-voter ephemeral key derived from a master key in Cloud KMS. The signature was the integrity artefact — proof that the vote was submitted from a verified session, not replayed or forged.
Cloud KMS for vote integrity
The signing key never left KMS. The Go service called Encrypt/Sign via the KMS API; the API returned the signature bytes; the bytes went into the Spanner row.
For audit, the verification flow was:
- Read a vote row from Spanner.
- Extract the signature.
- Call KMS Verify with the original vote payload + signature.
- Verify returns valid / invalid.
The auditor (the AFL’s integrity team) had read access to Spanner and read access to KMS verify. They could independently verify any vote without trusting the platform team.
That separation — the platform team can’t fake votes because they can’t fake signatures; the integrity team can verify without trusting the platform — was the entire point of using KMS.
Security Command Center
SCC ran continuously during the broadcast. The monitored controls:
- API security findings: anything anomalous in the API responses (unexpected error rates, unusual headers).
- IAM anomalies: any IAM change during the broadcast window (there shouldn’t be any).
- Network anomalies: unusual traffic patterns at the load balancer.
- Workload identity drift: Cloud Run service accounts shouldn’t change permissions.
SCC findings during the broadcast went to a dedicated Slack channel staffed by a 2-person security team. Most findings were noise (a Looker dashboard rebuilding, a backup job running). The non-noise findings — there were maybe 3 across the broadcast — were investigated within minutes.
What broke (and what didn’t)
Across the broadcast season, the platform had no integrity incidents. The things that went sideways:
- CDN cache invalidation lag. A revealed-round vote tally was occasionally cached longer than intended; users saw stale tallies for ~5 seconds. Fix: shorter TTL on the tally endpoint, accepted the higher origin load.
- A regional Cloud Run cold start spike. One region had a slow cold-start window during the broadcast start; latency spiked to 800ms for ~30 seconds. The CDN absorbed most user impact; the in-flight votes succeeded with retries. Fix: keep minimum instances warm in each region during the broadcast window.
- A partner analytics consumer fell behind. The Pub/Sub topic backed up; the analytics dashboards lagged real-time by ~10 minutes during peak. No vote impact; the analytics team accepted the lag and added consumers for next season.
The off-season
Between seasons, the platform ran at idle. Cloud Run scaled to zero; Spanner stayed warm with minimal nodes; the KMS keys stayed in place.
Out-of-season cost was trivial; the broadcast windows made up the bulk of the annual bill. The shape matched the business; we didn’t pay for permanent infrastructure when the load was a six-week window per year.
What transfers
Three lessons from running a live-event workload:
- Cold-to-hot scale speed beats sustained capacity for short-window high-throughput workloads. Cloud Run was the right shape; GKE wasn’t.
- Move integrity-critical operations into managed services you can’t bypass. KMS signing made vote forgery impossible for us, not just impolite. The auditor could verify independently. That’s the strongest integrity story.
- Run continuous security monitoring with a small dedicated team during the window. SCC’s findings during steady-state are noisy; during a high-stakes window, the signal-to-noise shifts and the findings matter.
The Brownlow platform was the most-watched piece of software I’ve shipped. The architecture wasn’t novel; the discipline around integrity and scale was the differentiator.