Globe — 30K+ TPS, never lose a transaction
The Globe transaction engine processed payments and adjacent events across multiple telco/FinTech partner integrations. Throughput peaked at 30K+ TPS during settlement windows. The contract was strict: no transaction can be lost, no transaction can be processed twice. Here is the architecture that delivered on it.
The shape
┌──────────────┐ ┌──────────┐ ┌─────────────┐ ┌──────────┐
│ partner APIs │ ─►│ ingest │─► │ Kafka topic │─► │ workers │
│ (REST/gRPC) │ │ service │ │ │ │ (Go pods)│
└──────────────┘ └────┬─────┘ └─────────────┘ └────┬─────┘
│ │
▼ ▼
Redis (idempotency keys) Spanner (ledger)
│
▼
Pub/Sub (downstream events)
Five Go services. Two data stores (Spanner + Redis). One Kafka. One Pub/Sub. K8s for orchestration. About 30K LoC across the services.
Idempotency as the only invariant
Everything else in the system can fail. The one thing that cannot fail is processing a transaction twice. We enforced this in three places:
-
At ingest. Every incoming request carried an
idempotency-keyheader. The ingest service wrote it to Redis with a TTL slightly longer than the partner’s retry window. A duplicate request hit Redis, found the key, and returned the cached response. -
At the worker. Workers consumed Kafka events. Each event had a
transaction_id. The worker wrote the transaction_id to the ledger table with a unique constraint. A duplicate event hit the constraint and the worker treated it as success (already-processed). -
At downstream emit. Outbound events to Pub/Sub also carried a deduplication ID. Pub/Sub’s built-in deduplication caught duplicates within a 10-minute window; the ledger’s audit table caught duplicates outside that window.
Three layers. A bug in any one of them is caught by the next.
Why we paid the Spanner price
The ledger was on Cloud Spanner. Spanner is expensive compared to single-region Postgres. We paid for it because of two properties:
-
External consistency. Spanner’s TrueTime gives globally consistent ordering. A transaction committed in one region is visible everywhere immediately. For a multi-region ledger that partners read, this is the difference between a working system and a system that needs to be explained.
-
Horizontal scale. 30K TPS sustained writes is past the ceiling of any single-instance Postgres we’d reasonably run. Spanner scales horizontally; we added shards by adding nodes.
Spanner’s PK design was a months-long argument. We landed on a
composite (partner_id, bucket, txn_id) where bucket was a
hash of the transaction id mod 32. This distributed writes across
the cluster while keeping per-partner range scans efficient.
DLQ + exponential backoff
When a worker couldn’t process an event, it didn’t crash; it sent the event to a dead-letter queue with the error and a retry count. The DLQ topic fed into a separate consumer that:
- Retried with exponential backoff (1s, 2s, 4s, 8s, …).
- Stopped after 7 retries (covering ~2 hours of partner outage).
- Wrote the final failure to a
failed_transactionstable for operator review.
The operator review flow was a daily report. Most failures were the partner having transient issues; the retries handled them. The ones that landed in the failed_transactions table were typically real partner contract violations (wrong currency, bad account number) that needed a human conversation.
Error-code orchestration
Early in the project, the dispatch was status-code based — every worker had a long switch over HTTP status codes. As partner count grew, the switches grew, and the dispatch logic became unreadable.
We re-architected to error-code driven orchestration. The partner adapters normalised every partner-specific error into an enumerated set:
type ErrorCode int
const (
ErrTransient ErrorCode = iota // retry
ErrInsufficientFunds // notify, no retry
ErrInvalidAccount // DLQ, operator review
ErrPartnerOutage // back off harder
ErrFatal // page
// ... about 20 codes total ...
)
The orchestration layer dispatched on the enum. Adding a new partner became: write the adapter, map its errors to the enum. The orchestration layer didn’t change.
Dual-layer auth
Partners authenticated with two layers:
- API key in a header — identified the partner.
- JWT in a header — identified the request.
Both were required. The API key was a long-lived shared secret (rotated quarterly). The JWT was a short-lived token (15-minute TTL) the partner generated for each request, signed with the partner’s private key.
The dual layer protected against two attack classes:
- Leaked API key — useless without the JWT signing key.
- Leaked JWT — useless after 15 minutes.
The rotation cadence for the API keys was the security-team’s contract. Rotation was a banner in the partner portal and a deprecation window of 30 days; we never had a partner miss a rotation in the windows I was on the project.
PCI-aligned data protection
The transactions carried no card data — partners handled card collection on their side and sent us tokenised references. That significantly reduced our PCI scope; we were only handling the non-card transaction metadata.
What we did carry was protected by:
- Encryption at rest (Spanner’s default, plus column-level KMS for the sensitive metadata fields).
- Encryption in transit (mTLS between every internal service, TLS 1.3 to partners).
- Access logging on every read of the protected columns.
- A quarterly scoping review with the security team to confirm we hadn’t accidentally taken on more PCI scope.
What the on-call experience felt like
Most days, nothing pages. The 30K TPS load is well within the cluster’s headroom; the workers process events with sub-second latency; the DLQ rate sits below 0.01%.
The exceptions were partner outages. A partner goes down; the DLQ fills with their transactions; the operator-review queue grows. The on-call playbook was: check the partner’s status page, confirm the outage is on their end, let the retries continue, and watch the DLQ drain when the partner came back.
About once a quarter, something interesting happened:
- A partner deployed a breaking API change without notice.
- The Kafka cluster lost a broker.
- A Spanner regional outage caused elevated latency.
For each of those, the playbook was: degrade gracefully (DLQ absorbs the burst, retries take over when capacity returns), then incident review afterwards to make the next instance smoother.
What I’d carry forward
Three patterns that earned their keep:
- Idempotency at three layers. Single-layer dedup will eventually fail. Three layers gives you a margin.
- Error codes, not status codes. The orchestration layer becomes maintainable when error semantics are explicit.
- DLQ before retry logic gets clever. The DLQ is your pressure valve. Without it, the cleverness in retry logic becomes a source of bugs.
30K+ TPS is not actually that high in 2026 terms. The architecture that survives it isn’t exotic. The discipline around idempotency, error handling, and graceful degradation is what makes it boring to operate.