Idempotency at three layers
Processing a transaction once is a hard problem. Processing a transaction exactly once is impossible without coordinating with the partner. Processing a transaction at most once on top of at-least-once delivery is the achievable shape. Here is the three-layer idempotency pattern that ran 30K+ TPS at Globe.
Why one layer isn’t enough
A distributed system that says “we deduplicate” usually means “we deduplicate in one place.” That place is a single point of failure for the property you care about most.
Examples:
- Dedup at the ingest gateway. A failover replicates only partial state, and the next instance accepts a duplicate.
- Dedup at the database via unique constraint. A retry before the first transaction commits inserts a second row before the constraint fires.
- Dedup at the message queue. The queue’s deduplication window expires, and a late retry slips through.
Each layer covers a class of failure. The classes don’t overlap completely. The interesting failures are the ones that escape one layer and would have been caught by the next, if there were a next.
The three layers
┌──────────────────────┐
│ Layer 1: Ingest      │  Redis key with TTL
│                      │  Covers: client retries
└──────────────────────┘
           │
           ▼
┌──────────────────────┐
│ Layer 2: Worker      │  Spanner unique constraint
│                      │  Covers: queue redelivery
└──────────────────────┘
           │
           ▼
┌──────────────────────┐
│ Layer 3: Emit        │  Outbox + Pub/Sub dedup ID
│                      │  Covers: downstream replay
└──────────────────────┘
Each layer uses a different mechanism. A bug in one mechanism doesn’t affect the others. To the extent the layers fail independently, the probability of a duplicate escaping all three is the product of the per-layer failure probabilities, which is much smaller than the failure probability of any single layer.
Layer 1 — Redis at ingest
The ingest service receives an HTTP request with an Idempotency-Key header. The pattern:
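// Sketch of the ingest handler. rdb is assumed to be a go-redis client and
// process the business logic. Imports: context, log, net/http, strings, time.
func handle(ctx context.Context, rdb *redis.Client, r *http.Request) (int, string) {
    key := r.Header.Get("Idempotency-Key")
    if key == "" {
        return http.StatusBadRequest, "idempotency-key required"
    }
    redisKey := "idem:" + key
    // SETNX with TTL: atomic check-and-set
    ok, err := rdb.SetNX(ctx, redisKey, "in-flight", 1*time.Hour).Result()
    if err != nil {
        return http.StatusInternalServerError, err.Error()
    }
    if !ok {
        // Key exists; this is a duplicate.
        // Two cases: "in-flight" or "completed: <response>"
        val, err := rdb.Get(ctx, redisKey).Result()
        if err != nil {
            // Also covers the key expiring between SETNX and GET.
            return http.StatusInternalServerError, err.Error()
        }
        if val == "in-flight" {
            return http.StatusConflict, "duplicate request, original still processing"
        }
        // Strip the marker so the caller sees only the cached response.
        return http.StatusOK, strings.TrimPrefix(val, "completed: ")
    }
    // Process the request
    resp := process(r)
    // Update the cache with the completed response
    if err := rdb.Set(ctx, redisKey, "completed: "+resp, 24*time.Hour).Err(); err != nil {
        log.Printf("idempotency cache update failed for %s: %v", redisKey, err)
    }
    return http.StatusOK, resp
}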
The TTL matters. Too short and partner retries leak through (with a one-hour TTL, a retry that arrives two hours after the original finds no key and is processed again). Too long and Redis fills with stale keys. We landed on 24h after the response is cached, matching the partner’s retry policy.
The in-flight marker handles concurrent duplicates: two copies arriving at the same moment. One wins the SETNX; the other sees “in-flight” and gets a 409. The partner retries, and once the original has completed the retry gets the cached response.
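To make the in-flight behaviour concrete, here is a minimal sketch that replays the sequence against an in-memory map instead of Redis. Everything in it (memStore, handle, the key "k1") is a hypothetical stand-in, TTLs are left out, and the "concurrent" duplicate is simulated by having it arrive while the original is still inside its processing callback.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// memStore is a hypothetical in-memory stand-in for Redis (no TTLs).
type memStore struct {
	mu sync.Mutex
	m  map[string]string
}

// setNX stores val only if key is absent and reports whether it did.
func (s *memStore) setNX(key, val string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.m[key]; exists {
		return false
	}
	s.m[key] = val
	return true
}

func (s *memStore) get(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.m[key]
}

func (s *memStore) set(key, val string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = val
}

// handle mirrors the ingest flow: claim the key, or report the duplicate.
func handle(s *memStore, key string, process func() string) (int, string) {
	if !s.setNX(key, "in-flight") {
		if v := s.get(key); v != "in-flight" {
			return 200, strings.TrimPrefix(v, "completed: ")
		}
		return 409, "duplicate request, original still processing"
	}
	resp := process()
	s.set(key, "completed: "+resp)
	return 200, resp
}

func main() {
	s := &memStore{m: map[string]string{}}
	processed := 0
	var dupStatus int

	// The duplicate arrives while the original is still being processed.
	status, body := handle(s, "k1", func() string {
		processed++
		dupStatus, _ = handle(s, "k1", func() string { processed++; return "unused" })
		return "txn accepted"
	})
	// The partner retries after the original has completed.
	retryStatus, retryBody := handle(s, "k1", func() string { processed++; return "unused" })

	fmt.Println(status, body, dupStatus, retryStatus, retryBody, processed)
	// prints: 200 txn accepted 409 200 txn accepted 1
}
```

The request is processed once; the overlapping copy gets the 409 and the later retry gets the cached body.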
Layer 2 — Spanner unique constraint
The worker consumes Kafka events. Each event carries a transaction_id that was set at ingest. The worker writes the transaction to the ledger:
INSERT INTO ledger (txn_id, partner_id, amount, status, created_at)
VALUES (@txn_id, @partner, @amount, 'pending', CURRENT_TIMESTAMP());
txn_id has a unique constraint (in Spanner, typically enforced by making it the primary key or by a unique index). If the event is a redelivery from Kafka (which can happen, because Kafka is at-least-once), the second INSERT hits the constraint and fails.
The worker treats the constraint violation as success — the transaction is already recorded; the worker’s job is done. It commits the Kafka offset and moves on.
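A minimal sketch of that decision, assuming an in-memory ledger in place of Spanner. The names here (ledger, errAlreadyExists, handleEvent) are hypothetical; with a real database client the duplicate check would be whatever that client reports for a uniqueness violation.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a hypothetical stand-in for the database's
// uniqueness-violation error.
var errAlreadyExists = errors.New("row already exists")

// ledger is an in-memory stand-in for the ledger table, keyed by txn_id.
type ledger struct{ rows map[string]int64 }

func (l *ledger) insert(txnID string, amount int64) error {
	if _, dup := l.rows[txnID]; dup {
		return errAlreadyExists
	}
	l.rows[txnID] = amount
	return nil
}

type event struct {
	TxnID  string
	Amount int64
}

// handleEvent reports whether the consumer may commit the offset.
func handleEvent(l *ledger, ev event) (commit bool, err error) {
	err = l.insert(ev.TxnID, ev.Amount)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, errAlreadyExists):
		// Redelivery: the transaction is already recorded, so this is success.
		return true, nil
	default:
		// Any other failure: leave the offset uncommitted so the event is retried.
		return false, err
	}
}

func main() {
	l := &ledger{rows: map[string]int64{}}
	events := []event{{"txn-1", 500}, {"txn-1", 500}, {"txn-2", 75}}
	for _, ev := range events {
		commit, err := handleEvent(l, ev)
		fmt.Println(ev.TxnID, commit, err)
	}
	fmt.Println("rows:", len(l.rows)) // prints: rows: 2
}
```

The important part is the second case: only the uniqueness violation is swallowed. Any other error still blocks the offset commit, otherwise the layer would silently drop transactions instead of deduplicating them.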
This is the layer that catches Kafka redeliveries, which happen more often than you’d think. Consumer rebalances, broker failures, long-running message processing — all produce occasional redelivery.
Layer 3 — outbox + Pub/Sub dedup ID
When the worker emits a downstream event (notification to the partner, accounting feed, settlement trigger), it doesn’t publish directly to Pub/Sub. It writes to an outbox table in the same Spanner transaction as the ledger write:
INSERT INTO outbox (event_id, type, payload, status)
VALUES (@event_id, 'transaction.completed', @payload, 'pending');
A separate process (the outbox dispatcher) reads from outbox and publishes to Pub/Sub, marking the row as published on success.
Pub/Sub messages carry an ordering_key (for ordering) and a message_id (deduped by Pub/Sub within a 10-minute window). The combination means:
- Within 10 minutes, Pub/Sub itself dedupes redelivered messages.
- Beyond 10 minutes, the outbox row’s status = 'published' flag prevents the dispatcher from publishing again.
A downstream consumer sees a Pub/Sub message at most once per event_id within the system’s lifetime.
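The dispatcher loop and the crash case it has to survive can be sketched with in-memory stand-ins. Nothing here is the Pub/Sub client API: broker is a hypothetical bus that drops event IDs it has already seen (modelling the dedup window, without expiry), and crashAfterPublish simulates the dispatcher dying between the publish and the status update.

```go
package main

import "fmt"

// outboxRow mirrors the outbox table columns used in this sketch.
type outboxRow struct {
	EventID string
	Payload string
	Status  string // "pending" or "published"
}

// broker is a hypothetical stand-in for the message bus. It drops any
// event ID it has already seen; expiry of the dedup window is not modelled.
type broker struct {
	seen      map[string]bool
	delivered []string
}

func (b *broker) publish(eventID, payload string) error {
	if b.seen[eventID] {
		return nil // duplicate: accepted by the bus but not delivered again
	}
	b.seen[eventID] = true
	b.delivered = append(b.delivered, eventID)
	return nil
}

// dispatch publishes every pending row and then marks it published.
// crashAfterPublish simulates dying between those two steps.
func dispatch(rows []outboxRow, b *broker, crashAfterPublish bool) error {
	for i := range rows {
		if rows[i].Status != "pending" {
			continue
		}
		if err := b.publish(rows[i].EventID, rows[i].Payload); err != nil {
			return err
		}
		if crashAfterPublish {
			return fmt.Errorf("simulated crash before marking %s", rows[i].EventID)
		}
		rows[i].Status = "published"
	}
	return nil
}

func main() {
	rows := []outboxRow{
		{"evt-1", "{}", "pending"},
		{"evt-2", "{}", "pending"},
	}
	b := &broker{seen: map[string]bool{}}

	_ = dispatch(rows, b, true)  // publishes evt-1, then "crashes"
	_ = dispatch(rows, b, false) // republishes evt-1 (deduped), then evt-2
	_ = dispatch(rows, b, false) // nothing pending, nothing published

	fmt.Println(b.delivered, rows[0].Status, rows[1].Status)
	// prints: [evt-1 evt-2] published published
}
```

Publishing before marking is deliberate: a crash between the two steps produces a duplicate publish, which the dedup ID can absorb. Marking first would instead risk losing the event.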
What each layer catches
| Failure mode | Caught by |
|---|---|
| Partner retries within the same hour | Layer 1 (Redis) |
| Partner retries hours later, same key | Layer 1 (Redis, longer TTL) |
| Two concurrent partner requests with same key | Layer 1 (SETNX + in-flight marker) |
| Kafka redelivers an event the worker already saw | Layer 2 (Spanner unique constraint) |
| Worker crashes after Spanner write but before Kafka commit | Layer 2 (next worker re-processes, constraint fires) |
| Outbox dispatcher crashes after Pub/Sub publish but before marking row published | Layer 3 (Pub/Sub dedup window) |
| Downstream consumer replays an old offset | Layer 3 (downstream’s own idempotency) |
What’s still not idempotent
The pattern doesn’t guarantee exactly-once. It guarantees at-most-once per layer, with three layers. A duplicate has to escape all three; in practice, none does.
What the pattern can’t fix:
- A partner that sends two genuinely different requests with the same idempotency-key. The system treats them as duplicates, which is the partner’s bug, not ours.
- A delay between a request and its retry long enough that the Redis TTL has expired and the outbox row has been cleaned up. We retain the outbox indefinitely (cheap) to prevent this.
- An idempotency-key collision (two unrelated partners coincidentally generate the same UUID). Vanishingly unlikely; we scope the key to (partner_id, key) to make it impossible.
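The (partner_id, key) scoping from the last bullet can be sketched as a key-building helper. The function name and the length prefix are my own assumptions, not the original code; the prefix is only there so that a separator character inside a partner ID cannot make two different pairs produce the same string.

```go
package main

import "fmt"

// scopedKey builds the dedup key for a (partner_id, key) pair.
// Hypothetical helper: the length prefix keeps ("a:b", "c") and
// ("a", "b:c") from mapping to the same string.
func scopedKey(partnerID, key string) string {
	return fmt.Sprintf("idem:%d:%s:%s", len(partnerID), partnerID, key)
}

func main() {
	fmt.Println(scopedKey("partner-a", "123e4567")) // idem:9:partner-a:123e4567
	fmt.Println(scopedKey("partner-b", "123e4567")) // idem:9:partner-b:123e4567
}
```

Two partners that generate the same key value now land on different entries, so the collision case cannot occur across partners.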
What the team agreed on
A small contract that every service in the system honoured:
- Every operation that mutates state takes an idempotency key.
- Every operation that emits an event includes a dedup ID.
- No service skips a layer for performance.
The “no skipping” rule is the boring one and the load-bearing one. Once you let one service skip Redis “because it’s fast enough,” you’ll find a duplicate three months later and the post-mortem will name that decision.
Where this transfers
The pattern isn’t payment-specific. Any system where exactly-once processing of a thing matters benefits from the same shape:
- Email sending (idempotency at queue-in, send-out, audit-row).
- File upload processing (dedup at upload, chunk-processing, manifest-commit).
- Webhook delivery (dedup at receipt, processing, downstream-call).
Two of those examples I’ve built since Globe; both used a variant of the three-layer pattern. The pattern doesn’t get old.