CDC for Spanner migration — the handoff that breaks
Bulk-loading a 14 TB table into Spanner takes hours. Production can’t be offline that long. The standard pattern is bulk + CDC: bulk for the historical data, CDC for the changes that happen while the bulk runs. The handoff between the two is where migrations fall over.
The pipeline shape
    Source DB (Postgres) ───► Datastream ───► Pub/Sub ───► Dataflow ───► Spanner
                                  │
                                  └─► writes change events from the
                                      source's logical replication slot
Datastream tails the source’s WAL. Pub/Sub buffers. Dataflow applies the changes to Spanner. The pipeline delivers events reliably, but it does not deliver them in order.
The ordering problem
Pub/Sub doesn’t preserve order across messages by default, so events for the same row can arrive out of order. The Dataflow consumer has to reorder them before applying to Spanner, or it’ll apply an old UPDATE on top of a newer one.
The Migration Tool’s CDC consumer does this with a windowed buffer keyed by the source’s commit timestamp:
    for each event:
        insert into buffer keyed by (table, pk, commit_ts)
        flush buffer entries older than (now - window)

    flush:
        for each (table, pk):
            apply the events in commit_ts order
The window has to be larger than the worst-case Datastream-to-Dataflow latency. Too small and you apply out-of-order. Too large and the destination lags. In practice 30-60 seconds is the right ballpark; we tune per workload.
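A minimal Python sketch of that buffer, for illustration only. It is not the Migration Tool's code: `ChangeEvent`, `ReorderBuffer`, and the injectable `clock` are hypothetical names, and it assumes the source's commit timestamps and the consumer's clock are comparable (both in seconds since the epoch). An event is released once its `commit_ts` falls behind `now - window`, which is only safe if no event arrives later than the window allows.

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class ChangeEvent:
    table: str
    pk: tuple
    commit_ts: float          # source commit timestamp, seconds since epoch
    payload: Any = None


class ReorderBuffer:
    """Holds events until they are older than the window, then releases
    them per row in commit_ts order."""

    def __init__(self, window_s: float, clock: Callable[[], float] = time.time):
        self.window_s = window_s
        self.clock = clock
        self._rows = defaultdict(list)   # (table, pk) -> [ChangeEvent, ...]

    def add(self, event: ChangeEvent) -> None:
        self._rows[(event.table, event.pk)].append(event)

    def flush(self) -> List[ChangeEvent]:
        """Return every buffered event with commit_ts <= now - window."""
        watermark = self.clock() - self.window_s
        ready: List[ChangeEvent] = []
        for key in list(self._rows):
            events = sorted(self._rows[key], key=lambda e: e.commit_ts)
            due = [e for e in events if e.commit_ts <= watermark]
            rest = [e for e in events if e.commit_ts > watermark]
            ready.extend(due)            # already in commit_ts order for this row
            if rest:
                self._rows[key] = rest
            else:
                del self._rows[key]
        return ready
```

The caller applies whatever `flush()` returns, in the order returned. Ordering is only guaranteed within a row, which is all the overwrite problem requires.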
The handoff
The handoff sequence — the one that’s easy to get wrong:
1. Start CDC against the source. CDC begins capturing changes from time T0.
2. Snapshot the source at time T1 > T0. The snapshot is a consistent point-in-time view.
3. Bulk-load the snapshot into Spanner. Takes hours; CDC keeps running in parallel.
4. Bulk completes at time T2. Spanner now has the snapshot up to T1.
5. Apply CDC events with commit_ts > T1 to Spanner. This is the gap between the snapshot and now.
6. Cutover when CDC lag is small (e.g. <10 seconds): briefly block writes on the source, drain CDC, switch reads + writes to Spanner.
Steps 1 and 2 must be in that order. If you snapshot first and start CDC second, you have a gap between the snapshot timestamp and the CDC start where changes are lost. Most migration failures trace to inverting these two steps.
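To make the gap concrete, here is a toy Python model. The function name and the integer timestamps are made up for illustration. It treats the snapshot as covering commits up to and including its timestamp, and CDC as covering commits from its start time onward.

```python
def lost_changes(commit_times, snapshot_ts, cdc_start_ts):
    """Commit times covered by neither the snapshot nor the CDC stream.

    The snapshot contains commits with t <= snapshot_ts.
    CDC captures commits with t >= cdc_start_ts.
    Anything strictly between the two is seen by nobody.
    """
    return [t for t in commit_times if snapshot_ts < t < cdc_start_ts]


commits = [5, 12, 18, 25]

# Correct order: CDC starts at T0=10, snapshot at T1=20. Nothing falls through.
print(lost_changes(commits, snapshot_ts=20, cdc_start_ts=10))   # []

# Inverted order: snapshot at 10, CDC starts at 20. Commits 12 and 18 are lost.
print(lost_changes(commits, snapshot_ts=10, cdc_start_ts=20))   # [12, 18]
```

With the correct order the two ranges overlap instead of leaving a hole, and that overlap is the duplication dealt with in the next section.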
The deduplication problem
CDC has been capturing since T0, so its backlog covers T0 to now. But the snapshot loaded in step 3 already contains every change up to T1. So the CDC events between T0 and T1 are duplicates of changes already in Spanner, and step 5 has to either tolerate them or skip them.
Two ways to handle this:
1. Idempotent application. Use INSERT OR UPDATE semantics. The duplicate doesn’t matter because applying the same change twice is a no-op.
2. Filter by timestamp. Drop CDC events with commit_ts <= T1.
Option 1 is more robust if your schema supports it. Option 2 is faster but breaks if the snapshot timestamp is fuzzy (some databases give you only second-resolution timestamps for the snapshot, and CDC events at the boundary become ambiguous).
The Migration Tool uses option 1 by default and falls back to option 2 for sources where idempotent application isn’t safe.
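Both options fit in a few lines of Python. This is a sketch under stated assumptions, not the Migration Tool's implementation: a plain dict stands in for the Spanner table, and the event shape (`pk`, `op`, `row`, `commit_ts`) is hypothetical.

```python
def apply_idempotent(table: dict, event: dict) -> None:
    """Option 1: upsert or delete keyed by primary key.

    Applying the same event twice leaves the table in the same state,
    which is the property INSERT OR UPDATE gives you.
    """
    if event["op"] == "DELETE":
        table.pop(event["pk"], None)        # deleting a missing row is fine
    else:
        table[event["pk"]] = dict(event["row"])


def filter_after_snapshot(events: list, snapshot_ts: float) -> list:
    """Option 2: keep only events committed strictly after the snapshot."""
    return [e for e in events if e["commit_ts"] > snapshot_ts]


events = [
    {"pk": 1, "op": "UPDATE", "row": {"name": "b"}, "commit_ts": 5},
    {"pk": 2, "op": "INSERT", "row": {"name": "c"}, "commit_ts": 15},
]

# Option 1: replaying the full backlog, even twice, converges to one state.
table = {1: {"name": "b"}}                  # snapshot already holds the first change
for e in events + events:
    apply_idempotent(table, e)
print(table)                                # {1: {'name': 'b'}, 2: {'name': 'c'}}

# Option 2: with a snapshot at T1=10, only the second event is replayed.
print(filter_after_snapshot(events, snapshot_ts=10))
```

The strict `>` in the filter is where the fuzzy-timestamp problem bites: if T1 is only known to the second, events committed within that second can land on either side of the comparison.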
What goes wrong
Three classic failure modes:
1. Snapshot is too old
Say you started the snapshot at midnight and the bulk ran for 14 hours. CDC has been buffering changes the whole time, and the CDC consumer hasn’t kept up. The lag is now hours, not seconds, so a cutover would mean a multi-hour outage.
Mitigation: start the CDC consumer earlier, or run it with more parallelism, or shard the bulk so it completes faster.
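A back-of-envelope way to see whether more parallelism is enough: the backlog only shrinks by the difference between the apply rate and the rate at which the source keeps producing changes. The Python helper below is a rough model that assumes both rates are constant; the function name and the numbers in the example are illustrative, not measurements.

```python
import math


def catch_up_seconds(backlog_events: float, apply_rate: float, ingest_rate: float) -> float:
    """Time to drain the backlog while new changes keep arriving.

    Rates are in events per second. If the consumer is not faster than
    the source, the backlog never drains.
    """
    if apply_rate <= ingest_rate:
        return math.inf
    return backlog_events / (apply_rate - ingest_rate)


# 14 hours of changes at 200 events/s, consumer applying 500 events/s.
backlog = 14 * 3600 * 200
print(catch_up_seconds(backlog, apply_rate=500, ingest_rate=200))   # 33600.0
```

In this made-up case the consumer still needs about 9.3 hours after the bulk finishes before cutover is realistic, which is the argument for running it during the bulk rather than after.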
2. Schema drift
The source schema changed during the migration. A column was added. The CDC events for that column don’t have a Spanner column to land in. Dataflow errors out.
Mitigation: freeze schema changes during the migration window. If that’s not possible, plan a schema sync step before the cutover.
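If a freeze isn't possible, a cheap guard is to compare each event's columns with the destination's before applying it, so drift surfaces as a clear message instead of a failed write. A minimal Python sketch; the function name and the column names are hypothetical, and how you fetch the destination's column list is left out.

```python
def unmapped_columns(event_row: dict, destination_columns: set) -> list:
    """Columns present in the CDC event but absent from the destination table."""
    return sorted(set(event_row) - set(destination_columns))


spanner_columns = {"id", "email"}
event_row = {"id": 1, "email": "a@example.com", "loyalty_tier": "gold"}

drift = unmapped_columns(event_row, spanner_columns)
if drift:
    print(f"schema drift, no destination column for: {drift}")
```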
3. Dataflow consumer crash with state loss
The Dataflow consumer’s buffer was in memory. Job restart loses the buffer; the next CDC events apply out of order against rows that were waiting for their predecessor.
Mitigation: persist the buffer state. The Migration Tool does this via Cloud Storage checkpointing; if you roll your own consumer, build the same.
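For a hand-rolled consumer, the core of it is writing the pending events somewhere durable and reading them back on restart. The Python sketch below uses a local file as a stand-in for a Cloud Storage object and makes no claim about how the Migration Tool lays out its checkpoints; the function names and the JSON layout are assumptions.

```python
import json
import os


def save_checkpoint(path: str, pending: list) -> None:
    """Write the not-yet-applied events, replacing any earlier checkpoint.

    Writes to a temp file first and renames it, so a crash mid-write
    never leaves a half-written checkpoint behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"version": 1, "pending": pending}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path: str) -> list:
    """Return the pending events from the last checkpoint, or [] if none exists."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)["pending"]
```

On restart, the consumer reloads the pending events into its buffer before reading anything new from Pub/Sub. Events must be JSON-serializable for this to work, so tuple keys come back as lists.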
DLQ patterns
When the consumer can’t apply an event (schema mismatch, type incompatibility, FK violation), send it to a dead-letter queue rather than crashing. The DLQ gives operators a place to inspect and re-apply manually.
The DLQ schema we used:
    {
      "source_event": <original Pub/Sub message>,
      "error": "FK violation on customer_id=123",
      "retry_count": 3,
      "last_attempt": "2026-06-04T14:23:01Z"
    }
The DLQ topic feeds into a Cloud Storage bucket. An operator can download a day’s worth, fix root cause in the source, and replay.
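A sketch of the consumer side in Python, assuming a simple retry-then-dead-letter policy. The record fields match the schema above; `apply`, `send_to_dlq`, and the retry limit are hypothetical hooks standing in for the Spanner write and the Pub/Sub publish.

```python
from datetime import datetime, timezone


def dlq_record(source_event, error, retry_count, now=None):
    """Build a DLQ entry in the shape shown above."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "source_event": source_event,
        "error": str(error),
        "retry_count": retry_count,
        "last_attempt": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def apply_with_dlq(event, apply, send_to_dlq, max_retries=3):
    """Try to apply an event; after max_retries failures, dead-letter it.

    Returns True if the event was applied, False if it went to the DLQ.
    The consumer keeps running either way.
    """
    last_error = None
    for _ in range(max_retries):
        try:
            apply(event)
            return True
        except Exception as exc:        # any apply failure counts as a retry
            last_error = exc
    send_to_dlq(dlq_record(event, last_error, max_retries))
    return False
```

A real consumer would likely back off between attempts and skip retries for errors that cannot succeed, such as a missing column.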
The cutover script
The cutover is the most-rehearsed part of the migration. The script:
1. Acquire write lock on source (deny new writes)
2. Wait for CDC lag to reach zero
3. Verify row counts: source = spanner
4. Sample-diff 1000 random rows
5. Update application config: writes go to spanner
6. Update application config: reads go to spanner
7. Release write lock on source (now read-only)
8. Monitor for 24 hours; rollback path remains until verified
Steps 5 and 6 are usually separated by minutes — flip writes first, wait for the dust to settle, then flip reads. If reads break, you have a fallback.
The script lives in version control. The team rehearses it against staging cutovers at least three times before the production cutover. The first one always reveals something missed.
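The steps above can be sketched as a small Python driver. This is an illustration of the control flow, not the script we ran: every field of `CutoverOps` is a hypothetical hook you would wire to your own tooling, and the abort path (unlock the source if verification fails before any traffic has moved) is an assumption about what a safe failure looks like.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CutoverOps:
    """Hypothetical hooks; each wraps whatever your environment really does."""
    lock_source_writes: Callable[[], None]
    abort_and_unlock_source: Callable[[], None]
    cdc_lag_seconds: Callable[[], float]
    row_counts_match: Callable[[], bool]
    sample_diff_clean: Callable[[int], bool]
    route_writes_to_spanner: Callable[[], None]
    route_reads_to_spanner: Callable[[], None]
    release_source_lock: Callable[[], None]


def run_cutover(ops, lag_timeout_s=300.0, settle_s=0.0,
                sleep=time.sleep, clock=time.monotonic):
    ops.lock_source_writes()                          # step 1
    try:
        deadline = clock() + lag_timeout_s
        while ops.cdc_lag_seconds() > 0:              # step 2
            if clock() >= deadline:
                raise RuntimeError("CDC lag did not reach zero before the timeout")
            sleep(1.0)
        if not ops.row_counts_match():                # step 3
            raise RuntimeError("row counts differ between source and Spanner")
        if not ops.sample_diff_clean(1000):           # step 4
            raise RuntimeError("sampled rows differ between source and Spanner")
    except Exception:
        # Nothing has been switched yet, so the source can resume serving.
        ops.abort_and_unlock_source()
        raise
    ops.route_writes_to_spanner()                     # step 5
    sleep(settle_s)                                   # pause between the two flips
    ops.route_reads_to_spanner()                      # step 6
    ops.release_source_lock()                         # step 7
    # Step 8, the 24-hour monitoring window, happens outside this function.
```

Passing `sleep` and `clock` in as parameters is what makes the script rehearsable: a staging run or a unit test can substitute fakes and exercise the failure paths in milliseconds.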
What I learned contributing
The CDC handoff was the most subtle part of the Spanner Migration Tool I worked on. The bulk loader is well-trodden territory; the handoff is where the per-source quirks accumulate. Most of the PRs I landed in this area were small fixes — a different ordering guarantee from a new Datastream version, a corner case in the deduplication when the source used millisecond timestamps, an improvement in the DLQ schema.
The pattern transfers beyond Spanner. Any minimal-downtime migration that mixes bulk + CDC has this shape. The discipline around start order, deduplication, and cutover rehearsal is more important than the specific tools.