
CDC for Spanner migration — the handoff that breaks

Bulk-loading a 14 TB table into Spanner takes hours. Production can’t be offline that long. The standard pattern is bulk + CDC: bulk for the historical data, CDC for the changes that happen while the bulk runs. The handoff between the two is where migrations fall over.

The pipeline shape

Source DB (Postgres) ───► Datastream ───► Pub/Sub ───► Dataflow ───► Spanner
                              │
                              └─► writes change events from the
                                   source's logical replication slot

Datastream tails the source’s WAL. Pub/Sub buffers the change events. Dataflow applies them to Spanner. Delivery is reliable (at-least-once), but ordering is not guaranteed.

The ordering problem

Pub/Sub doesn’t guarantee ordering by default, so events for the same row can arrive out of order. The Dataflow consumer has to reorder them before applying to Spanner, or it’ll apply an old UPDATE on top of a newer one.

The Migration Tool’s CDC consumer does this with a windowed buffer keyed by the source’s commit timestamp:

for each event:
  insert into buffer keyed by (table, pk, commit_ts)
  flush buffer entries older than (now - window)

flush:
  for each (table, pk):
    apply the events in commit_ts order

The window has to be larger than the worst-case Datastream-to-Dataflow latency. Too small and you apply out-of-order. Too large and the destination lags. In practice 30-60 seconds is the right ballpark; we tune per workload.
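To make the buffer concrete, here is a minimal in-memory sketch of the pseudocode above. It is not the Migration Tool's code: the `ChangeEvent` and `ReorderBuffer` names are hypothetical, and it assumes "older than the window" is measured on the source commit timestamp, with source and consumer clocks close enough to compare.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    """Hypothetical change event; field names are assumptions for this sketch."""
    table: str
    pk: str
    commit_ts: float  # source commit timestamp, seconds since epoch
    payload: dict = field(default_factory=dict)


class ReorderBuffer:
    def __init__(self, window_s: float):
        self.window_s = window_s
        # (table, pk) -> events not yet released
        self._pending = defaultdict(list)

    def add(self, event: ChangeEvent) -> None:
        self._pending[(event.table, event.pk)].append(event)

    def flush(self, now: float) -> list:
        """Release events with commit_ts <= now - window, in commit_ts order per key."""
        cutoff = now - self.window_s
        ready = []
        for key in list(self._pending):
            events = sorted(self._pending[key], key=lambda e: e.commit_ts)
            ready.extend(e for e in events if e.commit_ts <= cutoff)
            remaining = [e for e in events if e.commit_ts > cutoff]
            if remaining:
                self._pending[key] = remaining
            else:
                del self._pending[key]
        return ready


buf = ReorderBuffer(window_s=30)
buf.add(ChangeEvent("orders", "42", 105.0, {"status": "shipped"}))  # arrives first
buf.add(ChangeEvent("orders", "42", 100.0, {"status": "paid"}))     # older, arrives late
released = buf.flush(now=200.0)  # "paid" is released before "shipped"
```

An event that arrives later than the window allows would still be applied out of order, which is why the window has to cover the worst-case latency.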

The handoff

The handoff sequence — the one that’s easy to get wrong:

  1. Start CDC against the source. CDC begins capturing changes from time T0.
  2. Snapshot the source at time T1 > T0. The snapshot is a consistent point-in-time view.
  3. Bulk-load the snapshot into Spanner. Takes hours; CDC keeps running in parallel.
  4. Bulk completes at time T2. Spanner now has the snapshot up to T1.
  5. Apply CDC events with commit_ts > T1 to Spanner. This is the gap between the snapshot and now.
  6. Cutover when CDC lag is small (e.g. <10 seconds): briefly block writes on the source, drain CDC, switch reads + writes to Spanner.

Steps 1 and 2 must be in that order. If you snapshot first and start CDC second, you have a gap between the snapshot timestamp and the CDC start where changes are lost. Most migration failures trace to inverting these two steps.
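Because the inversion is so easy to make, it can be worth a pre-flight check. The sketch below is a hypothetical guard, not something taken from the Migration Tool; it assumes you can read both the CDC start time (T0) and the snapshot time (T1) on the same clock.

```python
def handoff_gap_seconds(cdc_start_ts: float, snapshot_ts: float) -> float:
    """Length of the window covered by neither the snapshot nor CDC.

    Zero when CDC started at or before the snapshot (the safe order).
    """
    return max(0.0, cdc_start_ts - snapshot_ts)


def check_handoff(cdc_start_ts: float, snapshot_ts: float) -> None:
    """Refuse to proceed if the snapshot was taken before CDC started."""
    gap = handoff_gap_seconds(cdc_start_ts, snapshot_ts)
    if gap > 0:
        raise ValueError(
            f"CDC started {gap:.0f}s after the snapshot; "
            "changes committed in that window would be lost"
        )


check_handoff(cdc_start_ts=100.0, snapshot_ts=160.0)  # safe order, no error
```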

The deduplication problem

The CDC stream holds every event from T0 onward. But the snapshot loaded in step 3 already reflects every change up to T1. So the CDC events between T0 and T1 are duplicates of changes already in Spanner, and step 5 has to neutralize them.

Two ways to handle this:

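  1. Idempotent application. Use INSERT OR UPDATE semantics. Replaying a change that is already in Spanner is harmless: as long as events are applied in commit order, each row converges to its latest state.
  2. Filter by timestamp. Drop CDC events with commit_ts <= T1, so only the events from step 5 (commit_ts > T1) are applied.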

Option 1 is more robust if your schema supports it. Option 2 is faster but breaks if the snapshot timestamp is fuzzy (some databases give you only second-resolution timestamps for the snapshot, and CDC events at the boundary become ambiguous).

The Migration Tool uses option 1 by default and falls back to option 2 for sources where idempotent application isn’t safe.
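Both options can be sketched against a plain dict standing in for a Spanner table. This is an illustration under assumptions, not the tool's implementation: the event shape (`op`, `pk`, `commit_ts`, `row`) is hypothetical, each event is assumed to carry the full row image, and events are assumed to arrive already sorted by commit timestamp.

```python
def apply_idempotent(table: dict, event: dict) -> None:
    """Option 1: upsert or delete by primary key, so replays are harmless."""
    if event["op"] == "DELETE":
        table.pop(event["pk"], None)  # deleting a missing row is not an error
    else:
        # INSERT and UPDATE both become an upsert of the full row image.
        table[event["pk"]] = dict(event["row"])


def after_snapshot(events: list, snapshot_ts: float) -> list:
    """Option 2: keep only events committed strictly after the snapshot."""
    return [e for e in events if e["commit_ts"] > snapshot_ts]


snapshot_ts = 100  # T1
snapshot = {"42": {"status": "paid"}}  # state loaded by the bulk step
events = [  # CDC stream from T0 onward
    {"op": "INSERT", "pk": "42", "commit_ts": 90, "row": {"status": "new"}},
    {"op": "UPDATE", "pk": "42", "commit_ts": 95, "row": {"status": "paid"}},
    {"op": "UPDATE", "pk": "42", "commit_ts": 110, "row": {"status": "shipped"}},
]

replayed = dict(snapshot)
for e in events:  # option 1: replay everything
    apply_idempotent(replayed, e)

filtered = dict(snapshot)
for e in after_snapshot(events, snapshot_ts):  # option 2: skip the overlap
    apply_idempotent(filtered, e)
```

In this example both tables end with row 42 as "shipped". Note that during the replay in option 1 the row briefly holds older values, so readers of the destination should not trust it until the consumer has caught up past T1.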

What goes wrong

Three classic failure modes:

1. Snapshot is too old

You started the snapshot at midnight and the bulk ran for 14 hours. CDC has been buffering changes the whole time, and the CDC consumer hasn’t kept up. The lag is now hours, not seconds. Cutover would be a multi-hour outage.

Mitigation: start the CDC consumer earlier, or run it with more parallelism, or shard the bulk so it completes faster.
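A back-of-the-envelope estimate shows why parallelism matters here. The consumer only gains on the backlog by the difference between its apply rate and the source's ongoing change rate. The function name and the rates below are made-up numbers for illustration, not measurements.

```python
def catch_up_seconds(backlog_events: float, apply_rate: float, source_rate: float) -> float:
    """Time to drain a CDC backlog while the source keeps producing changes.

    Rates are in events per second. Returns infinity if the consumer is
    not faster than the source, because the backlog then never shrinks.
    """
    headroom = apply_rate - source_rate
    if headroom <= 0:
        return float("inf")
    return backlog_events / headroom


# Hypothetical workload: 14 hours of buffered changes at 2,000 events/s,
# drained by a consumer that applies 5,000 events/s.
backlog = 14 * 3600 * 2000
hours = catch_up_seconds(backlog, apply_rate=5000, source_rate=2000) / 3600
```

With these assumed rates the catch-up takes a little over nine hours. Doubling the apply rate to 10,000 events/s would cut it to three and a half.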

2. Schema drift

The source schema changed during the migration. A column was added. The CDC events for that column don’t have a Spanner column to land in. Dataflow errors out.

Mitigation: freeze schema changes during the migration window. If that’s not possible, plan a schema sync step before the cutover.
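If a freeze can't be enforced, the consumer can at least detect drift before it fails on a write. The check below is a hypothetical sketch: it assumes each event exposes its row as a column-to-value dict and that the destination column set has been read from the Spanner schema ahead of time.

```python
def unknown_columns(event_row: dict, spanner_columns: set) -> set:
    """Columns present in the CDC event that the destination table lacks."""
    return set(event_row) - set(spanner_columns)


spanner_columns = {"id", "customer_id", "status"}
event_row = {"id": 7, "customer_id": 123, "status": "paid", "coupon_code": "X1"}

drift = unknown_columns(event_row, spanner_columns)
if drift:
    # In a real consumer this is the point to pause, alert, or dead-letter
    # the event instead of letting the write fail.
    print(f"schema drift detected, unknown columns: {sorted(drift)}")
```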

3. Dataflow consumer crash with state loss

The Dataflow consumer’s buffer was in memory. A job restart loses the buffer, and the next CDC events are applied out of order against rows that were still waiting for a predecessor.

Mitigation: persist the buffer state. The Migration Tool does this via Cloud Storage checkpointing; if you roll your own consumer, build the same.
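For a home-grown consumer, the core of the idea is a save/restore pair around the buffer. The sketch below uses a local JSON file as a stand-in for a Cloud Storage object and is not how the Migration Tool does it; the function names and buffer shape are assumptions.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, pending: dict) -> None:
    """Persist the buffer, shaped {(table, pk): [event dicts]}, atomically."""
    records = [
        {"table": table, "pk": pk, "events": events}
        for (table, pk), events in pending.items()
    ]
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(records, f)
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Restore the buffer after a restart; empty if no checkpoint exists."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return {(r["table"], r["pk"]): r["events"] for r in json.load(f)}
```

The ordering matters: write the checkpoint before acknowledging the corresponding Pub/Sub messages. Acknowledge first and a crash in between still loses the buffered events.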

DLQ patterns

When the consumer can’t apply an event (schema mismatch, type incompatibility, FK violation), send it to a dead-letter queue rather than crashing. The DLQ gives operators a place to inspect and re-apply manually.

The DLQ schema we used:

{
  "source_event": <original Pub/Sub message>,
  "error": "FK violation on customer_id=123",
  "retry_count": 3,
  "last_attempt": "2026-06-04T14:23:01Z"
}

The DLQ topic feeds into a Cloud Storage bucket. An operator can download a day’s worth, fix the root cause in the source, and replay.
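A sketch of the retry-then-dead-letter path that produces records in that schema. The `apply_with_dlq` helper, the retry limit, and the list standing in for the DLQ topic are all assumptions for illustration.

```python
from datetime import datetime, timezone


def apply_with_dlq(event: dict, apply_fn, dlq: list, max_retries: int = 3) -> bool:
    """Try to apply an event; after max_retries failures, dead-letter it.

    `apply_fn` is whatever writes the event to the destination, and `dlq`
    is a list standing in for a publish to the DLQ topic.
    """
    error = ""
    for _ in range(max_retries):
        try:
            apply_fn(event)
            return True
        except Exception as exc:  # broad on purpose: the consumer must not crash
            error = str(exc)
    dlq.append({
        "source_event": event,
        "error": error,
        "retry_count": max_retries,
        "last_attempt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    })
    return False


def failing_apply(event: dict) -> None:
    raise ValueError(f"FK violation on customer_id={event['customer_id']}")


dlq = []
applied = apply_with_dlq({"pk": "7", "customer_id": 123}, failing_apply, dlq)
```

A real consumer would likely add backoff between attempts and skip retries for errors that can't succeed on a second try, such as a type mismatch.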

The cutover script

The cutover is the most-rehearsed part of the migration. The script:

1. Acquire write lock on source (deny new writes)
2. Wait for CDC lag to reach zero
3. Verify row counts: source = spanner
4. Sample-diff 1000 random rows
5. Update application config: writes go to spanner
6. Update application config: reads go to spanner
7. Release write lock on source; keep source read-only
8. Monitor for 24 hours; rollback path remains until verified

Steps 5 and 6 are usually separated by minutes — flip writes first, wait for the dust to settle, then flip reads. If reads break, you have a fallback.
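The two verification steps (3 and 4) are simple to sketch. Here both sides are dicts keyed by primary key. In a real script they would be a COUNT query and keyed lookups against each database, so treat the function below as a hypothetical outline rather than the actual cutover script.

```python
import random


def verify_cutover(source: dict, spanner: dict, sample_size: int = 1000, seed=None):
    """Return (counts_match, mismatched_pks) for a row-count check plus sample diff."""
    if len(source) != len(spanner):
        return False, []
    rng = random.Random(seed)
    sample = rng.sample(list(source), min(sample_size, len(source)))
    # A key missing from the destination counts as a mismatch too.
    mismatched = [pk for pk in sample if spanner.get(pk) != source[pk]]
    return True, mismatched


source_rows = {i: {"status": "paid"} for i in range(5000)}
spanner_rows = {i: {"status": "paid"} for i in range(5000)}
counts_match, mismatched = verify_cutover(source_rows, spanner_rows, seed=1)
```

A clean sample diff is evidence, not proof: 1000 rows out of billions will miss rare corruption, which is one reason the rollback path stays open during the monitoring period.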

The script lives in version control. The team rehearses it against staging cutovers at least three times before the production cutover. The first one always reveals something missed.

What I learned contributing

The CDC handoff was the most subtle part of the Spanner Migration Tool I worked on. The bulk loader is well-trodden territory; the handoff is where the per-source quirks accumulate. Most of the PRs I landed in this area were small fixes — a different ordering guarantee from a new Datastream version, a corner case in the deduplication when the source used millisecond timestamps, an improvement in the DLQ schema.

The pattern transfers beyond Spanner. Any minimal-downtime migration that mixes bulk + CDC has this shape. The discipline around start order, deduplication, and cutover rehearsal is more important than the specific tools.
