
The Spanner Migration Tool — a contributor’s reading map

Google’s open-source Spanner Migration Tool (formerly HarbourBridge) is a substantial Go codebase. Here is the reading path that worked for me when I started contributing, and the mental model that makes the rest of the repo navigable.

The three engines

The tool is conceptually three engines bolted together:

┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│ schema engine    │ → │ data engine      │ → │ verification     │
│                  │   │                  │   │ engine           │
│ - parse source   │   │ - dump → COPY    │   │ - row counts     │
│ - emit Spanner   │   │ - or CDC stream  │   │ - sample diffs   │
│   DDL            │   │   via Datastream │   │ - schema drift   │
│ - schema review  │   │ + Pub/Sub +      │   │   detection      │
│   workflow       │   │   Dataflow       │   │                  │
└──────────────────┘   └──────────────────┘   └──────────────────┘

Most of the surface area lives in the schema engine. Most of the performance work lives in the data engine. Most of the support tickets land in verification.

Where to start reading

sources/ is where the per-source parsers live: Postgres, MySQL, DynamoDB, MS SQL, Oracle. Each one implements a small interface that emits an intermediate representation. The IR is the contract between the parsers and the rest of the tool.

spanner/ is where the Spanner DDL emission lives. It takes the IR, applies the schema-review decisions, and produces the CREATE TABLE statements.

cmd/ has the CLI entry points. webv2/ has the React + Go UI for the Intelligent Schema Assistant.

Read in this order:

  1. internal/internal.go — the IR types. Everything that crosses the parser ↔ emitter boundary is one of these.
  2. One of sources/postgres/*.go end to end. Postgres is the best-maintained source; you’ll see the pattern for the others.
  3. spanner/ddl/ast.go and the emitter. This is where the Spanner-specific decisions land.
  4. webv2/api/ for the JSON contract between the React UI and the Go backend.

The IR is the spine

If you only read one file, read internal/internal.go. The IR encodes:

Every parser produces this; every emitter consumes it. The schema review step (manual review by a DBA in the UI) edits it in place. The data migration step reads the final IR to know how to map rows.

If you’re adding a new source database, your job is: take the source schema, produce a valid IR. If you’re improving Spanner emission, your job is: take the IR, produce better DDL.
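A minimal sketch of that contract, to make the parser/emitter split concrete. Every name here (Column, Table, Schema, SourceParser, fakePostgres, EmitDDL) is hypothetical and invented for illustration; the real IR types in internal/internal.go are richer and shaped differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Column, Table and Schema are hypothetical stand-ins for the IR types.
// They are NOT the definitions found in internal/internal.go.
type Column struct {
	Name    string
	Type    string // already expressed as a Spanner type
	NotNull bool
}

type Table struct {
	Name       string
	Columns    []Column
	PrimaryKey []string
}

type Schema struct {
	Tables []Table
}

// SourceParser is an assumed shape for the per-source contract:
// read a source schema, return IR.
type SourceParser interface {
	Parse() (Schema, error)
}

// fakePostgres returns a fixed schema instead of reading a real dump.
type fakePostgres struct{}

func (fakePostgres) Parse() (Schema, error) {
	return Schema{Tables: []Table{{
		Name: "users",
		Columns: []Column{
			{Name: "id", Type: "INT64", NotNull: true},
			{Name: "email", Type: "STRING(MAX)"},
		},
		PrimaryKey: []string{"id"},
	}}}, nil
}

// EmitDDL turns IR into CREATE TABLE statements. It only sees the IR,
// never the source database.
func EmitDDL(s Schema) []string {
	var out []string
	for _, t := range s.Tables {
		var cols []string
		for _, c := range t.Columns {
			def := c.Name + " " + c.Type
			if c.NotNull {
				def += " NOT NULL"
			}
			cols = append(cols, def)
		}
		out = append(out, fmt.Sprintf("CREATE TABLE %s (%s) PRIMARY KEY (%s)",
			t.Name, strings.Join(cols, ", "), strings.Join(t.PrimaryKey, ", ")))
	}
	return out
}

func main() {
	var p SourceParser = fakePostgres{}
	schema, err := p.Parse()
	if err != nil {
		panic(err)
	}
	for _, stmt := range EmitDDL(schema) {
		fmt.Println(stmt)
	}
}
```

The point of the shape is that adding a source means writing another SourceParser, and improving emission means touching only EmitDDL; neither side needs to know about the other.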

The parts that look simple

Type mapping. Postgres numeric(38, 9) → Spanner NUMERIC. Sounds trivial. Then you discover that Postgres’ numeric has unbounded precision unless declared, that some sources use it as a money type with implicit scale, and that the choice affects index efficiency on the Spanner side. The type-mapping code is short. The decisions behind it took weeks per source.
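To show why the short code hides real decisions, here is a sketch of one possible numeric mapping. It relies on one fact about Spanner (GoogleSQL NUMERIC holds 29 integer digits and 9 fractional digits) and one assumption that is mine rather than the tool's: out-of-range declarations fall back to STRING(MAX). PGNumeric and MapNumeric are hypothetical names.

```go
package main

import "fmt"

// Spanner's GoogleSQL NUMERIC has precision 38 and scale 9:
// up to 29 digits before the decimal point and 9 after it.
const (
	maxIntDigits  = 29
	maxFracDigits = 9
)

// PGNumeric describes a Postgres numeric column. Declared is false for a
// bare `numeric` with no precision or scale. (Hypothetical type.)
type PGNumeric struct {
	Declared  bool
	Precision int
	Scale     int
}

// MapNumeric returns a Spanner type plus a human-readable issue, which is
// empty when the mapping is lossless. The STRING(MAX) fallback is an
// assumed policy for illustration, not the tool's documented behaviour.
func MapNumeric(n PGNumeric) (spannerType, issue string) {
	switch {
	case !n.Declared:
		return "NUMERIC", "source precision is unbounded; values may overflow"
	case n.Precision-n.Scale > maxIntDigits || n.Scale > maxFracDigits:
		return "STRING(MAX)", "declared range exceeds NUMERIC; stored as text"
	default:
		return "NUMERIC", ""
	}
}

func main() {
	cases := []PGNumeric{
		{Declared: true, Precision: 38, Scale: 9},
		{Declared: true, Precision: 40, Scale: 2},
		{Declared: false},
	}
	for _, c := range cases {
		typ, issue := MapNumeric(c)
		fmt.Printf("%+v -> %s %q\n", c, typ, issue)
	}
}
```

Each branch is a policy decision someone had to make: whether to warn or fail on unbounded precision, and whether to lose arithmetic (text) or lose digits (truncation) when the range does not fit.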

Primary key inference. When the source table has no PK (Postgres lets you), the tool synthesises one. The synthesis has to be stable across reruns or the data migration breaks. The current heuristic walks unique indexes first, then composite candidates; understanding why it’s in that order is worth two hours.
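The stability requirement is easier to see in code. The sketch below is not the tool's heuristic; it is a simplified stand-in that prefers the unique index with the fewest columns, breaks ties by name, and falls back to a synthetic column. Index, InferPrimaryKey and the column name synth_id are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Index is a hypothetical, simplified view of a source index.
type Index struct {
	Name    string
	Columns []string
	Unique  bool
}

// syntheticKeyColumn is an assumed name for a generated key column.
const syntheticKeyColumn = "synth_id"

// InferPrimaryKey picks key columns for a table that declares no PK.
// The result depends only on the set of indexes, not on their order,
// so reruns over the same schema choose the same key.
func InferPrimaryKey(indexes []Index) (cols []string, synthetic bool) {
	var candidates []Index
	for _, ix := range indexes {
		if ix.Unique && len(ix.Columns) > 0 {
			candidates = append(candidates, ix)
		}
	}
	if len(candidates) == 0 {
		return []string{syntheticKeyColumn}, true
	}
	// Total order: fewer columns first, then index name.
	// Assumes index names are unique within a table.
	sort.Slice(candidates, func(i, j int) bool {
		a, b := candidates[i], candidates[j]
		if len(a.Columns) != len(b.Columns) {
			return len(a.Columns) < len(b.Columns)
		}
		return a.Name < b.Name
	})
	return candidates[0].Columns, false
}

func main() {
	indexes := []Index{
		{Name: "uq_name_dob", Columns: []string{"name", "dob"}, Unique: true},
		{Name: "ix_city", Columns: []string{"city"}},
		{Name: "uq_email", Columns: []string{"email"}, Unique: true},
	}
	cols, synthetic := InferPrimaryKey(indexes)
	fmt.Println(cols, synthetic)
}
```

The explicit tie-break is what matters: if the choice depended on map iteration or catalog ordering, a rerun could pick a different key and the data migration would no longer line up with the schema.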

Foreign keys. Spanner supports them now (it didn’t always). The emitter still has the option to drop FKs and re-add them after the data load. That option is the right default for big migrations and the wrong default for small ones. Read spanner/ddl/foreignkey.go and the tests around it before changing behaviour.
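As a rough picture of what the drop-and-re-add option means for statement ordering, the sketch below splits DDL into a before-load and an after-load phase. The ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY form is standard Spanner DDL; the ForeignKey, Plan and BuildPlan names are hypothetical and do not mirror spanner/ddl/foreignkey.go.

```go
package main

import (
	"fmt"
	"strings"
)

// ForeignKey is a hypothetical, simplified foreign key description.
type ForeignKey struct {
	Name       string
	Table      string
	Columns    []string
	RefTable   string
	RefColumns []string
}

// AddStatement renders the constraint as a standalone ALTER TABLE.
func (fk ForeignKey) AddStatement() string {
	return fmt.Sprintf("ALTER TABLE %s ADD CONSTRAINT %s FOREIGN KEY (%s) REFERENCES %s (%s)",
		fk.Table, fk.Name, strings.Join(fk.Columns, ", "),
		fk.RefTable, strings.Join(fk.RefColumns, ", "))
}

// Plan separates DDL that runs before the data load from DDL that runs after.
type Plan struct {
	BeforeLoad []string
	AfterLoad  []string
}

// BuildPlan places foreign keys after the load when deferFKs is true,
// so rows can be written in any table order without constraint checks.
func BuildPlan(createTables []string, fks []ForeignKey, deferFKs bool) Plan {
	p := Plan{BeforeLoad: append([]string(nil), createTables...)}
	for _, fk := range fks {
		if deferFKs {
			p.AfterLoad = append(p.AfterLoad, fk.AddStatement())
		} else {
			p.BeforeLoad = append(p.BeforeLoad, fk.AddStatement())
		}
	}
	return p
}

func main() {
	creates := []string{
		"CREATE TABLE customers (id INT64 NOT NULL) PRIMARY KEY (id)",
		"CREATE TABLE orders (id INT64 NOT NULL, customer_id INT64) PRIMARY KEY (id)",
	}
	fks := []ForeignKey{{
		Name: "fk_orders_customer", Table: "orders", Columns: []string{"customer_id"},
		RefTable: "customers", RefColumns: []string{"id"},
	}}
	plan := BuildPlan(creates, fks, true)
	fmt.Println("before load:", len(plan.BeforeLoad), "statements")
	fmt.Println("after load:", plan.AfterLoad[0])
}
```

The trade-off shows up in where validation cost lands: deferring moves it to one pass after the load, which pays off on large tables and is mostly overhead on small ones.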

The parts that don’t look simple, and aren’t

Interleaving. When to interleave a child table into a parent for locality is a judgment call. The tool exposes the choice via the schema review UI; the heuristic for the recommendation is in schema/recommendations/interleave.go. The recommendation accounts for child table size, expected join patterns, and the PK shape. Getting this right is one of the largest sources of post-migration performance wins (40-60% on the engagements I worked on).
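The "PK shape" part of that recommendation has one hard rule behind it: Spanner only allows interleaving when the child's primary key starts with the parent's primary key columns, in the same order. The sketch below checks just that precondition and renders the clause; it says nothing about table size or join patterns, and the function names are hypothetical rather than taken from schema/recommendations/interleave.go. ON DELETE CASCADE is chosen here for illustration only.

```go
package main

import (
	"fmt"
	"strings"
)

// HasParentKeyPrefix reports whether childPK begins with every column of
// parentPK in the same order, which Spanner requires for interleaving.
func HasParentKeyPrefix(parentPK, childPK []string) bool {
	if len(parentPK) == 0 || len(childPK) < len(parentPK) {
		return false
	}
	for i, col := range parentPK {
		if childPK[i] != col {
			return false
		}
	}
	return true
}

// InterleavedKeyClause renders the tail of a CREATE TABLE statement for an
// interleaved child. It returns false when the key shape rules it out.
func InterleavedKeyClause(parent string, parentPK, childPK []string) (string, bool) {
	if !HasParentKeyPrefix(parentPK, childPK) {
		return "", false
	}
	clause := fmt.Sprintf("PRIMARY KEY (%s),\n  INTERLEAVE IN PARENT %s ON DELETE CASCADE",
		strings.Join(childPK, ", "), parent)
	return clause, true
}

func main() {
	parentPK := []string{"SingerId"}
	childPK := []string{"SingerId", "AlbumId"}
	if clause, ok := InterleavedKeyClause("Singers", parentPK, childPK); ok {
		fmt.Println("CREATE TABLE Albums (\n  SingerId INT64 NOT NULL,\n  AlbumId INT64 NOT NULL\n) " + clause)
	}
	_, ok := InterleavedKeyClause("Singers", parentPK, []string{"AlbumId", "SingerId"})
	fmt.Println("reordered key can interleave:", ok)
}
```

A source schema where the child key is a lone surrogate ID fails this check, which is why an interleaving recommendation usually comes bundled with a proposed key change.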

Bulk vs CDC migration mode. Bulk is faster for the initial cut; CDC is necessary for minimal-downtime migrations. The handoff between the two has subtle ordering requirements: change capture has to be running before the bulk load starts reading, so that every change made while the bulk load runs is captured and can be replayed afterwards. That handoff lives across several packages; the comments in streaming/streaming.go are the best documentation.
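A toy, single-threaded simulation of the ordering requirement: if capture only begins once the bulk copy has finished, writes made during the copy are lost. Nothing here resembles the real Datastream, Pub/Sub or Dataflow wiring; Change, Capture and Migrate are hypothetical names and the "database" is a map.

```go
package main

import "fmt"

// Change is one write applied to the source while the migration runs.
type Change struct {
	Key   string
	Value string
}

// Capture records changes, but only those observed after Start.
type Capture struct {
	started  bool
	buffered []Change
}

func (c *Capture) Start() { c.started = true }

func (c *Capture) Observe(ch Change) {
	if c.started {
		c.buffered = append(c.buffered, ch)
	}
}

// Migrate snapshots source into a new target, applies writesDuringBulk to
// the source as if they arrived mid-copy, then replays whatever the capture
// saw. It mutates source. captureFirst controls whether capture starts
// before the snapshot or only after the bulk phase.
func Migrate(source map[string]string, writesDuringBulk []Change, captureFirst bool) map[string]string {
	capture := &Capture{}
	if captureFirst {
		capture.Start()
	}

	// Bulk phase: copy a snapshot of the source.
	target := make(map[string]string, len(source))
	for k, v := range source {
		target[k] = v
	}

	// Writes that land on the source after the snapshot was taken.
	for _, w := range writesDuringBulk {
		source[w.Key] = w.Value
		capture.Observe(w)
	}

	if !captureFirst {
		capture.Start() // too late: the writes above were never observed
	}

	// Replay phase: apply captured changes in arrival order.
	for _, ch := range capture.buffered {
		target[ch.Key] = ch.Value
	}
	return target
}

func main() {
	writes := []Change{{Key: "a", Value: "2"}, {Key: "b", Value: "3"}}

	good := Migrate(map[string]string{"a": "1"}, writes, true)
	bad := Migrate(map[string]string{"a": "1"}, writes, false)

	fmt.Println("capture first:", good["a"], good["b"])
	fmt.Println("capture late: ", bad["a"], len(bad))
}
```

The real system has to get the same guarantee across separate services, which is where the subtlety comes from: "started" has to mean the capture position is durably recorded, not merely that a job was launched.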

What contributing looked like

The PRs I shipped clustered around three themes:

  1. Schema redesign for write hotspots. PK design that avoided monotonically-increasing keys; this improved post-migration write throughput by 40-60% on the workloads we tested.
  2. CDC pipeline reliability. Datastream + Pub/Sub + Dataflow has out-of-order delivery semantics; added a DLQ + reordering buffer in the consumer.
  3. Backend APIs for the Intelligent Schema Assistant. The tool’s UI exposes a guided workflow; the JSON contract on the backend had grown organically and needed normalisation.
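For the second theme, the general idea of a reordering buffer with a dead-letter queue can be sketched as follows. This is not the code from those PRs. It assumes each event carries a dense, gap-free sequence number, which is a simplification; Event, Reorderer and the DLQ policy (stale, duplicate, or buffer-full events are dead-lettered) are all hypothetical.

```go
package main

import "fmt"

// Event is a change record with an assumed dense sequence number.
type Event struct {
	Seq     uint64
	Payload string
}

// Reorderer holds early events until the gap before them is filled.
type Reorderer struct {
	next       uint64
	pending    map[uint64]Event
	maxPending int
	DLQ        []Event // events that could not be delivered in order
}

func NewReorderer(first uint64, maxPending int) *Reorderer {
	return &Reorderer{next: first, pending: make(map[uint64]Event), maxPending: maxPending}
}

// Push accepts one event and returns the events that are now deliverable,
// in sequence order. Stale events, duplicates, and early events that
// arrive while the buffer is full go to the DLQ instead.
func (r *Reorderer) Push(e Event) []Event {
	if e.Seq < r.next {
		r.DLQ = append(r.DLQ, e) // already delivered: stale or duplicate
		return nil
	}
	if _, dup := r.pending[e.Seq]; dup {
		r.DLQ = append(r.DLQ, e)
		return nil
	}
	if e.Seq != r.next && len(r.pending) >= r.maxPending {
		r.DLQ = append(r.DLQ, e) // no room to wait for the gap to fill
		return nil
	}
	r.pending[e.Seq] = e

	// Drain every consecutive event starting at r.next.
	var out []Event
	for {
		ev, ok := r.pending[r.next]
		if !ok {
			break
		}
		delete(r.pending, r.next)
		out = append(out, ev)
		r.next++
	}
	return out
}

func main() {
	r := NewReorderer(1, 2)
	for _, e := range []Event{{2, "b"}, {1, "a"}, {1, "a-again"}, {3, "c"}} {
		for _, ready := range r.Push(e) {
			fmt.Println("deliver", ready.Seq, ready.Payload)
		}
	}
	fmt.Println("dead-lettered:", len(r.DLQ))
}
```

Bounding the buffer is the design choice worth noticing: an unbounded buffer turns one lost event into unbounded memory growth, whereas a bound plus a DLQ turns it into something an operator can inspect and replay.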

If you’re considering contributing: pick a real source database you’ve worked with, find an issue tagged good first issue against it, and start there. The maintainers are responsive and the bar for first PRs is reasonable.

What I learned

Reading a substantial Go codebase end-to-end is the single best way to improve as a Go engineer. The Spanner Migration Tool repays the reading because it solves a real problem (database migrations are expensive, and this tool makes them dramatically cheaper) and because the patterns are general (an IR-based transformation pipeline shows up everywhere). I grew more as a Go programmer over a year of contributing than from any other Go work I did in the same period.
