The Spanner Migration Tool — a contributor’s reading map
Google’s open-source Spanner Migration Tool (formerly HarbourBridge) is a substantial Go codebase. Here is the reading path that worked for me when I started contributing, and the mental model that makes the rest of the repo navigable.
The three engines
The tool is conceptually three engines bolted together:
┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│ schema engine    │ → │ data engine      │ → │ verification     │
│                  │   │                  │   │ engine           │
│ - parse source   │   │ - dump → COPY    │   │ - row counts     │
│ - emit Spanner   │   │ - or CDC stream  │   │ - sample diffs   │
│   DDL            │   │   via Datastream │   │ - schema drift   │
│ - schema review  │   │   + Pub/Sub +    │   │   detection      │
│   workflow       │   │   Dataflow       │   │                  │
└──────────────────┘   └──────────────────┘   └──────────────────┘
Most of the surface area lives in the schema engine. Most of the performance work lives in the data engine. Most of the support tickets land in verification.
Where to start reading
sources/ is where the per-source parsers live: Postgres, MySQL,
DynamoDB, MS SQL, Oracle. Each one implements a small interface
that emits an intermediate representation. The IR is the contract
between the parsers and the rest of the tool.
spanner/ is where the Spanner DDL emission lives. It takes the
IR, applies the schema-review decisions, and produces the CREATE
TABLE statements.
cmd/ has the CLI entry points. webv2/ has the React + Go UI
for the Intelligent Schema Assistant.
Read in this order:
- internal/internal.go: the IR types. Everything that crosses the parser ↔ emitter boundary is one of these.
- One of sources/postgres/*.go, end to end. Postgres is the best-maintained source; you’ll see the pattern for the others.
- spanner/ddl/ast.go and the emitter. This is where the Spanner-specific decisions land.
- webv2/api/ for the JSON contract between the React UI and the Go backend.
The IR is the spine
If you only read one file, read internal/internal.go. The IR
encodes:
- Tables, columns, types
- Primary keys, foreign keys, indexes
- Interleaving decisions
- Per-column comments (used to carry source metadata through the migration)
Every parser produces this; every emitter consumes it. The schema review step (manual review by a DBA in the UI) edits it in place. The data migration step reads the final IR to know how to map rows.
If you’re adding a new source database, your job is: take the source schema, produce a valid IR. If you’re improving Spanner emission, your job is: take the IR, produce better DDL.
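To make the contract concrete, here is a minimal sketch of an IR sitting between a parser and an emitter. Every name in it (Column, Table, Schema, SourceParser, EmitDDL, stubParser) is hypothetical and invented for illustration; the real types in internal/internal.go carry far more detail and will differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Column, Table and Schema form a hypothetical, much-simplified IR.
type Column struct {
	Name    string
	Type    string // Spanner type, already mapped from the source type
	NotNull bool
	Comment string // source metadata carried through the migration
}

type Table struct {
	Name       string
	Columns    []Column
	PrimaryKey []string
}

type Schema struct {
	Tables []Table
}

// SourceParser is the small interface a source would implement (hypothetical).
type SourceParser interface {
	Parse(sourceDDL string) (Schema, error)
}

// EmitDDL consumes the IR and returns one CREATE TABLE statement per table.
func EmitDDL(s Schema) []string {
	var out []string
	for _, t := range s.Tables {
		cols := make([]string, 0, len(t.Columns))
		for _, c := range t.Columns {
			def := c.Name + " " + c.Type
			if c.NotNull {
				def += " NOT NULL"
			}
			cols = append(cols, def)
		}
		out = append(out, fmt.Sprintf("CREATE TABLE %s (%s) PRIMARY KEY (%s)",
			t.Name, strings.Join(cols, ", "), strings.Join(t.PrimaryKey, ", ")))
	}
	return out
}

// stubParser stands in for a real parser; it ignores its input.
type stubParser struct{}

func (stubParser) Parse(string) (Schema, error) {
	return Schema{Tables: []Table{{
		Name: "users",
		Columns: []Column{
			{Name: "id", Type: "INT64", NotNull: true, Comment: "from bigint"},
			{Name: "email", Type: "STRING(MAX)"},
		},
		PrimaryKey: []string{"id"},
	}}}, nil
}

func main() {
	var p SourceParser = stubParser{}
	schema, err := p.Parse("CREATE TABLE users (...)")
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, stmt := range EmitDDL(schema) {
		fmt.Println(stmt)
	}
}
```

The point of the shape is that a new source only has to satisfy the parser side, and an emitter change never needs to know which source produced the schema.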
The parts that look simple
Type mapping. Postgres numeric(38, 9) → Spanner NUMERIC. Sounds
trivial. Then you discover that Postgres’ numeric has unbounded
precision unless declared, that some sources use it as a money type
with implicit scale, and that the choice affects index efficiency
on the Spanner side. The type-mapping code is short. The decisions
behind it took weeks per source.
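A sketch of why the mapping is short but the decision is not. Spanner's GoogleSQL NUMERIC has a fixed precision of 38 and scale of 9, so at most 29 digits before the decimal point. The PgNumeric type, the MapNumeric function, and the fallback to STRING(MAX) are my own illustrative assumptions, not the tool's actual policy.

```go
package main

import "fmt"

// PgNumeric describes a Postgres numeric column. Declared is false for a
// bare "numeric" with no precision or scale, which Postgres leaves unconstrained.
type PgNumeric struct {
	Declared  bool
	Precision int
	Scale     int
}

const (
	spannerNumericScale     = 9  // digits after the decimal point
	spannerNumericIntDigits = 29 // digits before the decimal point
)

// MapNumeric returns a Spanner type plus a note for the schema review.
// This is one possible policy (an assumption), not the tool's real mapping.
func MapNumeric(n PgNumeric) (spannerType, note string) {
	switch {
	case !n.Declared:
		return "NUMERIC", "unconstrained source numeric: values may not fit, review"
	case n.Scale > spannerNumericScale || n.Precision-n.Scale > spannerNumericIntDigits:
		return "STRING(MAX)", "exceeds Spanner NUMERIC range: stored as string"
	default:
		return "NUMERIC", ""
	}
}

func main() {
	cols := []PgNumeric{
		{Declared: true, Precision: 38, Scale: 9},
		{Declared: true, Precision: 40, Scale: 12},
		{Declared: false},
	}
	for _, c := range cols {
		t, note := MapNumeric(c)
		fmt.Printf("%+v -> %s %q\n", c, t, note)
	}
}
```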
Primary key inference. When the source table has no PK (Postgres lets you), the tool synthesises one. The synthesis has to be stable across reruns or the data migration breaks. The current heuristic walks unique indexes first, then composite candidates; understanding why it’s in that order is worth two hours.
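The stability requirement is easier to see in code. This is a minimal sketch of one way to make the choice deterministic, assuming a simplified Index type and a hypothetical fallback column named synth_id. It is my reading of "unique indexes first, then composite candidates", not the tool's actual heuristic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Index is a simplified view of a source index (hypothetical type).
type Index struct {
	Name    string
	Columns []string
	Unique  bool
}

// InferPrimaryKey picks key columns for a table with no declared PK.
// Unique indexes with fewer columns win, and ties break on index name, so
// the result does not depend on the order the source lists its indexes in.
// With no unique index it falls back to a synthetic column.
func InferPrimaryKey(indexes []Index) (cols []string, synthetic bool) {
	var candidates []Index
	for _, ix := range indexes {
		if ix.Unique && len(ix.Columns) > 0 {
			candidates = append(candidates, ix)
		}
	}
	if len(candidates) == 0 {
		return []string{"synth_id"}, true // hypothetical synthetic key name
	}
	sort.Slice(candidates, func(i, j int) bool {
		if len(candidates[i].Columns) != len(candidates[j].Columns) {
			return len(candidates[i].Columns) < len(candidates[j].Columns)
		}
		return candidates[i].Name < candidates[j].Name
	})
	return candidates[0].Columns, false
}

func main() {
	indexes := []Index{
		{Name: "orders_region_ref", Columns: []string{"region", "ref"}, Unique: true},
		{Name: "orders_created", Columns: []string{"created_at"}},
		{Name: "orders_ext_id", Columns: []string{"external_id"}, Unique: true},
	}
	cols, synthetic := InferPrimaryKey(indexes)
	fmt.Printf("PRIMARY KEY (%s), synthetic=%v\n", strings.Join(cols, ", "), synthetic)
}
```

Sorting a copy with a total order is what buys the stability: a rerun that sees the same indexes in a different order still lands on the same key.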
Foreign keys. Spanner supports them now (it didn’t always). The
emitter still has the option to drop FKs and re-add them after the
data load. That option is the right default for big migrations and
the wrong default for small ones. Read spanner/ddl/foreignkey.go
and the tests around it before changing behaviour.
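For orientation, deferring foreign keys amounts to splitting the DDL into a before-load and an after-load batch. The sketch below assumes a simplified ForeignKey type and hypothetical helper names (AddFKStatement, PlanDDL); the ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY form is standard Spanner DDL, but the real emitter's structure will differ.

```go
package main

import (
	"fmt"
	"strings"
)

// ForeignKey is a simplified foreign-key description (hypothetical type).
type ForeignKey struct {
	Table      string
	Name       string
	Columns    []string
	RefTable   string
	RefColumns []string
}

// AddFKStatement renders the Spanner DDL that adds one foreign key
// to an existing table.
func AddFKStatement(fk ForeignKey) string {
	return fmt.Sprintf("ALTER TABLE %s ADD CONSTRAINT %s FOREIGN KEY (%s) REFERENCES %s (%s)",
		fk.Table, fk.Name, strings.Join(fk.Columns, ", "),
		fk.RefTable, strings.Join(fk.RefColumns, ", "))
}

// PlanDDL splits statements into those run before the data load and those
// run after it. With deferFKs set, constraints are added only after the load.
func PlanDDL(createTables []string, fks []ForeignKey, deferFKs bool) (before, after []string) {
	before = append(before, createTables...)
	for _, fk := range fks {
		if deferFKs {
			after = append(after, AddFKStatement(fk))
		} else {
			before = append(before, AddFKStatement(fk))
		}
	}
	return before, after
}

func main() {
	creates := []string{
		"CREATE TABLE Customers (CustomerId INT64 NOT NULL) PRIMARY KEY (CustomerId)",
		"CREATE TABLE Orders (OrderId INT64 NOT NULL, CustomerId INT64) PRIMARY KEY (OrderId)",
	}
	fks := []ForeignKey{{
		Table: "Orders", Name: "FK_Orders_Customers",
		Columns: []string{"CustomerId"}, RefTable: "Customers", RefColumns: []string{"CustomerId"},
	}}
	before, after := PlanDDL(creates, fks, true)
	fmt.Println("before load:", len(before), "statements")
	for _, s := range after {
		fmt.Println("after load:", s)
	}
}
```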
The parts that don’t look simple, and aren’t
Interleaving. When to interleave a child table into a parent for
locality is a judgment call. The tool exposes the choice via the
schema review UI; the heuristic for the recommendation is in
schema/recommendations/interleave.go. The recommendation accounts
for child table size, expected join patterns, and the PK shape.
Getting this right is one of the largest sources of post-migration
performance wins (40-60% on the engagements I worked on).
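Whatever the heuristic recommends, Spanner imposes a hard precondition: the child's primary key must begin with the parent's primary key columns, in the same order. The sketch below checks only that precondition and renders the clause; it does not model table size or join patterns. CanInterleave and InterleavedCreate are hypothetical names, and ON DELETE CASCADE is one of the two delete options, chosen here purely for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// CanInterleave reports whether childPK begins with every column of
// parentPK, in the same order, which Spanner requires for interleaving.
func CanInterleave(parentPK, childPK []string) bool {
	if len(parentPK) == 0 || len(childPK) < len(parentPK) {
		return false
	}
	for i, col := range parentPK {
		if childPK[i] != col {
			return false
		}
	}
	return true
}

// InterleavedCreate renders a CREATE TABLE statement with an
// INTERLEAVE IN PARENT clause. columnDefs is passed through verbatim.
func InterleavedCreate(child, columnDefs string, childPK []string, parent string) string {
	return fmt.Sprintf("CREATE TABLE %s (%s) PRIMARY KEY (%s), INTERLEAVE IN PARENT %s ON DELETE CASCADE",
		child, columnDefs, strings.Join(childPK, ", "), parent)
}

func main() {
	parentPK := []string{"SingerId"}
	childPK := []string{"SingerId", "AlbumId"}
	if CanInterleave(parentPK, childPK) {
		fmt.Println(InterleavedCreate("Albums",
			"SingerId INT64 NOT NULL, AlbumId INT64 NOT NULL", childPK, "Singers"))
	}
	// A child keyed only by its own id cannot be interleaved as-is.
	fmt.Println(CanInterleave(parentPK, []string{"AlbumId"}))
}
```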
Bulk vs CDC migration mode. Bulk is faster for the initial cut;
CDC is necessary for minimal-downtime migrations. The handoff
between the two has subtle ordering requirements: CDC capture has
to be running before the bulk snapshot is taken, not merely before
the bulk load completes, so that every change made during the bulk
load is captured. That handoff lives across several packages; the
comments in streaming/streaming.go are the best documentation.
Contributing tips
- Run the tests against a real Spanner emulator. The CI does; your local dev loop should too. Most parser bugs only show up on Spanner-side validation.
- Write the failing test first. Most issues in the tracker are reproducible from a small schema snippet. Capture the snippet, open a PR with a failing test, then fix.
- The CHANGELOG matters. This tool is used by enterprise customers; their migration scripts pin to versions. Read the CHANGELOG before changing any default behaviour.
- Don’t change the JSON contract between the React UI and the Go backend casually. The UI version and the CLI version drift; a backwards-incompatible change breaks customers mid-migration.
What contributing looked like
The PRs I shipped clustered around three themes:
- Schema redesign for write hotspots. PK design that avoided monotonically-increasing keys; this improved post-migration write throughput by 40-60% on the workloads we tested.
- CDC pipeline reliability. The Datastream + Pub/Sub + Dataflow path can deliver events out of order; I added a dead-letter queue (DLQ) and a reordering buffer in the consumer.
- Backend APIs for the Intelligent Schema Assistant. The tool’s UI exposes a guided workflow; the JSON contract on the backend had grown organically and needed normalisation.
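To illustrate the second theme, here is a minimal sketch of a reordering buffer with a dead-letter queue. It assumes each event carries a gap-free sequence number starting at 1, which is my simplification and not a statement about how Datastream orders events. The Event and Reorderer types are hypothetical and this is not the code that shipped.

```go
package main

import "fmt"

// Event is a change event with a sequence number (assumed gap-free from 1).
type Event struct {
	Seq     uint64
	Payload string
}

// Reorderer releases events in sequence order and buffers early arrivals.
// Stale, duplicate, or overflow events are parked in DLQ instead of applied.
type Reorderer struct {
	next    uint64
	limit   int
	pending map[uint64]Event
	DLQ     []Event
}

func NewReorderer(limit int) *Reorderer {
	return &Reorderer{next: 1, limit: limit, pending: make(map[uint64]Event)}
}

// Push accepts one event and returns the events that are now ready, in order.
func (r *Reorderer) Push(e Event) []Event {
	_, dup := r.pending[e.Seq]
	full := e.Seq != r.next && len(r.pending) >= r.limit
	if e.Seq < r.next || dup || full {
		r.DLQ = append(r.DLQ, e)
		return nil
	}
	r.pending[e.Seq] = e
	var ready []Event
	for {
		ev, ok := r.pending[r.next]
		if !ok {
			break
		}
		delete(r.pending, r.next)
		ready = append(ready, ev)
		r.next++
	}
	return ready
}

func main() {
	r := NewReorderer(8)
	for _, e := range []Event{{3, "c"}, {1, "a"}, {2, "b"}, {2, "b-again"}} {
		for _, out := range r.Push(e) {
			fmt.Println("apply", out.Seq, out.Payload)
		}
	}
	fmt.Println("dead-lettered:", len(r.DLQ))
}
```

The buffer limit matters in practice: without it, one missing event would let the pending map grow without bound.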
If you’re considering contributing: pick a real source database
you’ve worked with, find an issue tagged good first issue against
it, and start there. The maintainers are responsive and the bar
for first PRs is reasonable.
What I learned
Reading a substantial Go codebase end-to-end is the single best way to improve as a Go engineer. The Spanner Migration Tool repays the reading because it solves a real problem (database migrations are expensive, and this tool makes them dramatically cheaper) and because the patterns are general (an IR-based transformation pipeline shows up everywhere). I grew more as a Go programmer over a year of contributing than from any other Go work I did in the same period.