PK design for Spanner — the hotspot story

Spanner partitions data by primary-key range and distributes those ranges across servers. A PK that increases monotonically sends every new write to whatever server owns the highest range — a hotspot. Here is what works instead, and the 40-60% gains we measured on migrated workloads.

The default that breaks Spanner

Most relational databases handle a monotonically-increasing PK fine. Postgres serial, MySQL auto_increment, SQL Server identity — all work because the storage engine is local. The B-tree gets a new rightmost leaf and life goes on.

Spanner doesn’t have a single storage engine. The PK range [0, A) might live on server 1, [A, B) on server 2, [B, ∞) on server 3. A monotonic PK sends every write to server 3 until the range splits, then to whichever new server owns the new top end.

You can hit throughput limits well below what the cluster could actually do, just because every write hits one shard.

The four reasonable strategies

In rough order of how often they’re the right answer:

1. UUID v4 (random)

CREATE TABLE events (
  id STRING(36) NOT NULL,        -- UUID v4
  user_id STRING(36) NOT NULL,
  ...
) PRIMARY KEY (id);

Random distribution across the key space. Splits well. Spanner is happy.

The cost: UUIDs as the PK are 36 characters of text. Indexes get big. Comparison is slower than integer compare. For tables with extreme write volumes (10K+ TPS sustained), the per-row overhead adds up.

2. Hash-prefixed integer

CREATE TABLE events (
  shard INT64 NOT NULL,          -- hash(user_id) MOD 64
  id INT64 NOT NULL,             -- sequence
  ...
) PRIMARY KEY (shard, id);

You keep an integer PK for compactness but prefix it with a hash-derived shard column. The shard distributes writes across N ranges; within each shard the rows are still ordered.

Picking the shard count matters. Too few and you still hotspot; too many and your range scans hit every shard. We landed on 64 for most workloads we migrated.

3. Reverse the sequence

CREATE TABLE events (
  id INT64 NOT NULL,             -- reverse-bit-order timestamp
  ...
) PRIMARY KEY (id);

If you have a sequence source, reverse its bit order before storing. A monotonically-increasing source becomes a randomly-distributed stored value. Range scans on the original ordering become harder but you can keep a secondary index for that.

This is the right answer when you must keep a single integer PK but can’t tolerate UUIDs.

4. Composite tenant + sequence

CREATE TABLE events (
  tenant_id STRING(36) NOT NULL,
  id INT64 NOT NULL,             -- sequence within tenant
  ...
) PRIMARY KEY (tenant_id, id);

For multi-tenant workloads, the tenant column distributes writes across the cluster naturally as long as no single tenant dominates. Same-tenant queries stay efficient because all of a tenant’s rows co-locate.

Watch out for power-law tenant sizes. One huge tenant becomes a within-tenant hotspot. Mitigation: hash-prefix within the dominant tenants only.

The migration moment

When we ported a workload from Postgres to Spanner, the PK was usually one of:

Postgres serial (auto-increment integer)
Postgres gen_random_uuid() (UUID v4)
Composite of timestamp + something else

The first one is the worst for Spanner. The second is fine. The third depends.

The Spanner Migration Tool’s schema review step flags monotonic PKs and asks the DBA to choose. The choice is usually:

Add a hash-prefix shard column.
Reverse-bit the integer.
Migrate to UUID.
Accept the hotspot (only for low-write tables).

In our migrations the most-picked option was the composite hash- prefix. It preserved the application’s existing notion of a numeric ID, kept range scans cheap, and distributed writes.

The 40-60% number

We measured pre-migration vs post-migration write throughput on production-shaped workloads. The improvement came from three sources:

PK redesign (the big one): hash-prefixed integers, no hotspot, 30-40% of the gain.
Interleaving children into parents: row co-location for join-heavy queries, 10-15%.
Index pruning: Spanner doesn’t like many secondary indexes; pruning the ones added in Postgres for legacy reasons recovered the remaining 5-10%.

Numbers were workload-dependent. On a write-heavy ledger workload we saw closer to 60%. On a read-heavy reporting workload the PK change mattered less and we landed at 25-30%.

What to do tomorrow

If you’re about to migrate to Spanner:

Audit your source PKs. Anything monotonic gets flagged.
For each flagged table, pick a strategy. Hash-prefix is the safest default for integer PKs.
Test write throughput on a representative subset before committing. Spanner’s behaviour at low row counts doesn’t predict its behaviour at scale.
Plan the interleaving decisions at the same time. The Migration Tool surfaces these in the schema review UI.

If you’re already on Spanner and seeing throughput plateau:

Look at the per-server CPU. If one is hot and the rest are idle, you have a hotspot.
The hot server tells you which PK range is taking the writes.
Schema change to redistribute. Spanner makes this less painful than most databases but it’s still a migration.

The reading list

The Spanner whitepaper (the original OSDI 2012 paper)
The Cloud Spanner schema design guide (Google docs)
spanner/ddl/recommendations/pk.go in the Migration Tool — the heuristic for what to flag

The reading is a few hours. The savings if you get this right are measured in shards never bought, throughput never throttled, and on-call pages never fired.