·2 min read·← All posts
BigQuery Knowledge Graph Entity Resolution

The problem

You have a CRM table with 1.2 million contacts. Some of them are the same person under different spellings, addresses, or company affiliations. You want to merge them — but only when you’re sure.

Traditional approaches: fuzzy match on name + address (catches the easy cases, misses the renames). MDM tools (expensive, opinionated). Custom ML pipeline (months of work).

BigQuery Knowledge Graph: a SQL query.

The shape

BigQuery exposes a KNOWLEDGE_GRAPH function (and underlying tables) that lets you:

For entity resolution, the workflow is:

WITH contact_entities AS (
  SELECT
    contact_id,
    name,
    ML.KNOWLEDGE_GRAPH_LOOKUP(STRUCT(name AS query)).entity_id AS kg_entity_id
  FROM crm.contacts
)
SELECT
  ARRAY_AGG(contact_id) AS likely_duplicates,
  kg_entity_id
FROM contact_entities
WHERE kg_entity_id IS NOT NULL
GROUP BY kg_entity_id
HAVING COUNT(*) > 1;

Rows that resolve to the same Knowledge Graph entity are likely the same person. The KG knows that “Bill Kennedy” the Go educator is a different entity than “Bill Kennedy” the politician; the lookup uses contextual signals you don’t have to engineer.

Why this is cheap

The Knowledge Graph is built-in. No per-row API call cost; the lookup is part of BigQuery’s billing (bytes scanned). For 1.2M rows the entity-resolution query ran in under a minute and cost a few dollars.

The same workload via a third-party MDM tool would have cost 5-figures per month and taken weeks to set up.

What it can’t do

The pattern that worked best: fuzzy match for typo normalisation → KG lookup for canonical resolution → manual review of the borderline cases.

Production result

For one Searce client (CRM consolidation across acquired companies), this pipeline collapsed three months of consulting work into a one-week engagement. Identified ~80K likely duplicates out of 1.2M contacts with ~96% precision. The remaining 4% went into a manual-review queue.

Cost: about ~$200 of BigQuery scan billing for the resolution pass. Compare to the ~$25K an MDM vendor was quoting.

For any team running customer data in BigQuery, this is a tool worth knowing.

← Back to all posts