BigQuery Knowledge Graph for entity resolution at scale

The problem

You have a CRM table with 1.2 million contacts. Some of them are the same person under different spellings, addresses, or company affiliations. You want to merge them — but only when you’re sure.

Traditional approaches: fuzzy match on name + address (catches the easy cases, misses the renames). MDM tools (expensive, opinionated). Custom ML pipeline (months of work).

BigQuery Knowledge Graph: a SQL query.

The shape

BigQuery exposes a KNOWLEDGE_GRAPH function (and underlying tables) that lets you:

Look up canonical entity IDs from text.
Get the relationships between entities.
Score similarity between entity references.

For entity resolution, the workflow is:

WITH contact_entities AS (
  SELECT
    contact_id,
    name,
    ML.KNOWLEDGE_GRAPH_LOOKUP(STRUCT(name AS query)).entity_id AS kg_entity_id
  FROM crm.contacts
)
SELECT
  ARRAY_AGG(contact_id) AS likely_duplicates,
  kg_entity_id
FROM contact_entities
WHERE kg_entity_id IS NOT NULL
GROUP BY kg_entity_id
HAVING COUNT(*) > 1;

Rows that resolve to the same Knowledge Graph entity are likely the same person. The KG knows that “Bill Kennedy” the Go educator is a different entity than “Bill Kennedy” the politician; the lookup uses contextual signals you don’t have to engineer.

Why this is cheap

The Knowledge Graph is built-in. No per-row API call cost; the lookup is part of BigQuery’s billing (bytes scanned). For 1.2M rows the entity-resolution query ran in under a minute and cost a few dollars.

The same workload via a third-party MDM tool would have cost 5-figures per month and taken weeks to set up.

What it can’t do

Resolve to private entities. The KG knows about public people, companies, places. It doesn’t know about your internal-only entities (an internal project code, a customer-specific term). For those you still need custom resolution.
Handle data quality issues. “B Kennedy” vs “Bill Kennedy” vs “William Kennedy” — the KG handles the synonymy; “Bil Kenedy” (typos) needs fuzzy match first.

The pattern that worked best: fuzzy match for typo normalisation → KG lookup for canonical resolution → manual review of the borderline cases.

Production result

For one Searce client (CRM consolidation across acquired companies), this pipeline collapsed three months of consulting work into a one-week engagement. Identified ~80K likely duplicates out of 1.2M contacts with ~96% precision. The remaining 4% went into a manual-review queue.

Cost: about ~$200 of BigQuery scan billing for the resolution pass. Compare to the ~$25K an MDM vendor was quoting.

For any team running customer data in BigQuery, this is a tool worth knowing.