The problem
You have a CRM table with 1.2 million contacts. Some of them are the same person under different spellings, addresses, or company affiliations. You want to merge them — but only when you’re sure.
Traditional approaches: fuzzy match on name + address (catches the easy cases, misses the renames). MDM tools (expensive, opinionated). Custom ML pipeline (months of work).
BigQuery Knowledge Graph: a SQL query.
The shape
BigQuery exposes a KNOWLEDGE_GRAPH function (and underlying tables) that lets you:
- Look up canonical entity IDs from text.
- Get the relationships between entities.
- Score similarity between entity references.
For entity resolution, the workflow is:
WITH contact_entities AS (
SELECT
contact_id,
name,
ML.KNOWLEDGE_GRAPH_LOOKUP(STRUCT(name AS query)).entity_id AS kg_entity_id
FROM crm.contacts
)
SELECT
ARRAY_AGG(contact_id) AS likely_duplicates,
kg_entity_id
FROM contact_entities
WHERE kg_entity_id IS NOT NULL
GROUP BY kg_entity_id
HAVING COUNT(*) > 1;
Rows that resolve to the same Knowledge Graph entity are likely the same person. The KG knows that “Bill Kennedy” the Go educator is a different entity than “Bill Kennedy” the politician; the lookup uses contextual signals you don’t have to engineer.
Why this is cheap
The Knowledge Graph is built-in. No per-row API call cost; the lookup is part of BigQuery’s billing (bytes scanned). For 1.2M rows the entity-resolution query ran in under a minute and cost a few dollars.
The same workload via a third-party MDM tool would have cost 5-figures per month and taken weeks to set up.
What it can’t do
- Resolve to private entities. The KG knows about public people, companies, places. It doesn’t know about your internal-only entities (an internal project code, a customer-specific term). For those you still need custom resolution.
- Handle data quality issues. “B Kennedy” vs “Bill Kennedy” vs “William Kennedy” — the KG handles the synonymy; “Bil Kenedy” (typos) needs fuzzy match first.
The pattern that worked best: fuzzy match for typo normalisation → KG lookup for canonical resolution → manual review of the borderline cases.
Production result
For one Searce client (CRM consolidation across acquired companies), this pipeline collapsed three months of consulting work into a one-week engagement. Identified ~80K likely duplicates out of 1.2M contacts with ~96% precision. The remaining 4% went into a manual-review queue.
Cost: about ~$200 of BigQuery scan billing for the resolution pass. Compare to the ~$25K an MDM vendor was quoting.
For any team running customer data in BigQuery, this is a tool worth knowing.