Building an Open-Source ML-Powered Identity Resolution Tool in the Modern Data Stack
Original Episode
Use these links for the canonical episode and media sources.
- Open the original DataTalks.Club podcast page
- Watch on YouTube
- Listen on Spotify
- Listen on Apple Podcasts
Episode Overview
How do you build an open-source, ML-powered identity resolution tool that becomes the single source of truth in a modern data stack? In this episode, Sonal Goyal, founder of Zingg and a 23-year data product veteran, walks through the practical challenges of identity resolution and entity resolution across industries such as investment banking, telecom, gaming, and insurance. Sonal explains why ML-powered approaches matter and how an open-source framework like Zingg can fit into your modern data stack.
Chapter Summary
Use these checkpoints to decide whether to open the source transcript.
- 0:00 - Podcast Introduction
- 1:11 - Guest Overview: Sonal Goyal and Zingg identity resolution
- 2:06 - Career Overview: 24 years in tech, data consulting background
- 2:58 - Origin Story: Consulting projects reveal recurring identity gaps
- 4:51 - Modern Data Stack: Centralized data exposing identity challenges
- 5:43 - Product Overview: Zingg — ML-powered identity resolution
- 7:14 - Terminology: Entity resolution vs identity resolution
- 7:52 - Duplicate Detection vs Deduplication: Outcomes and use cases
- 9:08 - Motivation: Recurring duplicate problems across domains
- 11:09 - Solution Generality: Customers, products, patients and suppliers
- 13:38 - Related Terms: Record linkage, entity matching, entity disambiguation
- 14:02 - Core Approach: ML training, blocking, indexing for scale
- 18:13 - Implementation: Spark distribution, Snowflake-native & Python API
- 20:41 - Interfaces & Integrations: CLI, Python SDK, Databricks, dbt, UI plans
- 21:51 - Founder Transition: From consultancy to full-time product build
- 23:00 - Development Timeline: Proof-of-concept to public release (~18 months)
- 24:14 - Open Source Strategy: Community, adoption, and business rationale
- 27:00 - Licensing Choice: AGPL to prevent SaaS rehosting and protect IP
- 31:10 - Open Source Trade-offs: IP concerns vs discoverability and growth
- 32:00 - Team Evolution: Solo founder, consultants, and initial hires
- 32:59 - Founder Role: Product, ecosystem integrations, community and hiring
- 35:14 - Team & Hiring: First developer hire and fully remote setup
- 37:21 - Scaling Challenge: Recruiting the right engineering talent
- 38:43 - Prevention Limits: Data governance won’t fully eliminate identity issues
- 40:36 - Beyond Joins: When fuzzy joins and basic ETL aren’t enough
- 44:25 - Deterministic Rules vs Probabilistic ML: Trade-offs for accuracy
- 45:50 - Fraud Use Cases: Identity resolution for AML and fraud detection
- 49:23 - Graph + ML: Pairwise matching, graph clustering and downstream use
- 50:20 - Data Mapping: Need to specify field correspondences for matching
- 51:39 - Impact Case Studies: Public-data donors, e-commerce and classifieds
- 54:11 - Retrospective: Seeking cofounder earlier and open-sourcing sooner
- 56:07 - Founder Advice: Validate use cases, distribution channels, and conviction
- 59:26 - Recommended Reading: Creative Selection on product design
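
Several checkpoints above name blocking, pairwise matching, and graph clustering (14:02 and 49:23). The sketch below shows one common way those three pieces fit together. It is not Zingg's implementation or API: a plain string-similarity threshold stands in for a trained ML model, and the sample records, `block_key`, `similarity`, and `resolve` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records, invented for illustration (not from the episode).
records = [
    {"id": 1, "name": "Jon Smith", "city": "Boston"},
    {"id": 2, "name": "John Smith", "city": "Boston"},
    {"id": 3, "name": "Jhon Smith", "city": "Boston"},
    {"id": 4, "name": "Jane Doe", "city": "Boston"},
    {"id": 5, "name": "John Smith", "city": "Denver"},
]


def block_key(record):
    # Blocking: only records that share this key are ever compared,
    # which avoids scoring every possible pair.
    return (record["city"].lower(), record["name"][0].lower())


def similarity(a, b):
    # Stand-in for a learned pairwise model: simple string similarity on names.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def resolve(records, threshold=0.85):
    # 1. Group records into blocks.
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)

    # 2. Union-find structure that treats matches as edges of a graph.
    parent = {record["id"]: record["id"] for record in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # 3. Pairwise matching inside each block; link pairs above the threshold.
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if similarity(a, b) >= threshold:
                parent[find(a["id"])] = find(b["id"])

    # 4. Connected components of the match graph become resolved entities.
    clusters = defaultdict(list)
    for record in records:
        clusters[find(record["id"])].append(record["id"])
    return sorted(sorted(ids) for ids in clusters.values())


clusters = resolve(records)
print(clusters)
```

With this toy data, records 1, 2, and 3 land in one cluster, while record 5 stays separate despite having the same name as record 2, because its city places it in a different block. That illustrates the trade-off blocking makes: fewer comparisons at the risk of missing matches when the blocking key is too strict.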