Data-Centric AI: Improve Label Quality & Edit Datasets to Boost Model Performance
Original Episode
Use these links for the canonical episode and media sources.
- Open the original DataTalks.Club podcast page
- Watch on YouTube
- Listen on Spotify
- Listen on Apple Podcasts
Episode Overview
How much can improving label quality and editing your dataset actually boost model performance? In this episode, Marysia Winkels walks through a practical, data-centric approach to that question. Marysia is Lead Data Scientist at GoDataDriven, holds a Master's in Artificial Intelligence with a focus on data-efficient deep learning, and co-organizes PyData Amsterdam and PyData Global.
Chapter Summary
Use these checkpoints to decide whether to open the source transcript.
- 1:26 - Podcast Introduction
- 2:03 - AI education & geometric deep learning in medical imaging
- 3:04 - Data science education and course development
- 4:51 - Building a community of practice and improving product maturity
- 5:24 - Data-Centric AI: shifting focus from Big Data to Good Data
- 5:54 - Model-centric vs data-centric approaches; challenges with unstructured data
- 10:28 - Transfer learning & fine-tuning: why label quality matters more now
- 13:45 - Data-centric competition case: fixed ResNet model with editable dataset
- 15:05 - Competition lessons: accessibility, strategy, and innovation award
- 17:44 - Strategic data augmentation vs brute-force data collection
- 18:46 - Mindset shift: treating datasets as editable artifacts
- 19:24 - Validation split adjustments and maintaining fair model comparisons
- 22:25 - Iterating on both data and model; prioritizing impactful data fixes
- 23:02 - Tooling spectrum: labeling, synthetic data, and data versioning
- 23:24 - Practical workflows: lightweight versioning and easy data edits
- 26:26 - Low-tech iteration: Google Sheets labeling plus automation scripts
- 27:55 - Targeted relabeling using model confidence and image embeddings
- 32:22 - Curated resources: Hazy Research and YData tool directories
- 33:16 - Iterative loop: baseline model, error analysis, and SME validation
- 35:24 - Beyond cleaning: representativeness, bias, and dataset completeness
- 36:14 - Detecting dataset gaps with embeddings and UMAP (penguin example)
- 39:46 - Defining real-world contexts: lighting, angles, and edge cases
- 41:47 - Acceptance criteria: deciding when dataset quality is sufficient
- 44:13 - Production feedback loops: collecting user feedback post-deployment
- 46:52 - Shadow mode rollout: passive deployment for safe feedback collection
- 49:09 - Scarce or low-quality data: feasibility, manual fixes, and limits
- 50:45 - Automating dataset repairs vs manual editing trade-offs
- 50:56 - PyData involvement: organizing meetups, tutorials, and global events
- 56:01 - PyData vs PyCon: data focus, language inclusivity, and NumFOCUS support
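
As a rough illustration of the 27:55 checkpoint (targeted relabeling using model confidence), here is a minimal sketch in plain Python. It is not the workflow described in the episode. It assumes you already have per-class predicted probabilities from some baseline model, and the function name `rank_for_relabeling` and the toy numbers are hypothetical. The idea is to review first the samples where the model gives the lowest probability to the label currently in the dataset.

```python
def rank_for_relabeling(pred_probs, given_labels, top_k=3):
    """Return indices of the samples whose current label the model trusts least.

    pred_probs:   list of per-class probability lists, one per sample (assumed input)
    given_labels: list of integer class indices currently stored in the dataset
    """
    # Self-confidence: probability the model assigns to the label we already have.
    self_conf = [probs[label] for probs, label in zip(pred_probs, given_labels)]
    # Lowest self-confidence first: these are the best candidates for manual review.
    order = sorted(range(len(given_labels)), key=lambda i: self_conf[i])
    return order[:top_k]


# Toy data (made up for illustration): four samples, two classes.
pred_probs = [
    [0.95, 0.05],  # labeled 0, model agrees
    [0.10, 0.90],  # labeled 0, model strongly disagrees
    [0.20, 0.80],  # labeled 1, model agrees
    [0.55, 0.45],  # labeled 1, model is unsure
]
given_labels = [0, 0, 1, 1]

suspects = rank_for_relabeling(pred_probs, given_labels, top_k=2)
print(suspects)
```

A low score does not prove a label is wrong. It only prioritizes which rows a person, ideally a subject-matter expert, should look at first.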