Data-Centric AI: Improve Label Quality & Edit Datasets to Boost Model Performance

S12E3

machine learning data science MLOps tools data governance

Original Episode

Use these links for the canonical episode and media sources.

Open the original DataTalks.Club podcast page
Watch on YouTube
Listen on Spotify
Listen on Apple Podcasts

Episode Overview

How much can improving label quality and editing your dataset actually boost model performance? In this episode, Marysia Winkels, Lead Data Scientist at GoDataDriven and co-organizer of PyData Amsterdam/Global, walks through a practical, data-centric approach to that question. She holds a Master’s in Artificial Intelligence and focuses on data-efficient deep learning.

People

Use these links to connect the episode to guest notes.

Marysia Winkels

Chapter Summary

Use these checkpoints to decide whether to open the source transcript.

1:26 - Podcast introduction
2:03 - AI education & geometric deep learning in medical imaging
3:04 - Data science education and course development
4:51 - Building a community of practice and improving product maturity
5:24 - Data-Centric AI: shifting focus from Big Data to Good Data
5:54 - Model-centric vs data-centric approaches; challenges with unstructured data
10:28 - Transfer learning & fine-tuning: why label quality matters more now
13:45 - Data-centric competition case: fixed ResNet model with editable dataset
15:05 - Competition lessons: accessibility, strategy, and innovation award
17:44 - Strategic data augmentation vs brute-force data collection
18:46 - Mindset shift: treating datasets as editable artifacts
19:24 - Validation split adjustments and maintaining fair model comparisons
22:25 - Iterating on both data and model; prioritizing impactful data fixes
23:02 - Tooling spectrum: labeling, synthetic data, and data versioning
23:24 - Practical workflows: lightweight versioning and easy data edits
26:26 - Low-tech iteration: Google Sheets labeling plus automation scripts
27:55 - Targeted relabeling using model confidence and image embeddings
32:22 - Curated resources: Hazy Research and YData tool directories
33:16 - Iterative loop: baseline model, error analysis, and SME validation
35:24 - Beyond cleaning: representativeness, bias, and dataset completeness
36:14 - Detecting dataset gaps with embeddings and UMAP (penguin example)
39:46 - Defining real-world contexts: lighting, angles, and edge cases
41:47 - Acceptance criteria: deciding when dataset quality is sufficient
44:13 - Production feedback loops: collecting user feedback post-deployment
46:52 - Shadow mode rollout: passive deployment for safe feedback collection
49:09 - Scarce or low-quality data: feasibility, manual fixes, and limits
50:45 - Automating dataset repairs vs manual editing trade-offs
50:56 - PyData involvement: organizing meetups, tutorials, and global events
56:01 - PyData vs PyCon: data focus, language inclusivity, and NumFOCUS support