Wiki

Reproducibility

How DataTalks.Club podcast guests make data science, ML, research, and data pipeline work rerunnable, reviewable, and explainable.

Reproducibility is the ability to rerun, review, or explain a data result after the original work has moved on. Podcast guests use the word in several connected ways.

Researchers should be able to recreate a paper from code and data. Data teams should be able to rerun a pipeline against known inputs. ML platform teams should be able to name the code and data versions. They should also name the model artifact, environment, and approval path that produced a prediction.

Reproducibility is the operating practice behind several related topics. The model lifecycle connects it to MLOps, Experiment Tracking, and Model Registry. Pipeline delivery connects it to DataOps, Data Quality and Observability, and Practices.

These wiki pages cover adjacent lifecycle and reliability work:

Core podcast discussions:

Common Definition

Across the podcast discussions, reproducibility means a team can recover the path from input to result. The result may be a paper figure or analytics table. It may also be a model artifact, dashboard, or prediction. The common requirement is that someone other than the original author can look at the assumptions, rerun the work, or explain why the result changed.

Johanna Bayer gives the research definition in Teaching Open Science and Reproducible Research at 8:30-8:51. Her reproducible-paper example packages code with the manuscript so another person can start from the data and regenerate the paper. At 39:27-43:11, she turns that idea into engineering practice. She names project structure, environments, Git branches, and formatting. She also uses MLflow for model version control and preserves metadata or model parameters when sensitive clinical data can’t be shared.

Lars Albertsson gives the data-platform definition in DataOps 101 at 16:42-21:29. He argues that mutable warehouse-style tables make reruns unstable because the same ETL process can produce different results at different times. His answer is immutable raw data plus functional transformations, orchestrated as pipelines that create new datasets instead of overwriting old ones.

Raphael Hoogvliets connects the same principle to ML delivery in MLOps at Scale at 42:54-44:46. He says full reproducibility is hard, but tying code to data versions helps teams reverse-engineer what happened. He also makes maturity part of the definition. Smaller teams may not need full data versioning on day one, but work with regulation or customer-facing decisions may need it earlier.

Guest Differences

Guests differ most on scope and timing because some start from teaching, while others start from operations or platform risk.

Bayer and O’Brien start from teaching. Researchers and junior data scientists need Git, data management, and reproducible examples. They also need an end-to-end view of a project before they enter teams that depend on their work. O’Brien discusses this in DevRel for Data Science at 4:22-7:59 and 52:06-55:28.

Albertsson and Bergh start from operations. Albertsson emphasizes architectural immutability and raw-data history, while Bergh emphasizes delivery discipline. Teams put code, reports, and transformations in version control. They also run automated tests, use CI/CD, and run the whole system against test data when possible (Mastering DataOps at 33:47-44:12).

Hoogvliets and Simon Stiebellehner start from ML platforms. They focus on experiment tracking, model registries, metadata, and data references. They also focus on deployment records, package registries, and monitoring.

Stiebellehner is especially cautious about copying every training dataset into an experiment tracker. Large datasets can make that approach unworkable. Cost and GDPR deletion requests can do the same (Building Production ML Platforms at 42:48-46:32).

They don’t disagree about whether reproducibility matters. They disagree about where to spend the next unit of effort. A research lab may need tests and a reproducible paper template. A data platform may need immutable raw inputs and workflow orchestration.

An ML platform may need experiment tracking first. As model risk grows, it may also need data versioning, dependency management, and lineage.

Research and Teaching

The academic episodes frame reproducibility as a skill gap as much as a tooling gap. Bayer says academia faces a reproducibility crisis. People publish papers that others can’t recreate, especially in fields such as neuroimaging (Teaching Open Science and Reproducible Research at 8:30-8:51).

Her course sequence starts with Git and reproducible publications. It then moves into tests, open source contribution, packaging, and environments. It also covers requirements files.

O’Brien describes the same gap from data science education. In DevRel for Data Science at 4:22-7:59, she says labs often lack training for long-lived data management, collaboration, and complete reproduction of code. She joined Iterative because she wanted to work on DVC. She later moved into teaching so applied data science students could make educated choices about managing projects before they enter industry.

Mihail Eric adds the research-to-production bridge in From Research to Production at 11:01-12:49. Researchers use notebooks, benchmarks, and tools such as Weights & Biases to validate hypotheses. At 17:35-17:53, he contrasts that with the ML engineer’s responsibility for deployment, uptime, and monitoring. He also names Docker, cloud infrastructure, and web services. Reproducibility improves when researchers learn engineering fundamentals and engineers learn how to reproduce models and track experiments.

Data Pipelines

Pipeline reproducibility starts with stable inputs. Albertsson’s DataOps 101 discussion treats immutable raw data as the foundation. At 16:42-21:29, he argues that teams should transform immutable datasets into new datasets rather than mutate tables in place. At 30:34-35:57, he adds workflow orchestration. Pipelines need explicit dependencies so late data, transient failures, and bugs can be retried and repaired.

Bergh’s DataOps version starts from the delivery path. In Mastering DataOps at 33:47-34:37, he names version control, automated tests, and development tests. He also names deployment automation and error tracking as practical steps.

At 44:12-48:59, he pushes beyond unit tests. Data teams should run the system end to end against realistic test data. They should also keep tests close to the code while running checks in development and production.

Tomasz Hinc shows why infrastructure belongs in the same reproducibility conversation. In DataOps and GitOps for Data Teams at 23:42-26:21, he describes infrastructure as code with Terraform, Terragrunt, and Atlantis. Teams use merge requests, review, and dry runs. At 1:01:27-1:01:43, he gives a dependency example where an unspecified Python package version caused a Dockerized application to fail after it fetched a newer API. Pinning versions isn’t ceremony there because it prevents a future run from silently becoming a different run.

MLOps and Platforms

ML reproducibility extends the pipeline record with experiment and artifact history. Stiebellehner describes experiment tracking as an early win for collaboration in Building Production ML Platforms at 29:41-30:58. The tracked run sits near the model registry because a useful run may become an artifact that downstream systems consume.

At 42:48-43:28 in the same episode, Stiebellehner expands the record. A platform may need to store which job image ran, which inputs it consumed, which outputs it wrote, and how metadata connects across pipeline runs. The model registry alone isn’t enough if a team wants to reproduce a model result from three years ago. Code versions, data versions, metadata, and workflow design all matter.

Hoogvliets turns that into an adoption sequence. In MLOps at Scale at 39:06-42:54, he groups CI, repository structure, and parameterization with testing and experiment preservation. At 48:41-53:08, he recommends starting from a real pain point. Start with CI/CD when deployment takes months, and start with monitoring when models are opaque in production. Escalate missing version control immediately.

Teams should add experiment tracking and model registries when they remove a concrete delivery risk. The same applies to serving, monitoring, package registries, and containers.

Data Boundaries and Governance

Reproducibility can conflict with privacy, cost, and governance. Bayer’s neuroimaging example shows why this matters for research in Teaching Open Science and Reproducible Research. At 42:22-44:12, she says sensitive consortium data can’t simply be pushed to a repository. Model parameters and metadata may still be shareable.

Stiebellehner gives the platform version. In Building Production ML Platforms at 44:05-46:32, he distinguishes tools that log a query or metadata from tools that copy the full dataset artifact. Copying a 50 GB training dataset for every run can create cost problems. It can also create GDPR deletion problems because the team may need to find and remove one person’s data across many duplicated artifacts.

Albertsson makes a similar point about lakes. In DataOps 101 at 23:29-24:38 and 1:06:01-1:08:06, raw dumps and history help teams reproduce past states, but personal data requires separation and governance. Full database dumps preserve more history than mutable tables, yet they also require clear handling for GDPR and change capture.

Capture Set

The episodes support a practical capture set rather than a universal tool checklist.

Use the capture set only when it matches the risk. A tutorial project can use a small, fully bundled dataset. A bank model, clinical dataset, or customer-facing fraud decision may need stricter metadata and lineage. It may also need stricter approval, deletion, and audit paths.

These related pages cover the operating layers around reproducibility.