Podcast

From Notebooks to Production: Build Data Pipelines & Deploy ML (AWS, Kafka, Streaming)

S4E2

data engineering machine learning production tools

Original Episode

Use these links for the canonical episode and media sources.

Open the original DataTalks.Club podcast page
Watch on YouTube
Listen on Spotify
Listen on Apple Podcasts

Episode Overview

How do you move models out of notebooks and into reliable production data pipelines using AWS, Kafka, and streaming architectures? In this episode, Andreas Kretz — the “Plumber of Data Science” — walks through the practical steps engineers and data scientists need to productionize notebooks and deploy ML systems.

People

Use these links to connect the episode to guest notes.

Andreas Kretz

Chapter Summary

Use these checkpoints to decide whether to open the source transcript.

1:56 - Episode Introduction & Andreas Kretz, the “Plumber of Data Science”
3:19 - Guest Bio: Andreas’s path from software to big data and data engineering
5:43 - Market Trend: Why data engineering demand is rising
8:46 - Hiring Strategy: Hire a data scientist and engineer early
9:47 - Data Scientist Growth: From notebooks to production pipelines
12:03 - Operational Risk: Why using many tools breaks operations
13:25 - Data Pipeline Anatomy: Ingestion, buffer, processing, storage, visualization
15:11 - Ingestion Explained: Events, message queues (Kafka, Kinesis)
16:51 - Processing Modes: Streaming vs batch processing
18:14 - One-Person Feasibility: Tooling, cloud vs on-prem, and schema design
21:05 - Practical Stack for Scientists: Python, Docker, Flask/FastAPI for prototypes
22:36 - Processing Frameworks Overview: Spark, Flink, Lambda, Glue, Docker jobs
24:04 - Data Transformation: Role of SQL and dataframe processing
25:36 - AWS Example: Parquet on S3 and processing options
27:22 - Case Study: Car price prediction — data sources and architecture
31:33 - Inference Strategy: Live API calls versus precomputed predictions
34:16 - Productionizing Notebooks: Dockerized training and model storage on S3
35:46 - Scheduling Options: Airflow vs CloudWatch/Lambda vs simple schedulers
37:53 - Model Serving: SageMaker endpoints and cost trade-offs
40:01 - Orchestration Patterns: Message queues for job sequencing
41:06 - Start Simple: Iterate from Lambda/queues to Airflow/Kubernetes
43:05 - Learning DevOps: Pick tools, read docs, and practice by doing
45:31 - Tool Selection: Use docs and tutorials to validate choices
48:36 - Early-Career Skills: Python, SQL, basic networking; AWS and OSS basics
51:14 - Hadoop Today: Cloud replaces Hadoop for many teams, but Hadoop persists in legacy systems
52:21 - LearnDataEngineering Academy: Curriculum, capstones, and resources
54:52 - Hands-on Projects: Build an e-commerce pipeline; use Kaggle datasets
57:33 - Learning Advice: Avoid huge datasets; start small and iterate
58:56 - Convincing Stakeholders: Build a $0 proof-of-concept and quantify ROI
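
Pipeline Sketch

The pipeline anatomy listed at 13:25 (ingestion, buffer, processing, storage, visualization) can be illustrated with a small, standard-library-only sketch. This is not code from the episode. It assumes an in-memory queue as a stand-in for a message queue such as Kafka or Kinesis, an in-memory SQLite table as a stand-in for storage, and a summary query in place of a dashboard. The function names and the car-listing fields (`model`, `price`) are hypothetical, loosely inspired by the car price case study at 27:22.

```python
import json
import queue
import sqlite3


def ingest(events, buffer):
    """Ingestion: serialize incoming events and push them onto the buffer."""
    for event in events:
        buffer.put(json.dumps(event))


def process(buffer):
    """Processing: drain the buffer and yield (model, price) rows.

    Records without a price are dropped (hypothetical cleaning rule).
    """
    while not buffer.empty():
        record = json.loads(buffer.get())
        if record.get("price") is not None:
            yield record["model"], float(record["price"])


def store(rows, conn):
    """Storage: persist processed rows in a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS listings (model TEXT, price REAL)")
    conn.executemany("INSERT INTO listings VALUES (?, ?)", rows)
    conn.commit()


def report(conn):
    """Visualization stand-in: average price per model, sorted by model."""
    cursor = conn.execute(
        "SELECT model, AVG(price) FROM listings GROUP BY model ORDER BY model"
    )
    return cursor.fetchall()


def run_pipeline(events):
    """Wire the stages together: ingest -> buffer -> process -> store -> report."""
    buffer = queue.Queue()  # stand-in for a message queue (assumption)
    conn = sqlite3.connect(":memory:")  # stand-in for durable storage (assumption)
    ingest(events, buffer)
    store(process(buffer), conn)
    return report(conn)


if __name__ == "__main__":
    sample_events = [
        {"model": "golf", "price": 10000},
        {"model": "golf", "price": 12000},
        {"model": "polo", "price": None},  # dropped during processing
    ]
    print(run_pipeline(sample_events))
```

Keeping each stage as its own function mirrors the "start simple" advice at 41:06: any single stage could later be swapped for a managed service without changing the others.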