From Notebooks to Production: Build Data Pipelines & Deploy ML (AWS, Kafka, Streaming)
Original Episode
Use these links for the canonical episode and media sources.
- Open the original DataTalks.Club podcast page
- Watch on YouTube
- Listen on Spotify
- Listen on Apple Podcasts
Episode Overview
How do you move models out of notebooks and into reliable production data pipelines using AWS, Kafka, and streaming architectures? In this episode, Andreas Kretz — the “Plumber of Data Science” — walks through the practical steps engineers and data scientists need to productionize notebooks and deploy ML systems.
Chapter Summary
Use these checkpoints to decide whether to open the source transcript.
- 1:56 - Episode Introduction & Andreas Kretz, the “Plumber of Data Science”
- 3:19 - Guest Bio: Andreas’s path from software to big data and data engineering
- 5:43 - Market Trend: Why data engineering demand is rising
- 8:46 - Hiring Strategy: Hire a data scientist and engineer early
- 9:47 - Data Scientist Growth: From notebooks to production pipelines
- 12:03 - Operational Risk: Why using many tools breaks operations
- 13:25 - Data Pipeline Anatomy: Ingestion, buffer, processing, storage, visualization
- 15:11 - Ingestion Explained: Events, message queues (Kafka, Kinesis)
- 16:51 - Processing Modes: Streaming vs. batch processing
- 18:14 - One-Person Feasibility: Tooling, cloud vs. on-prem, and schema design
- 21:05 - Practical Stack for Scientists: Python, Docker, Flask/FastAPI for prototypes
- 22:36 - Processing Frameworks Overview: Spark, Flink, Lambda, Glue, Docker jobs
- 24:04 - Data Transformation: Role of SQL and dataframe processing
- 25:36 - AWS Example: Parquet on S3 and processing options
- 27:22 - Case Study: Car price prediction — data sources and architecture
- 31:33 - Inference Strategy: Live API calls versus precomputed predictions
- 34:16 - Productionizing Notebooks: Dockerized training and model storage on S3
- 35:46 - Scheduling Options: Airflow vs. CloudWatch/Lambda vs. simple schedulers
- 37:53 - Model Serving: SageMaker endpoints and cost trade-offs
- 40:01 - Orchestration Patterns: Message queues for job sequencing
- 41:06 - Start Simple: Iterate from Lambda/queues to Airflow/Kubernetes
- 43:05 - Learning DevOps: Pick tools, read docs, and practice by doing
- 45:31 - Tool Selection: Use docs and tutorials to validate choices
- 48:36 - Early-Career Skills: Python, SQL, basic networking; AWS and OSS basics
- 51:14 - Hadoop Today: Cloud replaces Hadoop for many teams, but Hadoop persists in legacy systems
- 52:21 - LearnDataEngineering Academy: Curriculum, capstones, and resources
- 54:52 - Hands-on Projects: Build an e-commerce pipeline; use Kaggle datasets
- 57:33 - Learning Advice: Avoid huge datasets; start small and iterate
- 58:56 - Convincing Stakeholders: Build a $0 proof-of-concept and quantify ROI
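
The pipeline anatomy listed at 13:25 (ingestion, buffer, processing, storage) can be illustrated with a minimal Python sketch. This is not code from the episode. It assumes an in-memory queue standing in for a message queue such as Kafka or Kinesis, a plain dictionary standing in for storage such as S3, and hypothetical event fields (`car_id`, `price`) loosely inspired by the car price case study.

```python
import json
import queue
from collections import defaultdict


def ingest(raw_events, buffer):
    """Ingestion: parse incoming JSON events and push valid ones onto the buffer."""
    for raw in raw_events:
        event = json.loads(raw)
        # Hypothetical schema check: keep only events with the fields we need.
        if "car_id" in event and "price" in event:
            buffer.put(event)


def process(buffer):
    """Processing (batch style): drain the buffer and average the price per car."""
    totals = defaultdict(lambda: [0.0, 0])  # car_id -> [price sum, event count]
    while not buffer.empty():
        event = buffer.get()
        totals[event["car_id"]][0] += event["price"]
        totals[event["car_id"]][1] += 1
    return {car: total / count for car, (total, count) in totals.items()}


def store(results, storage):
    """Storage: persist results (a dict here; an object store in a real setup)."""
    storage.update(results)


def run_pipeline(raw_events):
    buffer = queue.Queue()  # assumption: stands in for Kafka or Kinesis
    storage = {}            # assumption: stands in for S3 or a database
    ingest(raw_events, buffer)
    store(process(buffer), storage)
    return storage


sample = [
    '{"car_id": "a", "price": 10000}',
    '{"car_id": "a", "price": 12000}',
    '{"car_id": "b", "price": 8000}',
    '{"broken": true}',
]
print(run_pipeline(sample))
```

Keeping each stage behind its own function mirrors the "start simple" advice at 41:06: the queue or the storage layer can later be swapped for a managed service without rewriting the processing logic.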