Podcast

DataOps 101 for Scaling Data Platforms: Immutable Pipelines, Self-Service Lakehouse & Reproducibility

S2E11

DataOps data engineering MLOps

Original Episode

Use these links for the canonical episode and media sources.

Open the original DataTalks.Club podcast page
Watch on YouTube
Listen on Spotify
Listen on Apple Podcasts

Episode Overview

How do you scale a data platform that supports self-service analytics while keeping pipelines reproducible and maintainable? In this episode, Lars Albertsson, founder of Scling and former Google, Spotify and Schibsted engineer, walks through pragmatic DataOps principles for building scalable data platforms.

People

Use these links to connect the episode to guest notes.

Lars Albertsson

Chapter Summary

Use these checkpoints to decide whether to open the source transcript.

2:39 - Episode Opening & Guest Introduction
3:38 - Career Journey: Google, Spotify, Consulting and Scling
7:52 - Scaling Data Teams: Building Self-Service at Spotify
10:48 - Orchestration Spotlight: Luigi as a Data Build System
11:50 - DataOps Defined: Enablement, Workflows and People Alignment
16:42 - Data Platform Principles: Immutability & Functional Architecture
20:12 - Reproducibility Problems: Mutable ETL vs Immutable Pipelines
21:29 - Data Lake vs Data Warehouse: Raw Data, Aggregates & Use Cases
23:29 - Data Lake Fundamentals: Object Storage, Governance & Raw Dumps
28:22 - Ingress & Egress: Offline Processing and Self-Service SQL
30:34 - Core Platform Components: Storage, Compute & Workflow Engine
31:18 - Compute Options: Spark, Flink, Containers and Managed Services
35:57 - Cloud Trade-offs: Prepackaged Platforms vs DIY Assembly
39:57 - Recommended Reading: Lambda Architecture, Practical DataOps & Scling List
41:53 - Batch vs Streaming: Latency Trade-offs and Typical Use Cases
45:11 - Micro-batching vs Streaming: Dependency Management & Predictability
46:52 - DataOps Maturity: Test-Certified Practices, Quality & Schema Automation
50:13 - Enabling Self-Service Analytics: Embedding Engineers with Analysts
53:31 - MLOps vs DataOps: Shared Principles and ML-Specific Requirements
57:46 - Data Mesh Overview: Decentralization, Ownership & Governance Risks
1:03:02 - Splitting the Platform: When to Decentralize vs Centralize
1:04:18 - Lineage & Versioning: Code-Defined Pipelines vs Catalog Tools
1:06:01 - Database Versioning: Full Dumps and Change Data Capture (CDC) Strategies
1:07:52 - Lakehouse Architecture: Warehouse Features Layered on Data Lake
1:11:01 - Further Resources: Scling Reading List & Presentations