DataOps 101 for Scaling Data Platforms: Immutable Pipelines, Self-Service Lakehouse & Reproducibility
Original Episode
Use these links for the canonical episode and media sources.
- Open the original DataTalks.Club podcast page
- Watch on YouTube
- Listen on Spotify
- Listen on Apple Podcasts
Episode Overview
How do you scale a data platform that supports self-service analytics while keeping pipelines reproducible and maintainable? In this episode, Lars Albertsson, founder of Scling and former Google, Spotify and Schibsted engineer, walks through pragmatic DataOps principles for building scalable data platforms.
Chapter Summary
Use these checkpoints to decide whether to open the source transcript.
- 2:39 - Episode Opening & Guest Introduction
- 3:38 - Career Journey: Google, Spotify, Consulting and Scling
- 7:52 - Scaling Data Teams: Building Self-Service at Spotify
- 10:48 - Orchestration Spotlight: Luigi as a Data Build System
- 11:50 - DataOps Defined: Enablement, Workflows and People Alignment
- 16:42 - Data Platform Principles: Immutability & Functional Architecture
- 20:12 - Reproducibility Problems: Mutable ETL vs Immutable Pipelines
- 21:29 - Data Lake vs Data Warehouse: Raw Data, Aggregates & Use Cases
- 23:29 - Data Lake Fundamentals: Object Storage, Governance & Raw Dumps
- 28:22 - Ingress & Egress: Offline Processing and Self-Service SQL
- 30:34 - Core Platform Components: Storage, Compute & Workflow Engine
- 31:18 - Compute Options: Spark, Flink, Containers and Managed Services
- 35:57 - Cloud Trade-offs: Prepackaged Platforms vs DIY Assembly
- 39:57 - Recommended Reading: Lambda Architecture, Practical DataOps & Scling List
- 41:53 - Batch vs Streaming: Latency Tradeoffs and Typical Use Cases
- 45:11 - Micro-batching vs Streaming: Dependency Management & Predictability
- 46:52 - DataOps Maturity: Test-Certified Practices, Quality & Schema Automation
- 50:13 - Enabling Self-Service Analytics: Embedding Engineers with Analysts
- 53:31 - MLOps vs DataOps: Shared Principles and ML-Specific Requirements
- 57:46 - Data Mesh Overview: Decentralization, Ownership & Governance Risks
- 1:03:02 - Splitting the Platform: When to Decentralize vs Centralize
- 1:04:18 - Lineage & Versioning: Code-Defined Pipelines vs Catalog Tools
- 1:06:01 - Database Versioning: Full Dumps and Change Data Capture (CDC) Strategies
- 1:07:52 - Lakehouse Architecture: Warehouse Features Layered on Data Lake
- 1:11:01 - Further Resources: Scling Reading List & Presentations