Post-ChatGPT AI Infrastructure: Open Source Orchestration, On-Prem Economics & Distributed Training at Scale
Original Episode
Use these links for the canonical episode and media sources.
- Open the original DataTalks.Club podcast page
- Watch on YouTube
- Listen on Spotify
- Listen on Apple Podcasts
Episode Overview
How has the rise of ChatGPT reshaped the infrastructure needed to build and run large language models, and when does open source orchestration make sense compared to cloud or proprietary systems? In this episode, we speak with Andrey Cheptsov, founder and CEO of dstack, an open-source alternative to Kubernetes and Slurm designed to simplify AI infrastructure orchestration. Drawing on his decade-plus at JetBrains building developer tools, Andrey frames the practical trade-offs between on-prem economics and cloud spend.
Chapter Summary
Use these checkpoints to decide whether to open the source transcript.
- 0:00 - Episode Kickoff & Guest Introduction
- 2:46 - Career Background: JetBrains, DataSpell, and Move into AI
- 5:27 - Origins of dstack: Reducing AI Infrastructure Cost of Ownership
- 8:25 - Cloud vs On-Prem Costs and MLOps Limitations (SageMaker example)
- 10:00 - Cloud-to-On-Prem Realities in the Post-ChatGPT Era
- 12:58 - Choosing Open Source: Developer Tools, Feedback, and Community
- 17:33 - Open vs Proprietary Models: Business Models and Trade-Offs
- 21:37 - Decentralization in AI: Privacy, Control, and Industry Fit
- 30:16 - Training at Scale: GPU Requirements and Distributed Challenges
- 34:46 - Distributed Training Stack: PyTorch, NCCL, and Communication Bottlenecks
- 37:35 - Efficiency Over Brute Force: Optimization Strategies and DeepSpeed
- 39:30 - Fine-Tuning & Serving Models for Non–AI-First Companies
- 47:16 - Orchestration Gaps: Kubernetes Limitations for AI Workflows and Slurm
- 50:59 - Kubernetes as the Deployment Standard vs Smaller Alternatives
- 51:56 - Hybrid Infrastructure Outlook: Cloud Dominance and On-Prem Nuances
- 54:31 - On-Prem GPU Coordination: SSH, Resource Contention, and Real Examples
- 56:53 - Bare-Metal as a Service: Provisioning, Automation, and Firmware Management
- 58:07 - Edge Computing Scope: Devices, Local Models, and Definition Ambiguity
- 1:00:30 - Federated Learning vs Distributed Compute: Practicality and Use Cases
- 1:02:51 - Closing Pick: Science-Fiction Recommendation — The Three-Body Problem
- 1:05:38 - Episode Wrap-Up & Links to dstack and Guest Resources