ML Platforms

Podcast-grounded reference page for shared ML platform systems, internal product strategy, and team enablement.

An ML platform is the shared internal product for teams that train and deploy models. It helps them monitor and govern those models without rebuilding the same plumbing in every project. In the podcast archive, the platform layer sits between MLOps as an operating discipline and Machine Learning Infrastructure as the compute and orchestration foundation.

The archive’s strongest platform discussions treat platforms as adoption systems, not tool catalogs. Simon Stiebellehner connects platforms to repeated deployment work, registries, metadata, and governance in Building Production ML Platforms. Geo Jolly adds the product management view in ML Product Manager and MLOps Platform Strategy. His platform users are internal data scientists, ML engineers, and stakeholders. Adoption, quality gates, and release paths decide whether that platform works.

Common Definition

Across the archive, an ML platform is the reusable path from experiment to production. In Building Production ML Platforms, Simon Stiebellehner starts from self-service compute and experiment tracking, moves to model registries and serving, and then covers orchestration, metadata, lineage, and governance.

In MLOps at Scale, Raphaël Hoogvliets describes the same idea through a centralized enabling team. That team supports product teams with CI, repository standards, tests, and reproducibility, then completes the operating path with serving, monitoring, and feedback loops.

The shared definition is narrower than “all ML tools” and broader than “Kubernetes for models.” Pragmatic and Standardized MLOps puts Git, CI/CD, registries, and reusable repositories at the center. Scaling Enterprise AI adds data-first readiness and governance. Post-ChatGPT AI Infrastructure adds GPU economics and distributed-training constraints.

Disagreements and Boundaries

Guests differ on timing, scope, and infrastructure depth. Simon Stiebellehner argues for platform investment once the same training, serving, deployment, or governance problems recur across teams. He also warns against heavy platform work before teams have real models and business needs (Building Production ML Platforms).

Raphaël Hoogvliets emphasizes an enabling team that earns adoption by collecting pain points and delivering quick wins. He also measures deployment frequency and impact (MLOps at Scale).
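
Deployment frequency is simple to compute once deployments are logged with a date. The sketch below is only an illustration of the arithmetic: the function name, the weekly unit, and the inclusive window are assumptions, not a definition taken from the episode.

```python
from datetime import date


def deployments_per_week(deploy_dates, start, end):
    """Average deployments per week between start and end (both inclusive).

    `deploy_dates` is any iterable of `datetime.date` values. The weekly
    unit is an assumption for illustration; a team could report per day
    or per sprint instead.
    """
    days = (end - start).days + 1
    if days <= 0:
        raise ValueError("end must not be before start")
    # Count only deployments that fall inside the reporting window.
    in_window = [d for d in deploy_dates if start <= d <= end]
    return len(in_window) / (days / 7)
```

Tracking the same number per team before and after a platform change gives the enabling team a rough adoption signal, though it says nothing about impact on its own.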

They also differ on how productized the platform function should be. Geo Jolly frames the ML platform as an internal product. His episode covers roadmap tradeoffs, user research, rollout governance, adoption metrics, and platform usability costs (ML Product Manager and MLOps Platform Strategy).

Maria Vechtomova is more skeptical of chasing new platform layers. She emphasizes Git, CI/CD, registries, and Kubernetes. She also covers cookie-cutter repositories and service principals (Pragmatic and Standardized MLOps).

Infrastructure guests draw the boundary differently again. Andrey Cheptsov treats orchestration and cloud economics as platform design constraints. He adds GPU allocation, distributed training, and Kubernetes limits (Post-ChatGPT AI Infrastructure). That view makes the platform closer to AI Infrastructure when the workload is large-model training or serving. Smaller product ML teams may stay closer to MLOps and Developer Experience.

Internal Product and Adoption

The archive repeatedly frames ML platforms as internal products, and Geo Jolly’s episode is the clearest product-management version. Platform work starts with internal users, ROI, adoption, and specs. Roadmap choices and rollout governance come next, and quality gates matter more than a generic tool wishlist (ML Product Manager and MLOps Platform Strategy).

That connects ML platforms to Self-Service Data Platforms and ML Product Manager Role. The platform succeeds only when teams can use it without constant bespoke support.

Raphaël Hoogvliets adds the operating model for adoption. His centralized MLOps team supports product teams and gathers pain points. It uses feedback loops and quick wins before pushing broader standards (MLOps at Scale). In practice, that makes Developer Experience a platform requirement. Notebooks, CI templates, model handoff paths, and deployment workflows need to reduce cognitive load.

Platform Components

The podcast archive keeps returning to a compact platform service set:

Standardization and Developer Experience

ML platform standardization is useful when it makes teams faster and safer, not when it hides all flexibility. Simon Stiebellehner describes thin abstractions over cloud providers, self-service notebooks, and deployment paths that let product teams avoid repetitive infrastructure work (Building Production ML Platforms).

Maria Vechtomova’s platform advice points in the same direction: teams can reuse Kubernetes, Git, and CI/CD, then add conventions, templates, and guardrails where they repeatedly struggle (Pragmatic and Standardized MLOps).

The strongest platform pages in this repo therefore sit near Platform Engineering, GitOps for Data Teams, and MLOps Tools. The podcast evidence doesn’t support a one-size-fits-all tool stack. It supports a product-minded platform team that standardizes the work teams shouldn’t repeat: repository layout, release paths, artifact storage, access rules, monitoring hooks, and basic governance.
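
A standardized repository layout can be checked mechanically. The sketch below is a minimal illustration, not a layout any guest prescribes: `EXPECTED_PATHS` and `missing_paths` are hypothetical names, and the listed paths (including the GitHub-style workflow directory) are assumptions a platform team would replace with its own template.

```python
from pathlib import Path
from typing import List, Sequence

# Assumed layout for illustration only; each platform team defines its own.
EXPECTED_PATHS = ("pyproject.toml", "src", "tests", ".github/workflows")


def missing_paths(repo_root: str, expected: Sequence[str] = EXPECTED_PATHS) -> List[str]:
    """Return the expected files or directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in expected if not (root / p).exists()]
```

Run in CI, a check like this reports which conventions a repository has drifted from without blocking teams from adding anything extra.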

Governance, Reproducibility, and Risk

Enterprise and regulated teams need more than convenience tooling. Simon Stiebellehner ties platform design to metadata, lineage, and artifact logging. Security, GDPR implications, deletion rules, and unified prediction schemas also matter (Building Production ML Platforms). Those requirements connect the platform directly to Reproducibility, Governance, and Data Quality and Observability.
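
A unified prediction schema can be as small as one record type that every serving path emits. The sketch below is an assumption-heavy illustration: the episode does not specify field names, so `PredictionRecord`, its fields, and the helper functions are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PredictionRecord:
    """Hypothetical shared shape for predictions logged by any model."""
    model_name: str
    model_version: str
    request_id: str
    subject_id: str   # pseudonymous ID so a deletion request can find rows
    prediction: float
    created_at: str   # ISO 8601 timestamp in UTC


def make_record(model_name, model_version, request_id, subject_id, prediction, now=None):
    """Build a record, stamped with the current UTC time unless `now` is given."""
    now = now or datetime.now(timezone.utc)
    return PredictionRecord(model_name, model_version, request_id, subject_id,
                            float(prediction), now.isoformat())


def to_json(record):
    """Serialize with sorted keys so logs from different services line up."""
    return json.dumps(asdict(record), sort_keys=True)


def drop_subject(records, subject_id):
    """Apply a deletion request by keeping only records for other subjects."""
    return [r for r in records if r.subject_id != subject_id]
```

Keeping the model name and version on every record is what lets a logged prediction be traced back to a specific artifact later.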

Alexander Hendorf’s enterprise AI discussion reinforces the same lesson from a strategy view. He emphasizes data-first readiness, retraining, feedback loops, MLOps automation, and long-term platform selection. His episode also ties enterprise MLOps to standardization and CI/CD, with governance and reproducibility in the same platform discussion (Scaling Enterprise AI).

In these discussions, governance isn’t a separate compliance afterthought. It is part of the release path for moving models into production.
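
One way to picture governance as part of the release path is a promotion check that refuses to move a model forward until its metadata is complete. The sketch below is illustrative only: `ModelCandidate`, `REQUIRED_FIELDS`, and the two functions are hypothetical names, and the required fields are an assumed minimum rather than a list from the episodes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelCandidate:
    """Hypothetical record a platform might keep for a model awaiting release."""
    name: str
    version: str
    training_data_ref: str = ""  # lineage: which dataset snapshot was used
    code_commit: str = ""        # lineage: which Git commit produced the model
    artifact_uri: str = ""       # where the logged artifact is stored
    owner: str = ""              # who answers for the model in production


# Assumed minimum metadata; a real platform would define its own list.
REQUIRED_FIELDS = ("training_data_ref", "code_commit", "artifact_uri", "owner")


def promotion_blockers(candidate: ModelCandidate) -> List[str]:
    """Return the names of required fields that are still empty."""
    return [f for f in REQUIRED_FIELDS if not getattr(candidate, f)]


def can_promote(candidate: ModelCandidate) -> bool:
    """A candidate may move to production only when nothing blocks it."""
    return not promotion_blockers(candidate)
```

Returning the list of blockers, rather than a bare yes or no, tells the submitting team exactly what to fix before the next attempt.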

Compute, Orchestration, and AI Infrastructure

The platform boundary expands when model workloads put pressure on compute and orchestration. Simon Stiebellehner names cloud infrastructure, Kubernetes, Terraform, and managed compute as core platform skills, alongside notebooks, batch jobs, online serving, and pipeline orchestration (Building Production ML Platforms).

Raphaël Hoogvliets adds dependency compatibility and package registries, and notes that Docker images, Kubernetes, and Databricks affect reproducibility and team autonomy (MLOps at Scale).

Andrey Cheptsov’s AI infrastructure episode pushes the platform discussion into GPU economics and distributed training. It names PyTorch, NCCL, and communication bottlenecks, and it covers optimization and cloud versus on-prem decisions. Kubernetes limits, Slurm-like scheduling needs, and bare-metal provisioning all become platform concerns (Post-ChatGPT AI Infrastructure).

For large model teams, the ML platform can’t be separated cleanly from AI Infrastructure. For more typical product ML teams, the platform may stay focused on Machine Learning System Design, model packaging, and deployment paths. Monitoring and feedback loops still matter.
