ML Platform Engineer Role
Podcast-backed definition of the ML platform engineer role: internal ML platforms, developer experience, MLOps services, infrastructure tradeoffs, and boundaries with MLOps and ML engineering.
An ML platform engineer builds the shared path that model builders use to train, deploy, monitor, and govern machine learning systems. The role sits between MLOps, platform engineering, and machine learning engineering. It’s less about owning one model and more about making many model teams faster and safer.
Simon Stiebellehner describes the platform version directly. His episode starts with deployment blockers, then moves through cloud infrastructure, Kubernetes, and Terraform. From there it covers data science workflows, experiment tracking, and model registries, followed by serving and orchestration. It closes with metadata, lineage, governance, and prediction logging (ML platform episode at 6:55-54:15).
Common Definition
The common definition is internal platform ownership for ML work. The platform engineer gives data scientists and ML engineers a reliable way to access compute and track experiments. They also help teams persist models, deploy predictions, and observe runtime behavior. That connects the role to ML Platforms, Experiment Tracking, and Model Registry.
Simon frames MLOps as people, workflow, and technology. He then describes a platform as the reusable system around the data science workflow. That workflow starts with exploration, training, and evaluation, and the platform supports it with self-service compute, tracking, and registries, then with serving and orchestration (ML platform episode at 4:42-34:01).
Krzysztof Szafanek gives the engineer-as-consultant version from Zalando. His ML platform work includes the zflow library and pipeline architecture. Onboarding, training, and user support also belong in the role (ML engineering career episode at 13:25-17:48). That makes developer experience part of the role, not a separate nice-to-have.
Operational ownership also belongs in the role. Simon ties team size and on-call expectations to ML platform staffing, which means the platform isn’t only a set of libraries. Someone has to support the path when training, deployment, serving, or monitoring fails (ML platform episode at 15:34).
Guest Differences
Guests differ on when platform work is justified. Simon warns against building a heavy platform before real models and repeated business needs exist. He recommends looking for standardization triggers across teams and building minimal platform pieces alongside actual use (ML platform episode at 16:52-20:04 and 47:08-49:19).
Raphael Hoogvliets gives an enabling-team version in MLOps at Scale. His team supports product teams, gathers pain points, and earns adoption through quick wins. That version is closer to internal consulting and platform product management.
Geo Jolly makes the product management layer explicit. In his ML platform strategy episode, internal data scientists and analysts are customers. User feedback and platform usability guide the roadmap, along with observability KPIs, release governance, and rollout timing. Surveys and shadowing add more evidence (ML platform PM episode at 11:24-57:20).
Maria Vechtomova is more pragmatic about tool choice. Her standardized MLOps discussion emphasizes Git, CI/CD, registries, and Kubernetes. It also emphasizes reusable repositories and existing engineering primitives before adding more platform layers (standardized MLOps episode).
Core Work
ML platform engineers usually own self-service paths. Simon’s episode names notebooks and BigQuery as examples of self-service compute. Databricks provisioning appears there too. He then moves into experiment tracking as an early reproducibility win. Model registries handle the handoff from training to downstream use (ML platform episode at 28:20-30:32).
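The handoff from a tracked training run to downstream use can be sketched as a small registry. This is a minimal illustration, not the API of any real registry product or of the platform described in the episode. The names `ModelVersion`, `InMemoryRegistry`, `register`, `promote`, and `production` are hypothetical, as are the example URIs and run IDs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    artifact_uri: str           # where the trained model was persisted
    run_id: str                 # link back to the tracked experiment run
    metrics: Dict[str, float]   # evaluation results recorded for that run


class InMemoryRegistry:
    """Toy registry (hypothetical): versions a model and marks one for production."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}
        self._production: Dict[str, int] = {}

    def register(self, name: str, artifact_uri: str, run_id: str,
                 metrics: Dict[str, float]) -> ModelVersion:
        # Each registration gets the next version number for that model name.
        versions = self._versions.setdefault(name, [])
        model_version = ModelVersion(
            name, len(versions) + 1, artifact_uri, run_id, dict(metrics)
        )
        versions.append(model_version)
        return model_version

    def promote(self, name: str, version: int) -> None:
        # Only versions that were actually registered can be promoted.
        known = [v.version for v in self._versions.get(name, [])]
        if version not in known:
            raise KeyError(f"{name} has no version {version}")
        self._production[name] = version

    def production(self, name: str) -> Optional[ModelVersion]:
        # Downstream consumers ask for "the production model", not a file path.
        version = self._production.get(name)
        if version is None:
            return None
        return self._versions[name][version - 1]
```

A training job would call `register` after evaluation, and a serving job would call `production` to resolve which artifact to load. Keeping `run_id` on the version is what ties the deployed model back to its experiment record.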
Serving is another core area. Platform teams may support batch inference, online serving, scheduled jobs, or APIs. Teams choose between them based on latency, freshness, cost, and ownership. Simon discusses batch versus online serving and orchestration choices in the same platform context (ML platform episode at 31:15-34:01).
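One way to make the batch-versus-online tradeoff concrete is a small decision helper. The sketch below is an illustrative heuristic, not a rule from the episode. It models freshness as a staleness tolerance and ownership as an on-call flag. It assumes batch is preferred whenever it meets the freshness need, and it leaves latency and cost out of the calculation. `ServingNeeds`, `suggest_mode`, and the daily default interval are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class ServingMode(Enum):
    BATCH = "batch"
    ONLINE = "online"


@dataclass(frozen=True)
class ServingNeeds:
    max_staleness_s: float    # how old a precomputed prediction may be
    has_on_call_owner: bool   # whether someone can respond when serving fails


def suggest_mode(needs: ServingNeeds,
                 batch_interval_s: float = 24 * 3600) -> Tuple[ServingMode, List[str]]:
    """Suggest a serving mode plus notes; assumes batch wins when it is fresh enough."""
    # If predictions may be as old as the gap between batch runs, batch suffices.
    if needs.max_staleness_s >= batch_interval_s:
        return ServingMode.BATCH, []
    notes: List[str] = []
    if not needs.has_on_call_owner:
        notes.append("online serving needs an owner who can respond to failures")
    return ServingMode.ONLINE, notes
```

For example, `suggest_mode(ServingNeeds(max_staleness_s=60, has_on_call_owner=False))` returns the online mode with one note about ownership. A real decision would also weigh request latency and running cost.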
The platform also needs governance and observability. Simon connects regulatory constraints, metadata, lineage, and data governance to the platform role. API design and unified prediction schemas belong there too (ML platform episode at 39:54-54:15). That links the role to Model Monitoring, Governance, and Reproducibility.
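A unified prediction schema can be as simple as one record type that every model service logs. The sketch below is an assumption about what such a schema might contain, not the schema discussed in the episode. `PredictionRecord`, `to_log_line`, and all field names are hypothetical.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass(frozen=True)
class PredictionRecord:
    model_name: str
    model_version: str        # ties the prediction to a registered model version
    features: Dict[str, Any]  # inputs as seen at prediction time
    prediction: Any
    # Generated per record so logs can be joined with later outcomes.
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: PredictionRecord) -> str:
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Because every service emits the same fields, monitoring and lineage tooling can read prediction logs without per-model parsers. Values in `features` and `prediction` need to be JSON-serializable for `to_log_line` to work.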
Some platform components are situational. Feature stores are useful when teams need feature reuse, online serving, validation, and governance across repeated tabular ML use cases. They can be overkill when the team has no real-time feature need or shared feature lifecycle (feature store episode at 21:00-52:00). The same caution applies to larger platform choices. Start from repeated pain, not tool fashion.
Skills
The role needs cloud and infrastructure fluency. Simon names cloud infrastructure, Kubernetes, Terraform, and software engineering as core platform skills (ML platform episode at 8:11-13:50).
It also needs enough ML workflow knowledge to understand notebooks, training runs, evaluation, model handoffs, and deployment friction.
Krzysztof’s episode adds durable engineering habits. SQL, Git, shell, and debugging remain valuable as ML tooling changes. Engineers also need T-shaped expertise and troubleshooting skill (ML engineering career episode at 29:00-37:37). Those skills matter because platform work often fails in integration details, not in isolated demos.
The role also needs ML literacy, but it doesn’t require being the strongest modeler on the team. Simon discusses when platform engineers should learn model internals. Krzysztof frames the useful profile as T-shaped (ML platform episode at 51:41, ML engineering career episode at 35:23).
The boundary with an MLOps engineer is porous. MLOps can be the operating discipline around one model or team. ML platform engineering turns repeated MLOps needs into shared internal services.
The boundary with a machine learning engineer is product ownership. ML engineers often own one model-backed capability. Platform engineers own the paved paths that many such capabilities use.
Related Pages
These pages cover adjacent roles, platform services, and operating concerns.