ML Infrastructure

Training, feature/data/model pipelines, registries, batch and online serving, monitoring, and platform foundations for production machine learning systems.

Related Wiki Pages

ML Platforms, Platform Engineering, AI Infrastructure, ai-infrastructure-cost-and-ownership, MLOps, Model Monitoring, Model Registry, Orchestration

Machine learning infrastructure gives teams the components they need for classical ML systems. It supports training, packaging, deployment, and monitoring. It also covers compute and storage, feature and data pipelines, and model artifacts and registries. Orchestration, serving, and networking belong in the same base layer. It’s the technical base for ML Platforms, MLOps, and Machine Learning System Design.

Machine-learning infrastructure work asks what has to exist under the ML lifecycle. It also asks where those components fail under scale, regulation, latency, or cost pressure. ML Platforms owns the shared internal product surface that turns those components into a supported path for data scientists and ML engineers. The ML platform engineer role sits at that handoff from infrastructure pieces to a user-facing platform.

The skill set spans cloud infrastructure, notebooks, Kubernetes, and Terraform. It also covers managed compute, batch inference, online serving, and orchestration (^[1]). Infrastructure is therefore broader than model serving but narrower than the whole platform product.

Effective Data Science Infrastructure by Ville Tuulos covers the same compute, orchestration, and serving layers from the data science side, built around his Metaflow experience.

Vin Vashishta frames the ML architect’s infrastructure work as a business translation role. The architect turns user, customer, and business requirements into a platform vision. They check whether existing systems can support the work and estimate what production and maintenance will cost (^[2]). That puts infrastructure decisions close to ML Product Manager Role because buy-versus-build and platform reuse can decide whether a model-backed product deserves funding.

Infrastructure Baseline

The MLOps toolset includes release and reproducibility concerns: experiment tracking, a model registry, serving and monitoring, and package registries with deployment compatibility (^[3]). Docker, Kubernetes, and Databricks matter here because a model artifact isn’t enough if runtime images and dependencies drift.

Large-model and LLM product workloads push the topic toward AI Infrastructure. That shift happens when inference APIs, retrieval, evaluation, GPU capacity, or cloud-versus-on-prem cost dominate the work (^[4]). On this page, the classical lifecycle means data and feature pipelines, training jobs, artifacts and registries, serving, and monitoring.

Platform Timing and Scale

The disagreement among these sources is less about components than about timing and scale. Infrastructure work starts when a workload needs reliable compute, storage, release paths, or runtime ownership. Platform investment pays off later, when teams repeat deployment, serving, governance, and registry work across projects. Building heavy platform pieces too early is a mistake because real models and business needs come first (^[1]). For the platform-product side of that decision, see ML Platforms.

A centralized MLOps team reframes adoption by gathering pain points, supporting product teams, and measuring value before standardizing too much (^[3]). The infrastructure succeeds when ML teams use it repeatedly and can trace value back to release speed, reproducibility, or operational reliability.

For smaller production systems, the boundary sits lower. When the workload doesn’t yet justify heavier orchestration, start with Lambda and queues, and move toward Airflow or Kubernetes only once it does (^[5]).
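
A minimal sketch of the Lambda-and-queue shape, assuming an SQS-style event in which each record carries a JSON `body`. The `score` function, the `id` and `features` field names, and the fixed weights are hypothetical stand-ins rather than anything taken from the cited talk.

```python
import json


def score(features):
    # Hypothetical stand-in for a real model: a fixed-weight linear score.
    weights = [0.5, -0.25, 1.0]
    return sum(w * x for w, x in zip(weights, features))


def handler(event, context=None):
    """Score every queued message delivered in one invocation.

    Assumes an SQS-style event: {"Records": [{"body": "<json string>"}, ...]}.
    """
    results = []
    for record in event.get("Records", []):
        message = json.loads(record["body"])
        results.append({"id": message["id"], "score": score(message["features"])})
    return results


# Local usage with a hand-built event, so the shape can be tried without a cloud account.
example_event = {
    "Records": [
        {"body": json.dumps({"id": "a", "features": [1.0, 2.0, 3.0]})},
        {"body": json.dumps({"id": "b", "features": [0.0, 0.0, 1.0]})},
    ]
}
example_results = handler(example_event)
```

The handler holds no state between invocations, which is what keeps this path cheap to operate. Retries, dead-letter handling, and model loading are left out of the sketch.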

Finance regulation can impose the opposite constraint: ML teams may work on Hadoop and OpenShift rather than self-service cloud. Linux and networking then become part of the infrastructure skill set. SSH/SCP, firewall requests, and internal platform behavior matter too (^[6]).

LLM product work pushes past the classical boundary. When hosted inference and retrieval dominate, the decision moves into AI Infrastructure. The same shift applies when evaluation, GPU cost, distributed training, or bare-metal scheduling dominate (^[4]).

Vashishta adds a roadmap lens to the same infrastructure decision. A platform purchase may look too expensive for one project but become justified when it supports several products over one to three years. The architect’s job is to compare existing infrastructure, cloud options, on-prem constraints, and product roadmap reuse before the team commits to a path (^[7]).

Compute for Training and Batch Work

Compute starts with ordinary cloud resources for notebooks, training jobs, and batch work. AWS, GCP, and Azure are platform engineering skills. Kubernetes, Terraform, and managed compute belong there too (^[1]). The developer experience goal is practical: teams need compute access without opening a support ticket for every run.

In regulated finance, compute can become an on-prem platform constraint. Teams may run Hadoop and OpenShift instead of elastic cloud services. They may also request hardware through internal processes. Deployment work has to fit approved platforms, so infrastructure ownership becomes part of governance (^[6]).

Some ML workloads add GPU requirements, but classical infrastructure still asks whether teams can get approved compute. It also asks whether they can run training and batch jobs, store artifacts, and reproduce the environment later. When the dominant problem becomes large-model distributed training, communication bottlenecks, or the cost tradeoff between cloud and on-prem hardware, use AI Infrastructure (^[4]).

This is where machine learning system design becomes more than an API and database exercise. A design has to say whether the model trains on a managed service, a Kubernetes cluster, a Databricks job, or another approved runtime. It also has to explain when managed compute is enough and when infrastructure ownership becomes a real constraint.

Storage and Artifact Management

ML infrastructure stores training snapshots and features alongside raw data. It also stores model files and Docker images. Experiment metadata, prediction logs, and deployment artifacts belong in the same layer. This layer ties to experiment tracking and model registries (^[1]). Those systems create the handoff from training to batch inference, online serving, and audit.

Package registries and dependency compatibility matter because the model artifact alone isn’t enough if the runtime image changes. Python packages and deployment dependencies can drift too. Container strategy with Docker and Kubernetes affects reproducibility and team autonomy (^[3]).
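
One way to make dependency drift visible is to record a runtime fingerprint next to each model artifact and compare it at load time. The sketch below uses only the standard library. The function names and the fingerprint layout are assumptions for illustration, not a format taken from the cited sources.

```python
import platform
import sys
from importlib import metadata


def runtime_fingerprint(packages):
    """Collect interpreter and package versions to store next to a model artifact."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # Not installed in this runtime.
    return {
        "python": platform.python_version(),
        "implementation": sys.implementation.name,
        "packages": versions,
    }


def drifted(recorded, current):
    """Return the names whose recorded version differs from the current runtime."""
    names = set(recorded["packages"]) | set(current["packages"])
    changed = sorted(
        name for name in names
        if recorded["packages"].get(name) != current["packages"].get(name)
    )
    if recorded["python"] != current["python"]:
        changed.insert(0, "python")
    return changed
```

A serving job could call `drifted(saved_fingerprint, runtime_fingerprint([...]))` before loading the model and refuse to start, or log a warning, when the list is not empty.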

The simpler production path stores Parquet on S3, Dockerizes training, and persists model files where later jobs can load them (^[5]). The team needs stable storage for data, code, and models before it can reason about Reproducibility.
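
The "persist model files where later jobs can load them" step can be sketched with a local directory standing in for object storage such as S3. The path layout, the `save_model` and `load_model` names, and the metadata keys are hypothetical. A real setup would swap the file writes for object-store calls.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def save_model(model, root, name, metadata):
    """Write a model under a content-addressed path plus a JSON sidecar."""
    payload = pickle.dumps(model)
    digest = hashlib.sha256(payload).hexdigest()[:12]
    target = Path(root) / name / digest
    target.mkdir(parents=True, exist_ok=True)
    (target / "model.pkl").write_bytes(payload)
    (target / "meta.json").write_text(json.dumps({**metadata, "sha256_prefix": digest}))
    return target


def load_model(path):
    """Load the model and its metadata back for a later batch or serving job.

    pickle should only be used on files from a trusted source.
    """
    path = Path(path)
    model = pickle.loads((path / "model.pkl").read_bytes())
    meta = json.loads((path / "meta.json").read_text())
    return model, meta


# Usage: a temporary directory plays the role of the bucket.
with tempfile.TemporaryDirectory() as root:
    saved_path = save_model(
        {"weights": [0.1, 0.2]},
        root,
        "demo-model",
        {"training_snapshot": "hypothetical-snapshot-id"},
    )
    restored_model, restored_meta = load_model(saved_path)
```

Deriving the directory name from the file contents means two different models can never overwrite each other, which is one simple precondition for reproducibility.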

Orchestration and Scheduling

Orchestration coordinates training and evaluation, plus inference, retraining, and data movement. Airflow and pipelines sit inside production workflows, which links ML infrastructure to data pipelines and Batch vs Streaming (^[1]). Some models need scheduled batch scoring, while others need online inference or streaming features.
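
At its core, an orchestrator runs steps in dependency order. The sketch below shows only that idea, using Python's standard `graphlib` module (Python 3.9+). It is not how Airflow or any other tool is implemented, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key depends on the steps in its value set. Step names are hypothetical.
pipeline = {
    "load_data": set(),
    "build_features": {"load_data"},
    "train": {"build_features"},
    "evaluate": {"train"},
    "batch_score": {"evaluate", "build_features"},
}


def run(pipeline, tasks):
    """Run each task once, after all of its dependencies, and return the order."""
    order = []
    for step in TopologicalSorter(pipeline).static_order():
        tasks[step]()
        order.append(step)
    return order


# Placeholder callables stand in for real training and scoring code.
tasks = {name: (lambda: None) for name in pipeline}
execution_order = run(pipeline, tasks)
```

Real orchestrators add scheduling, retries, backfills, and isolation on top of this ordering, which is where most of the operational cost sits.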

Metaflow is a workflow-tool example. It integrates across AWS, Kubernetes, and Argo (^[8]). Data scientists shouldn’t have to assemble the cloud and workflow stack from scratch before they can run a reproducible ML flow.

Kubernetes is useful but not a universal answer because AI workflows may need SLURM (^[4]). At the opposite scale, Lambda and queues or simpler schedulers fit when the workload doesn’t justify heavier orchestration (^[5]).

Simulation-heavy work adds a pre-ML boundary. Teams need infrastructure that moves data to high-performance clusters and retrieves results. They also need to keep competing client datasets separate before models or pipelines use the outputs. For those workloads, engineers treat simulation and digital-twin systems as orchestration work rather than model serving alone (^[9]).

Serving and Deployment

Serving infrastructure turns trained models into predictions through two recurring deployment shapes: batch inference and online serving (^[1]). Batch inference can run as scheduled jobs. Online serving needs request-time latency, logging, API contracts, and rollback paths.
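
The two shapes can share one prediction function and differ only in the wrapper around it. In the sketch below, the `predict` rule, the payload format, and the status codes are illustrative assumptions. Logging and rollback are omitted.

```python
import json


def predict(features):
    # Hypothetical model: flags a row when the feature mean exceeds 0.5.
    return 1 if sum(features) / len(features) > 0.5 else 0


def batch_job(rows):
    """Scheduled shape: score a whole table at once and keep the results."""
    return {row_id: predict(features) for row_id, features in rows.items()}


def online_handler(request_body):
    """Request-time shape: validate one payload, score it, return a response."""
    try:
        payload = json.loads(request_body)
        features = [float(x) for x in payload["features"]]
        if not features:
            raise ValueError("empty feature list")
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed input is reported to the caller instead of crashing the service.
        return {"status": 400, "error": str(exc)}
    return {"status": 200, "prediction": predict(features)}
```

The batch path can assume clean, pre-validated inputs and fail as a whole job. The online path has to handle each malformed request on its own, which is part of what the API contract covers.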

A concrete product example chooses between live API calls and precomputed predictions. It then weighs SageMaker endpoints and cost tradeoffs (^[5]).

Serving is a business and latency decision, not just a framework choice. Classical ML systems usually choose between scheduled scoring and request-time prediction. Streaming features, edge execution, and hybrid paths add more options. For LLM serving, token volume, caching, model selection, hosted APIs, and LLM cost optimization can move the question to AI Infrastructure.

Edge and mobile serving push deployment constraints even further. Offline mobile models are still a mostly manual deployment space today. Vendors extend Kubernetes toward edge devices so model and application updates can be scheduled closer to the user (^[10]). That puts edge deployment beside orchestration, Model Monitoring, and runtime ownership rather than treating it as only an app packaging problem.

Teams in low-resource healthcare face a clinical infrastructure decision because connectivity and local hardware can vary. Before a team can claim the model fits the care setting, it may have to choose between cloud inference and on-device execution. That ties serving infrastructure to healthcare ML validation and local operations, not only latency. A pediatric monitoring device in a hospital with intermittent internet may need local inference and local update procedures. Its runtime also has to fit the rest of the device software (^[11]).

Deployment ties to release discipline, so the MLOps toolset includes CI and repository structure. It also includes parameterization, tests, and serving (^[3]). Infrastructure should therefore support both the runtime and the release path that gets code into that runtime.

Monitoring and Feedback Loops

Monitoring links deployment to maintenance. The core challenge is keeping models deployed, monitored, and maintained. CI/CD and tests also need ties to traceability, experiment capture, and monitoring (^[3]). Model monitoring belongs on the infrastructure page because the runtime needs logs, metrics, alerts, and ownership.

Governance and observability requirements also influence infrastructure design. They include metadata and lineage, GDPR constraints, deletion rules, and unified prediction schemas (^[1]). Prediction logs should support monitoring and analytics, but they also need security and data-governance controls.

For lifecycle ML, monitoring also has to connect predictions back to training data, feature versions, model versions, labels, and downstream outcomes. When the operating question becomes hosted API behavior or GPU utilization, the same monitoring concern moves into AI Infrastructure (^[4]). Retrieval quality and AI product cost can push it there too.
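
One way to keep that link is to log every prediction with the versions that produced it and to join labels on later. The record fields and helper names below are hypothetical, a sketch of the idea rather than a schema from the cited sources.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PredictionRecord:
    prediction_id: str
    model_version: str
    feature_version: str
    training_snapshot: str
    prediction: float
    label: Optional[float] = None  # Filled in later, once the outcome is known.


def attach_labels(records, labels):
    """Join late-arriving labels onto logged predictions by prediction_id."""
    return [replace(r, label=labels.get(r.prediction_id)) for r in records]


def error_by_model_version(records):
    """Mean absolute error per model version, over labelled records only."""
    totals = {}
    for r in records:
        if r.label is None:
            continue
        total, count = totals.get(r.model_version, (0.0, 0))
        totals[r.model_version] = (total + abs(r.prediction - r.label), count + 1)
    return {version: total / count for version, (total, count) in totals.items()}
```

Because each record carries its model, feature, and snapshot identifiers, a bad metric can be traced to a specific release instead of to "the model" in general.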

In Python stock analysis, the same infrastructure question appears as scheduled market-data jobs, feature calculation, prediction records, and position decisions. That workflow needs logs for data arrival, model version, and execution context before monitoring can explain a bad decision (^[12]).

Infrastructure Handoff to Platform Teams

Infrastructure becomes valuable when teams can use it without becoming infrastructure specialists. A user-centric platform starts from data science workflows and notebooks, then adds thin abstraction layers over cloud providers (^[1]). The lower layer has to make cloud resources, runtimes, and schedulers reliable. Images and observability controls have to work too before the platform can expose them.

The team model behind that experience is a centralized MLOps team supporting product teams and ML engineers. It starts with CI/CD and tangible pain points (^[3]). Infrastructure ownership becomes a service model, not only a cluster-maintenance job. ML Platforms covers the product roadmap, self-service workflow, and adoption side of that service model.

Metaflow shows the open-source developer experience version. Its flow abstraction sits across AWS, Kubernetes, and Argo (^[8]).

AWS, Kubernetes, and Argo still have to work before the abstraction can feel simple. Storage and execution environments matter too. The user works through a tool that fits ML workflows. The infrastructure layer keeps the underlying execution path dependable.

DataTalks.Club