Wiki

ML System Design Documents

Podcast-grounded reference for ML design docs as fail-fast, ownership, data strategy, monitoring, and fallback artifacts.

ML system design documents make an ML system’s goal and scope explicit. They also record assumptions and data strategy. They capture baselines, evaluation plans, ownership, and operational risks. Teams write them before they commit months of implementation. They’re a focused companion to Machine Learning System Design, Documentation, MLOps, and Model Monitoring.

Podcast guests converge on a practical view: the design document isn’t a paperwork gate. Teams use it to expose weak assumptions, missing data, and unclear ownership. They can also catch bad baselines, deployment gaps, and monitoring blind spots while the project is still cheap to change.

Use these wiki pages for the engineering context:

Start with these podcast interviews:

Common Definition

An ML system design document is a compact, shareable description of the system the team intends to build. It also records the risks the team must manage. In Building Scalable and Reliable Machine Learning Systems, Arseny Kravchenko argues for a lightweight design phase that splits attention between the problem and the solution. The problem side records product scenarios and goals. It also records non-goals, assumptions, constraints, and metrics.

On the solution side, teams cover the baseline, model direction, and data flow. They also cover pipeline components and data strategy.

Valerii Babushkin describes the design document as a way to fail fast in ML System Design. A project may not survive the questions raised in a few weeks of design. Teams should learn that before six or nine months of implementation. They can also use the document to see whether the project is feasible. Teams should also use it to show what must be owned and where reliability work has to happen.

Guest Differences

Arseny emphasizes early scoping and decision quality. His examples include startups, edge and mobile constraints, photostock search, and retail pricing. He also covers baseline tests, data availability, batch versus real-time flows, and system diagrams in Building Scalable and Reliable Machine Learning Systems. His design-doc advice is intentionally lightweight enough to fit small teams and ambiguous early projects.

Valerii emphasizes long-lived documentation and modular chapters in ML System Design. He also emphasizes ownership, versioning signals, and bus factor. Reliability matters in his account too. So do monitoring and fallback planning.

Nadia Nahar adds the software-engineering focus in Software Engineering for Machine Learning. ML products fail when teams treat requirements and data access as secondary concerns. Team integration, deployment expectations, and documentation need the same attention.

Goals, Non-Goals, and Assumptions

Teams narrow the problem before they discuss solution details. Arseny recommends writing goals, non-goals, and assumptions first (Building Scalable and Reliable Machine Learning Systems). Teams ask what decision the model supports and which success metric matters. They should also state what the system explicitly won’t solve and which product constraints are fixed.

By narrowing the problem first, teams avoid turning design work into a model wish list. For an edge or mobile ML system, latency and frames per second may matter more than offline accuracy. Energy use, model size, offline behavior, and platform constraints may matter more too. For pricing, search, or recommendation, business behavior and metric choice guide the architecture before model selection starts.

Data Strategy and Baselines

Guests treat the baseline as a design requirement. Arseny says the team needs a baseline and a metric. Without both, the team can’t know whether a proposed ML solution is useful (Building Scalable and Reliable Machine Learning Systems). Valerii reinforces the same practice with simple baseline solutions. They validate hypotheses quickly before the team over-invests (ML System Design).

Teams should use the data strategy to record whether the data exists. They should also explain where the data comes from and how they process it. The needed features belong there too. Teams should say whether a data lake or pipeline already supports the flow. They should also state whether the system needs batch, streaming, or online serving.

Those questions link the design document to Data Quality and Observability and Data Pipelines. An ML design without a data path hides the most common production failure modes.

Ownership and Living Documentation

Valerii repeatedly stresses that a design document is a living artifact, not a finished approval file (ML System Design). If implementation changes and the document isn’t updated, the document loses its value. His answer is modularity and accountability. Chapters or sections should have owners, and the document should make responsibility visible.

That ownership also reduces bus-factor risk. When only one person understands a model, data pipeline, or integration, the system becomes fragile. The same risk applies when one person understands the monitoring setup. Teams can use the design document to map those dependencies. They can also make maintenance work explicit.

Readers can connect this ownership work to Documentation and Software Engineering. People get value from the artifact only when it helps them coordinate real engineering work.

Monitoring and Fallbacks

Monitoring and fallback planning belong in the design document before launch. Valerii names data drift, concept drift, and prediction drift as design-doc concerns. He also includes model quality, data quality, and reliability. Monitoring, serving, and inference optimization belong there too (ML System Design). The team should decide what can go wrong and how the system should react before an incident happens.

A fallback might be a simple baseline, a previous model, or manual review. It might also be redundancy, graceful degradation, or a different serving path. The right choice depends on the product risk. Fraud detection, pricing, and search products have different failure costs.

Healthcare and education products differ too. They all need a documented answer for what happens when the model, data, or integration fails.

Requirements and Team Alignment

Nadia Nahar’s Software Engineering for Machine Learning episode broadens the design-doc conversation beyond the document. Her research discussion shows hidden technical debt from unmet requirements, poor data, and deployment gaps. It also covers CRISP-DM and Agile mismatches, siloed teams, and unclear communication. Documentation helps only when it’s part of shared vocabulary, requirements alignment, engineering support, and testing.

The people who will live with the system should review ML design documents. Reviewers should include data scientists, ML engineers, software engineers, and product owners. Platform teams, domain experts, and operations owners should review them too. A design document that only satisfies the modeling team can still fail at deployment, monitoring, or user adoption.

These pages cover adjacent ML design and delivery topics: