Designing Machine Learning Systems

by Chip Huyen

The book of the week from 27 Jun 2022 to 01 Jul 2022

Machine learning systems are both complex and unique. Complex because they consist of many different components and involve many different stakeholders. Unique because they’re data dependent, with data varying wildly from one use case to the next. In this book, you’ll learn a holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive to changing environments and business requirements.

Author Chip Huyen, co-founder of Claypot AI, considers each design decision–such as how to process and create training data, which features to use, how often to retrain models, and what to monitor–in the context of how it can help your system as a whole achieve its objectives. The iterative framework in this book uses actual case studies backed by ample references.

This book will help you tackle scenarios such as:

  • Engineering data and choosing the right metrics to solve a business problem
  • Automating the process for continually developing, evaluating, deploying, and updating models
  • Developing a monitoring system to quickly detect and address issues your models might encounter in production
  • Architecting an ML platform that serves across use cases
  • Developing responsible ML systems

Questions and Answers


Hello, what do you think about end to end ML solutions? Tools that will help you do more than just one thing (for instance: training, monitoring, and serving). Are we going this way or specialized libraries (that do one thing exceptionally) for each task are the future?


Also, one more thing. Do you cover testing ML models before putting them in production in your book?
Thanks for coming here and answering some questions!

Chip Huyen

yes!! half of chapter 6 and half of chapter 9!

Dr Abdulrahman Baqais

Hi, Is this book targeting data scientists or machine learning engineers more closely?

Chip Huyen

depends on how you define data scientist / ML engineer. they’re defined differently at different orgs
this book discusses ml systems as a whole, so i believe it has a lot of relevant points for both DS and MLEs

Dr Abdulrahman Baqais

Can the design be applied to deep learning, NLP, recommendation systems?

Chip Huyen


Dr Abdulrahman Baqais

Does the book explore pre-modeling stages such as data engineering? The modern data stack?

Chip Huyen

yes, chapter 3 discusses data engineering

Dr Abdulrahman Baqais

Does it explore ML systems in the cloud? For embedded devices? Or at the edge?

Chip Huyen

yes, that’s what chapter 7 is about

Shaksham Kapoor

First of all, I would like to thank you Chip Huyen for all that you have done (and still do) for the ML community 🙂 🙏 I learn something new almost every time I read your blog.
My question to you is: what do you think are the bare essentials that a data scientist needs to know?
> Context:
> I have seen poorly defined JDs where the expectation is that a data scientist knows literally anything and everything between ML and DevOps. The issue I see with such roles is that these are separate fields, and mastering either of them is challenging (let alone both) given the rapid pace at which the fields evolve. I am a core data science person and I am good at it; however, I know that there are still a lot of things in data science (alone) that I need to learn to become a complete data scientist.
> Having said that, I also want to gain experience in other components of the ML lifecycle, but where do you draw the line and say that “this much” DevOps is sufficient for a data scientist to know?

Chip Huyen

thanks Shaksham Kapoor for your kind words!

Chip Huyen

agree that JDs can be ambiguous. at the same time, different orgs have different challenges. 2 data scientists at 2 different companies, or even at the same company on different teams, can do very different things.
i’d work at it backwards. figure out what teams you want to join, read their tech blogs / look at their JDs / talk to people on their team and figure out what problems they care about

Shaksham Kapoor

I agree. I have observed the following in my experience:

  • Startups (any stage) - they require an all-rounder, who knows anything and everything.
  • Mid-size firms (>1000 to <=10000) - there is a split. Some require an all-rounder, while others don’t.
  • Large-size firms (>=10000) - they don’t need an all-rounder but an expert. I believe the main reason for this is the presence of separate teams (sometimes departments) that take care of different components of ML, so one can get a chance to really focus on core ML stuff.

Dan Gurin

What do you perceive as the biggest challenges / opportunities on the horizon for the future of machine learning?

Chip Huyen

real-time ML!

Jeanine Harb

Hello Chip Huyen! Thank you for this book, can’t wait to read it!
From what I’ve seen in the field, a lot of ML projects have a hard time getting into production. In your opinion, what are the biggest hurdles and how can companies/organizations overcome them?

Chip Huyen

nice to meet you Jeanine! i think the biggest hurdle is that companies don’t invest enough into infra to enable data scientists to do their jobs. ml production is largely an infra problem.
hiring infra engineers doesn’t solve it though. infra engineers will need to work closely with data scientists to understand their workflows. not all companies have their communication channels set up to enable cross-functional team communication.

Sergio Rozada

Hi Chip Huyen! It’ll be a pleasure to read your book, seems really insightful!! I would love to hear your opinion about batch vs online inference. Thank you very much!

Chip Huyen

i have 10 pages in my book discussing batch vs. online prediction alone so not sure how to respond to this in a message 😅

Contato Rupp

Hi Chip Huyen! In your opinion, which aspect of ML system design is particularly hard to iterate on? And when should teams think about a complete redesign of their systems?

Chip Huyen

devops teams track metrics such as:

  • change failure rate: how often does a new update fail?
  • time to detection: how long does it take to detect a problem?
  • time to response: after a problem is detected, how long does it take to address it?

we should track similar metrics in mlops. how often do new models fail? how long does it take to discover a new model failure? how long does it take to update models / debug the systems?
teams should think about redesign whenever those metrics fall short of their expectations.
and if they can’t track those metrics, they should definitely consider redesign 🙂
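
As an illustration only (not from the book), the three metrics above could be computed from a simple log of model releases. This is a minimal sketch: the `ModelRelease` record and all of its field names are assumptions made up for the example, and a real team would pull these timestamps from its own deployment and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ModelRelease:
    """One model release; all field names here are hypothetical."""
    deployed_at: datetime
    failed: bool = False
    detected_at: Optional[datetime] = None  # when the failure was noticed
    resolved_at: Optional[datetime] = None  # when a fix or rollback shipped


def change_failure_rate(releases: List[ModelRelease]) -> float:
    """Fraction of releases that failed in production."""
    if not releases:
        return 0.0
    return sum(r.failed for r in releases) / len(releases)


def _mean(deltas: List[timedelta]) -> Optional[timedelta]:
    """Average a list of durations; None when there is nothing to average."""
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_detection(releases: List[ModelRelease]) -> Optional[timedelta]:
    """Average delay between deploying a failing model and noticing the failure."""
    return _mean([r.detected_at - r.deployed_at
                  for r in releases if r.failed and r.detected_at])


def mean_time_to_response(releases: List[ModelRelease]) -> Optional[timedelta]:
    """Average delay between noticing a failure and shipping a fix or rollback."""
    return _mean([r.resolved_at - r.detected_at
                  for r in releases if r.failed and r.detected_at and r.resolved_at])


# Made-up release history: four releases, two of which failed.
releases = [
    ModelRelease(datetime(2022, 6, 1)),
    ModelRelease(datetime(2022, 6, 8), failed=True,
                 detected_at=datetime(2022, 6, 8, 6),
                 resolved_at=datetime(2022, 6, 8, 8)),
    ModelRelease(datetime(2022, 6, 15)),
    ModelRelease(datetime(2022, 6, 22), failed=True,
                 detected_at=datetime(2022, 6, 22, 2),
                 resolved_at=datetime(2022, 6, 22, 6)),
]

print(change_failure_rate(releases))     # 0.5
print(mean_time_to_detection(releases))  # 4:00:00
print(mean_time_to_response(releases))   # 3:00:00
```

If a team cannot fill in a table like `releases` at all, that matches the last point above: the metrics are not being tracked yet.
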
Shaksham Kapoor

Another question for you Chip Huyen - how do you keep yourself updated with the breadth and depth of techniques that are available/upcoming in data science?

Chip Huyen

by working with people smarter than me!


Hi Chip, thank you for the work you do and the QnA! I hope to earn an in-person interaction opportunity with you someday!
Q: What are some strategies for data collection when approaching a problem? What is the role of data ethics in the process?

Chip Huyen

great question. data ethics is a nuanced topic that i don’t think i can discuss in one message. chapter 4 covers data collection


Q: What are a few critical points to keep in mind for the scalability of a system? Should scalability be planned for in the initial design of the ML system, or can it be planned in future iterations when the need arises?

Chip Huyen

depends on when you anticipate the scaling issues to arise: a week from now or a year from now?
i’m a big fan of scalability, but i’m not a fan of premature optimization


Thank you for the insight! I agree; it is more important to get things running than implement the system to perfection.

Q: For someone trying to dive deeper into ML, how should they pick a specific branch to master? I am interested in a number of branches in ML (NLP, CV, time series …), but unable to choose one specifically. Do you suggest a framework for this?

Chip Huyen

you know the venn diagram that says to choose the intersection of:

  1. what you’re interested in
  2. what the world needs
  3. what you’re good at

it sounds like all branches of ML are pretty needed right now (though time series and tabular data might be a little bit underrated compared to CV & NLP)
i’d choose based on what problems interest me and what i’m good at

I think it’s called Ikigai. Will keep this framework in mind going forward.

Chip Huyen

good luck!

Ashish Lalchandani

Hello Chip Huyen! Thanks for being here! My questions are:

  1. As someone who is looking to switch from data engineering to MLOps, what are the skills required for it and what is the best way to prepare for interviews? By focusing more on projects?
  2. How to determine the right amount of MLOps for a project?

Chip Huyen

  1. yes, real-world projects. but it can be difficult if you don’t already work at a company.
  2. the right amount is whatever is needed to achieve the project’s goals!

i also wrote a free book on interviews – hope that helps!

Ashish Lalchandani

Oohh great, you have written on interviews too, awesome! Thanks for contributing so much to the community, much appreciated!

Sergio Rozada

Hello Chip Huyen, another question here, can you share any insight into how to make scalable systems with large models (e.g. transformers)? Thank you very much!!

Chip Huyen

hi Sergio, thanks for your question! i’d need more context on:

  • how your system fails at scale (e.g. higher latency, higher cost)
  • where your system fails when scaling (e.g. if it’s high latency at inference, is it due to network latency or model latency), etc.

Sergio Rozada

Hi Chip, thanks for your answer. For the sake of clarity:

  • You hit the nail on the head! In our case, we struggle a lot to improve latency while keeping costs under control. We’re running on a Kubernetes cluster in GCP (managed by us).
  • Our main latencies are model latencies.

Utku Savaş

Hi Chip Huyen, how should we perform monitoring on computer vision related projects? Thank you.

Chip Huyen

this would be a looooong answer 😭 maybe we can start with: what problems do you have with monitoring CV projects rn?

Utku Savaş

Hi, I can check model results in the UI, but I want to monitor whether data drift (or similar problems) occurs in our projects. Which programs or approaches should I use to notice data drift on image data as early as possible?

Tim Becker

Hello Chip Huyen, thanks for being here. I would like to know:

  • What are frequent mistakes when designing ML systems and how to avoid them?
  • What considerations should I keep in mind when setting up automated re-training? How often should it be done? Only if the quality of the model decreases?
  • How do you document your ML projects?

Chip Huyen

hi Tim, thanks for your questions!
i wish there were shorter answers to those questions, but they are complicated 😭 i have a pretty long section in my book on the myths of ml deployment and half of a chapter on question 2 (which includes a section discussing how often to update models). here’s the summary
Eugene Yan has a lot more thoughts on documenting ML projects 👀

Warrie Warrie

Hello Chip Huyen. Thanks for investing time here.

  1. As an academic researcher and application engineer, in which areas of ML system design have you observed the most disparity between the 2 fields?
  2. How critical are software engineering skills in designing an ML system?

Chip Huyen

hii Warrie, i gave a talk on 1 here
engineering is critical for designing ml systems



Max Payne

How different is the design process for traditional ML vs DL?

Chip Huyen

i think the difference is less between ML vs. DL but between different projects, goals, and constraints


> Q: let’s say you are building a model with X technique which is in production, but now you see that Y technique, which is state of the art, outperforms X during development. do you remove model X and replace it with model Y?
Hello Chip, this question was earlier asked by a fellow member in another channel of the DTC workspace. I am interested to learn of your opinion
Source (with suggestions from other members):
cc: Doink

Chip Huyen

i’d agree with the first answer there: run experiments (both online and offline) to compare the 2 models on the metrics you care about (not just overall performance metrics but can also be other metrics like latency, inference cost, interpretability, how easy it is to update each model, how likely each model’s performance will improve over time with more data)
chapter 6 has a pretty long section on the framework for comparing 2 different models. here are 6 key points:

  1. Avoid the state-of-the-art trap
  2. Start with the simplest models
  3. Avoid human biases in selecting models
  4. Evaluate good performance now versus good performance later
  5. Evaluate trade-offs
  6. Understand your model’s assumptions
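
To make the trade-off idea above concrete, here is a minimal sketch (an illustration, not the framework from the book) of a replacement rule that looks at more than one headline metric. The `ModelReport` fields, the `pick_model` helper, the thresholds, and all numbers are hypothetical; in practice the inputs would come from the offline and online experiments mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelReport:
    """Summary of one candidate model; every field name is an assumption."""
    name: str
    accuracy: float           # offline evaluation metric, higher is better
    p99_latency_ms: float     # serving latency, lower is better
    cost_per_1k_preds: float  # inference cost, lower is better


def pick_model(incumbent: ModelReport, challenger: ModelReport,
               max_latency_ms: float, max_cost: float,
               min_gain: float = 0.01) -> ModelReport:
    """Keep the incumbent unless the challenger fits the serving budget
    and improves the quality metric by at least `min_gain`."""
    if challenger.p99_latency_ms > max_latency_ms:
        return incumbent
    if challenger.cost_per_1k_preds > max_cost:
        return incumbent
    if challenger.accuracy - incumbent.accuracy < min_gain:
        return incumbent
    return challenger


model_x = ModelReport("X", accuracy=0.90, p99_latency_ms=20, cost_per_1k_preds=0.05)
model_y = ModelReport("Y", accuracy=0.93, p99_latency_ms=180, cost_per_1k_preds=0.40)

# Y is more accurate, but it breaks a 100 ms latency budget, so X stays.
print(pick_model(model_x, model_y, max_latency_ms=100, max_cost=0.50).name)  # X
# With a looser budget, the same rule promotes Y.
print(pick_model(model_x, model_y, max_latency_ms=200, max_cost=0.50).name)  # Y
```

A hard-coded rule like this only covers measurable constraints; points such as interpretability or how each model will improve with more data still need human judgment.
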
Ramsi Kalia

Chip Huyen Hi Chip, thanks for answering questions here!
I am wondering how reusable ML systems are, in your opinion? With every new problem statement, the dataset, features, splits, training etc. change, right? I am trying to understand how much value can be generated from building entire systems around ML projects?
I understand that for large industries like Uber etc. if the models are being used daily and retrained very often it would make a lot of sense.
But for smaller businesses, where models are retrained infrequently and there is a larger focus on trying out different techniques and newer models, how should we justify the extra time and effort spent on building systems? And what would the benefit be?
Appreciate any insight you can offer!
Thanks again for your time!

Chip Huyen

i think you answered your own question!
100% agree with you that how much to invest into a reusable ML platform depends on how much you need it (not just today but also in 6 months).
it sounds like you’ve spent a lot of time thinking about it. curious what your approach for this is?

Ramsi Kalia

Hi Chip Huyen, I don’t really have a plan of action atm.
I’m working at a startup in the automotive space (we just won the NASSCOM Gamechangers Award in the transport & logistics category; the company is Carscan), but we don’t really have full-fledged systems for retraining.
There is active learning in place for the primary damage detection model, but even for that we’ve been trying out different methods to improve performance (e.g. SAHI, Autoassign, PAANET etc.)
For edge deployment models, up till now we just trained a bunch of classifiers (cos they’re faster than object detection) and moved them to s3 and work from there.
We’ve had numerous discussions on how to streamline the edge deployment of models for the webapp and we just seem to be going in circles lol.
The last sprint was focused on edge deployment and we tried out Hydranet, multihead models, mobilenet ssd, and multilabel classification, to name a few.
With regards to code reusability and experiment/model tracking, I am giving a training session to my team on DVC this coming Monday (still figuring it out for myself).
However, for designing systems, I think we’re still lost.
An argument could be made in favor if the benefits of taking such a thing on were greater than what I seem to be understanding atm.
Have you worked with startups before?
Do you have any case studies you’d be open to sharing?

Sandhya G

Chip Huyen, this is an exciting book. How do we determine which problems are worth exploring if ML is a viable solution? Sometimes, ML might add more cost/complexity than using an expert to solve the problem. For example, experts scan ground scan data to determine where to drill for oil. Using ML here may be difficult as 1. we do not have a lot of training data 2. experts may be using their intuition, which may be hard to codify. What is a framework for answering whether ML is a viable route with a decent chance of success given the cost to develop it? Also, is there any framework for ML-assisted workflows? Thanks!

Chip Huyen

whoa that does sound like an interesting challenge. how are you going about approaching this?

Sandhya G

In my workplace, this is mostly based on instinct (experience). Do a pilot, about 6 months, see where we get. We also go for a lot of non ML components for the solution (data management and access, for example) so that customers get guaranteed value.
However I’ve seen this in my learning too. I’d start off with something in mind, but fail to produce a good model. Since this is for learning, I do not know if it is because I am not doing it right or if the problem is not amenable.

Dustin Coates

Chip Huyen thank you for doing this. I’m coming at this from a PM perspective, and one question we have is: how do you show incremental added value in early days of ML projects?

Chip Huyen

oooh this would be a fun discussion. happy to set up a call to discuss more!


Chip Huyen thanks for doing this! My initial 2 questions:

  • What’s the latest industry trend with respect to ML system design? I have heard (and read blogs) that there is increasing online learning & retraining, is this true?
  • What would you recommend as the first set of skill(s) to pick up for a Software Engineer transitioning to an ML Engineer apart from just the ML Theory basics?

Chip Huyen

hi Bharat thanks for stopping by!

  1. i’m very big on online prediction and continual learning, so perhaps i’m biased, but i’ve talked to a LOT of companies interested in online learning & retraining!
  2. hmm this is a hard question as the answer depends on who’s asking. for me, i love databases and devops!

Gur Hevroni

Hi Chip Huyen! Great to have you here 🙂 Your book sounds super interesting!
Do you cover in the book the things you should consider before you start building/designing anything? E.g. how to approach a problem, how to validate your understanding of the situation and use cases, etc.
Looking forward to learning more about your book 📖

Chip Huyen

Hi Gur, nice to meet you and thanks for your kind words!!
Yep that’s what the first 2 chapters are about!

James Gough

Hi Chip Huyen I have what I think is an easy or at least mundane question for you (or anyone else if they know). 🙂
I want to buy your book but do you know how compatible your book is with Kindle? I know O’Reilly’s subscription service doesn’t provide Kindle-compatible ebooks so wondered if there’s been much QA for the Kindle version. I’ve bought some programming books on Kindle before and the formatting isn’t always the best. Thank you.

Chip Huyen

ooh i’d love to know that too. if anyone does know please lmk. i’ve only looked at the book sample on kindle and it seems fine!!!?

Ashish Lalchandani

Hi Chip Huyen, another question - What are the best practices for ML in production, and as someone who is looking to switch from data engineering to MLOps, what are the bad practices to avoid?

Chip Huyen

Hi Ashish, nice to meet you!
Sooo best / bad practices are going to be a loong conversation – I think it might take a book 😛
Short answer: i wish we had better engineering best practices in MLOps!

Ashish Lalchandani

Thanks Chip Huyen.. waiting for your future books then 🤪 hehe

Chip Huyen

haha it’s part of this book 😛

Ashish Lalchandani

Oh same book, nice! Will have to wait till i get my copy then😅 🙌


Hi Chip Huyen, thanks for taking the time and providing the offer to answer questions. I’d like to know

  • What’s the approach of the book: more theoretical explanations that give an overview of the topic, or a practical one with hands-on code / best-practice examples to work through yourself?
  • Which chapter was the most difficult to write and why?
  • Just out of curiosity: team PyTorch or Keras/Tensorflow?

Chip Huyen

Hi Daniel!

  • I think it’s practical with examples but very little code.
  • They’re all difficult in different ways, but Chapter 3 (data systems), 8 (distribution shift / monitoring), 10 (ml platform) took the longest time!
  • PyTorch & JAX!

Hi Chip Huyen, thank you for taking the time to share your opinions. I’d like to ask about best practices for improving ML/DL inference time. What can we do/consider ahead of time to improve inference time when designing the ML/DL system? Thank you a lot!!

Chip Huyen

Hi JC, quantization seems to be the most common method today!

Gagan M

Can we also add Knowledge Distillation if we are dealing with DL?

Chip Huyen

yes! in the Model Compression section in the book i discussed the pros and cons of distillation vs. other compression methods like quantization
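
For readers new to the term, here is a minimal sketch of what quantization does (an illustration, not taken from the book): it maps float weights onto 8-bit integers using a scale and an offset, trading a small rounding error for a smaller, faster model. This is a toy min-max scheme in pure Python; the helper names `quantize_int8` and `dequantize` are made up for the example, and real frameworks ship their own quantization tooling with many more options.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float, float]:
    """Map floats onto integers in [-128, 127] with a min-max affine scheme.

    Returns the quantized values plus the scale and minimum needed to
    approximately restore the original floats.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when all weights are equal
    quantized = [round((w - lo) / scale) - 128 for w in weights]
    return quantized, scale, lo


def dequantize(quantized: List[int], scale: float, lo: float) -> List[float]:
    """Approximate inverse of quantize_int8."""
    return [(q + 128) * scale + lo for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.3, 1.0]
quantized, scale, lo = quantize_int8(weights)
restored = dequantize(quantized, scale, lo)

# Each restored weight differs from the original by at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized[0], quantized[-1])  # -128 127
print(max_error <= scale / 2 + 1e-9)  # True
```

Each weight now needs 1 byte instead of 4 (for float32), which is where the memory and latency savings come from; the cost is the rounding error bounded above.
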

Philip Dießner

Hello Chip Huyen, as is evident by the multitude of questions (Thanks for answering all of them!) your book is definitely hitting a nerve with the community.
Can you talk a bit about how your writing process worked?
What was your motivation in finding the breadth and depth of the content?
And as all is now done, would you do it again (e.g. writing another book)?

Chip Huyen

hi Philip, nice to meet you!
the writing process was an iterative one over 4 years with a ton of feedback and restructuring!
i’d love to write another book one day, but i’m not sure about the topic yet. would love to hear if you have any topics you’d be interested in learning more about!


Decentralized Machine Learning?


Hi Chip Huyen! Thanks so much for taking the time to answer questions here.
This book sounds super interesting and relevant! I know some of your blog postings have been about real-time ML. To what degree is this covered in the book? (Seems like Ch9 deals with it, at least in part).

Chip Huyen

Hi Allan, you’re right that i’m very excited about real-time ML 😄 in the book, real-time ML appears in multiple parts:

  • data engineering (batch vs. streaming)
  • deployment / prediction service (batch vs. online prediction)
  • real-time observability
  • continual learning

Thanks Chip, yes it seems like real-time ML is really something wanted/needed in industry today. Looking forward to learning more about it! 😀
