
DataTalks.Club

Reliable Machine Learning

by Todd Underwood, Kranti K. Parisa, Cathy Chen, Niall Murphy

The book of the week from 05 Dec 2022 to 09 Dec 2022

Whether you’re part of a small startup or a multinational corporation, this practical book shows data scientists, software and site reliability engineers, product managers, and business owners how to run and establish ML reliably, effectively, and accountably within your organization. You’ll gain insight into everything from how to do model monitoring in production to how to run a well-tuned model development team in a product organization.

By applying an SRE mindset to machine learning, authors and engineering professionals Cathy Chen, Kranti Parisa, Niall Richard Murphy, D. Sculley, Todd Underwood, and featured guest authors show you how to run an efficient and reliable ML system. Whether you want to increase revenue, optimize decision making, solve problems, or understand and influence customer behavior, you’ll learn how to perform day-to-day ML tasks while keeping the bigger picture in mind.

You’ll examine:

  • What ML is: how it functions and what it relies on
  • Conceptual frameworks for understanding how ML “loops” work
  • How effective productionization can make your ML systems easily monitorable, deployable, and operable
  • Why ML systems make production troubleshooting more difficult, and how to compensate accordingly
  • How ML, product, and production teams can communicate effectively

Questions and Answers

Cyril de Catheu

Hey, thanks for sharing.
How does the book relate to the google SRE books? Should I read it if I’ve read (and applied) those?

Niall Murphy

Hi Cyril - to be clear, there’s no “should” here. No requirement to read one or the other before anything. The books speak to each other in the sense that they come from a similar mindset, but they’re actually about pretty different things. (They’re not in the same grouping at the publisher, for example.)
Having said the above, the Google SRE books are a good intro to questions about systems thinking in general, and obviously share some authors in common (though of the two, one has not been in Google for five years).

Cyril de Catheu

Thanks Niall Murphy!

Cyril de Catheu

I find it difficult with ML issues to understand whether an issue is related to infra, modeling, or data.
Is this something discussed in the book?
In a few words do you have strategies to approach this problem?

Niall Murphy

Yeah, good point, that is definitely one of the challenges. We deal with this in the monitoring chapter, though not quite as directly as you put it there.

Cyril de Catheu

Thanks Niall Murphy.

Eunice

Hi, thank you very much for answering our questions. Do you see any big difference in the MLRE process between deep learning and machine learning projects?

Niall Murphy

Nope. I would say it’s broadly unchanged under both methods, except to the extent you can ignore training data for RL processes - if you don’t have training data to manage, a bunch of complexity goes away.

Eunice

A second question, how did you get the excellent idea of writing this book?

Niall Murphy

You’re very kind! We talk about this a bit in https://www.youtube.com/watch?v=ypt62rP3zhg but the basic story is - there’s a lot in the MLOps world that’s a total wild-west situation right now, Todd in particular has some great experience that other people could benefit from, and Todd and I were going to write a book together anyway 😉 But then we realised the actual book to write was one that talked about everything in ML that wasn’t building a model, which includes data, privacy, monitoring, organizational design, etc etc. Those things don’t get talked about in the one book.

Alexey Grigorev

Hi everyone! Thanks for being here!
Do you think ML reliability engineer is already an established role? Or most of the time it’s “usual” SREs working in ML teams?

Cathy Chen (she/her)

From my exposure, it is normally SRE types who have a special interest in ML, but usually their expertise is in SRE topics, not ML topics.
todd underwood did a few talks at SREcon about this:
https://www.usenix.org/conference/srecon22emea/presentation/underwood
https://www.usenix.org/conference/srecon21/presentation/underwood-sre-ml

Niall Murphy

Speaking personally - I would disconnect from the job title and think about the behaviours. Are there people in your org who are concerned about the reliability of the ML stack? If so, maybe without realising it, they are MLREs.
My intuition today is that there are loads of people who care about the reliability of what they’re doing, but very few of them have the job title MLRE. I suspect those that do, are in places with lots of existing SREs.
I have no opinion about whether or not MLRE will establish itself as a job title, but I think the behaviour of caring about the reliability of your production stack is going to continue indefinitely.

Alexey Grigorev

> How ML, product, and production teams can communicate effectively
Can you give a short summary?
Also, what does each of these teams do? Why do we need three teams? And in which cases should it be just one team with all three functions inside it?

Cathy Chen (she/her)

This would really be a great one for Kranti Parisa! I’ll check in with him if he’s doing any better.

Cathy Chen (she/her)

ML engineering team could be the team who creates and trains the models. Think your data scientists, with a bit of training and serving expertise. (In case you haven’t read the book: we have an imaginary online shoppe, yarnit.ai.)
For yarnit, this may be the team making ML models for suggesting additional purchases on the “view my cart” page.
Product could be the team who owns the overall product, or the rest of the product. For yarnit, the shopping cart might be part of the product team.

Cathy Chen (she/her)

ML production is the team who ensures that all the training and serving infrastructure is up and working.

Niall Murphy

large company bias, of course, but yes, those functions can all totally co-exist in the same team. in which case the problem isn’t necessarily communication, it’s communication of value of the different types of work, and deciding how to allocate resources towards those types of work.

LOIC EMO SIANI

Hello. After reading the book, can we apply for a MLRE role in a company?

Niall Murphy

me grins

Niall Murphy

I think there’s nothing to stop you applying even before you read the book

LOIC EMO SIANI

Hello. How many pages does this book have?

Niall Murphy

I strongly suspect you’d be as capable of establishing that as I am: https://www.amazon.co.uk/dp/1098106229?linkCode=gs2&tag=oreilly20-21

LOIC EMO SIANI

Hello Cathy Chen (she/her) A1IndiaDS Kranti Parisa todd underwood. I have a question. What are the tools used in the book to set up a system to proactively monitor ML models ?

Niall Murphy

We are deliberately “tool agnostic”. We don’t recommend any specific tools (we do mention a few). We focus on general principles instead.
(This might mean we’re not a good book for you, but we think many of the principles and recommendations can be easily translated into whatever tools you happen to be using.)

LOIC EMO SIANI

What are the relevant skills required to be a MLRE ?

Niall Murphy

It’s more of a mindset than a skill, as such, but if I had to pick one, I would say “systems thinking”. This is IMHO the largest distinction between product developers focused on features and … well, more or less anything else, really.
Systems thinkers focus on the whole thing, and think deeply about the interaction between subcomponents of the system. We believe this aligns very well with ML, which almost by its nature is very cross-cutting/horizontal.
In order to develop this skill, I would probably start off by looking at these books: https://www.mostrecommendedbooks.com/lists/best-systems-thinking-books but if you’re looking for something more concrete I can follow up.

Michal

Hello, what are the main differences between your book and Chip Huyen’s Designing Machine Learning Systems?

Cathy Chen (she/her)

todd underwood read Chip’s book 📚 so hopefully can answer this best (Niall Murphy as well?)

Niall Murphy

Yep. In my opinion the differences are:

  • Chip focuses on model creation. We don’t really touch that at all. (And you might ask - an ML book NOT about model creation?? How does that make sense?? To which the answer is…)
  • We focus on “the whole system”, including the organizational, ethical, and technical components of what it means to have ML in your company.
    There’s plenty to learn from both (and we both blurbed each other’s books, IIRC), but you might start with Chip’s book if you’re an IC (ML) engineer who cares most about making models and the wider context doesn’t matter to you. You might start with our book if you’re an IC engineer in DevOps or SRE, a PM/TL, a manager, or a VP, and you care about the wider context.

Michal

How many code examples does your book include? Is it primarily focused on Google Cloud / TensorFlow, or is it framework agnostic?

Cathy Chen (she/her)

We don’t have code examples as we tried to stay agnostic of frameworks. We did have some reviewers from Google Cloud since three of the authors are Google employees, but that was mainly to ensure we didn’t do anything counter to their product direction.

Michal

What’s the difference between MLOps engineer and ML SRE?

Cathy Chen (she/her)

This really depends but what we are finding is that AIOps usually refers to using AI to automate operations and ML Ops refers to the teams or people who operate the ML system/pipelines. Our definition of ML SRE is aligned with that definition of ML Ops.
Also, todd underwood was in a panel at SRECon 2021 where a variety of folks discussed OpML https://www.usenix.org/conference/srecon21/presentation/panel-opml

Niall Murphy

There isn’t a huge distinction IMHO, and whatever distinction exists right now, the linguistic meat grinder of the industry will mess up over time 🙂

Michal

Thank you for your answers x3 🙂

Carlos Pumar

hi! While reading thru the table of contents of your book, I wondered:

  1. Organisational aspects:
    a. assuming a firm has interest in creating internal capabilities for creating data-driven processes (e.g. creating a team of specialists which collect data, select models, deploy and maintain them):
    i. what could be criteria for deciding between investing in (re-)training their existing staff vs. hiring specialists from “outside”?
    ii. what measures could be undertaken to minimise the risk of people leaving and potentially creating a “hole” (in terms of knowledge management)
    b. assuming a firm has not yet decided between creating internal capacities, vs. leveraging from services from third institutions (“SaaS”):
    i. what could be criteria for deciding between these options?
    c. how could one, in your view, argue in favour of using open-source libraries in a firm whose decision-makers might be skeptical of them (mainly for security reasons)?
  2. Model deployment and maintenance:
    a. even though I lack practical experience, it seems to me that “batch training” is the default methodology for training models (as opposed to the alternative of “online training”)
    i. if this is true: what are the main reasons that “online training” has not established itself as much as batch, especially vis-à-vis the risk of “data drift” (which online training systems don’t suffer from)?
    ii. what are good practices for minimising the risk of data leakage (in the training phase) so that model performance doesn’t get affected in production?
  3. Does your book cover aspects regarding regulatory issues as to how to manage data (e.g. often financial institutions are strongly regulated by norms defined by regulatory bodies, which limit financial institutions in their data-management decisions (e.g. model deployment on the cloud))?
    Many thanks in advance!

Niall Murphy

Cathy Chen (she/her) would probably have some ideas for a lot of this, but I’ll do what I can.
1ai: time, throughput, and feasibility, as always. do you have a specific deadline for doing something which is infeasible with current staff? there’s a longer discussion that can be had here
1aii: the best ‘conversion’ project I know of (netops folks to SREs) promised and actually gave transition/education support, had a reasonable timeframe (~18 months) and strong management support. in general many folks are up for expanding their capabilities as long as they’re treated well.
1bi: the classic build-vs-buy (see e.g. https://rootly.com/buy-vs-build) discussion generally ends up dividing things like this: ‘if it is core to your business or feasibly could become so in a reasonable time-frame, build it. otherwise buy it.’
1c: OSS and security isn’t really my thing but see e.g. https://www.zdnet.com/article/is-open-source-as-proprietary-software-these-tech-chiefs-think-it-is/ - if this is specifically within the context of ML i would probably argue that right now the whole domain is so wild-west that there is no significant quality difference between enterprise software and startup code and OSS.
2ai: (i call “online training” “continuous training” instead.) they are both very popular in different contexts, but “the large companies” have the resources to make continuous training a practical thing. to my mind continuous training absolutely has risk of data drift just like batch. it’s just a question of the periods between the batches.
(short answer: resources IMHO)
2aii: not sure what you mean by ‘data leakage’ - can you expand?
3: it mentions regulation in the ethics & privacy chapter, mostly in the context of motivating/illustrating various privacy techniques. many of the things you do in order to preserve privacy, balance training data, etc, also end up being the things you’d do to align with e.g. financial regulation. we don’t discuss financial regulation in detail. off-hand comment based on experience: many (not all) assertions of “oh we can’t do X because the regulator forbids it” are considerably more nuanced than that if you actually get to talk to a regulator.
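Niall’s point that continuous training is “just a question of the periods between the batches” can be sketched roughly like this. A toy illustration only: the function names and cadences are invented, not from the book.

```python
from collections import deque

def retrain(examples):
    """Stand-in for a real training job; returns a fake 'model version'."""
    return {"trained_on": len(examples)}

def batch_loop(stream, batch_size=100):
    """Batch training: accumulate data, retrain on a fixed large cadence."""
    buffer, models = [], []
    for example in stream:
        buffer.append(example)
        if len(buffer) >= batch_size:
            models.append(retrain(buffer))
            buffer.clear()
    return models

def continuous_loop(stream, window=20, period=5):
    """'Continuous' training: same mechanics, but retraining far more often
    over a sliding window - a much shorter period between the batches."""
    recent, models = deque(maxlen=window), []
    for i, example in enumerate(stream, start=1):
        recent.append(example)
        if i % period == 0:
            models.append(retrain(list(recent)))
    return models

stream = list(range(100))
print(len(batch_loop(stream)), len(continuous_loop(stream)))  # 1 vs 20 retrains
```

Either way, drift is bounded by how stale the most recent retrain is allowed to get, which is why Niall frames it as a resourcing question rather than a categorical difference.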

Carlos Pumar

thank you for your detailed response Niall Murphy!
re “data leakage”: I wonder if there are practices to prevent models from having access, during the training phase, to information they would not have when tested on new, unseen data. As an example, you would want to combine cross-validation with pipelines (sklearn package) in order to avoid this leakage…
I was just wondering if there are any other standard measures in practice which avoid the model from picking up information related to the labels at training time - and if your book covers (parts) of these measures. thx again
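The sklearn pattern Carlos mentions can be sketched like this (a minimal illustration on synthetic data, not an example from the book):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Leaky: the scaler is fit on the WHOLE dataset, so statistics from each
# CV fold's held-out split have already leaked into preprocessing.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Clean: the pipeline refits the scaler inside each training fold only,
# so the held-out fold stays genuinely unseen by every step.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
clean_scores = cross_val_score(pipe, X, y, cv=5)

print(leaky_scores.mean(), clean_scores.mean())
```

The same principle extends to any fitted preprocessing (imputation, feature selection, target encoding): fit it inside the training split, never on data the model is supposed to treat as unseen.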

Niall Murphy

Partitioning training data is not an inherently difficult act, but it does require discipline about what you make accessible to the training process (and what you don’t). In the situation where leakage is a threat, I’m wondering what is the information that’s accessible that shouldn’t be? Are we talking about labels being accessible when they shouldn’t be, or labels being present at all?

Carlos Pumar

sorry for being imprecise: I actually meant labels being accessible when they shouldn’t be, i.e. patterns of the labels are accessible to the model during training - potentially making the model underperform on new data.
I am particularly interested in understanding how to avoid this sort of leakage so that performance is satisfactory in production…and just curious to know if there are any “best practices” during the workflow prior to model deployment…thx again!

Cathy Chen (she/her)

For 1….
Go ahead and skim chapters on org design (13-14) because we try to give a framework that you can use to work this out. Based on structure decisions, you need to then think about how to accomplish the best for end-to-end product reliability by adapting process, metrics, people mindsets, and rewards…
1i. Depends on what you’re looking to do. Often a combination of both is useful, as long as you have the right mindset.
1ii. Always be training your replacement? I like the book “Influencer” - there’s a chapter about a restaurant in SF where every person is trained by the last person who started. Everyone eventually does tasks, trains others, mentors others. Of course there’s also the need to specialize. We specialize on projects across regions but generally keep everyone at a baseline knowledge so they can be on-call.
1b-c: sorry, don’t know.

  2. Sorry, not my area of expertise
  3. Not that I know of.

Marc

What would be a book to read before and after reading your book?

Niall Murphy

if you don’t have an SRE background, you might like to get started with one of the big two beforehand.
afterwards? maybe https://www.oreilly.com/library/view/practical-fairness/9781492075721/ if you want to investigate fairness more.
otherwise, uh, i dunno. you’ll probably be quite tired. calvin & hobbes maybe?

Cathy Chen (she/her)

+1

Niall Murphy

(the big two being “the SRE book” and “the SRE workbook”)

David Braslow

As a data scientist who is used to working in notebooks and is just starting to think about how to deploy models reliably at my company, what would be the top 1-3 things you would suggest focusing on?

Niall Murphy

i think others would be better at answering this than me, particularly Kranti Parisa, but i would probably start by thinking deeply about what the non-notebook workflow is gonna look like. they’re terrible for reproducibility imho

Shiro Kulatilake

When looking at the entire ML Loop what is the org structure that works best - having the ML model work being done by a separate team and then integrated into the apps or having a mix of people in various stages of the entire loop ?

Niall Murphy

Hi Shiro: speaking for myself, there is no answer to this question. There isn’t a “best”. Everything, especially organizational decisions, is contextual. I will say that centralised models are better at concentrating expertise and improving domain performance, but can often end up being unresponsive to business needs. Distributed models are better at responding to the needs of the team, but can have staff expertise/retention/etc problems.
Decide what your org is weak in and what needs to be fixed, then take the appropriate action…

Cathy Chen (she/her)

I would say the major thesis of the book is that it is better if people across the organization know enough about ML (and ML is not completely silo’ed). And agreed, there is no best. Every org design has issues and is not always practical or possible. Chapter 14 covers some practices to think about based on what the org structure looks like.

Shiro Kulatilake

Thank you for your responses !

Cathy Chen (she/her)

Apologies that Kranti Parisa is out sick this week. He will try to join late in the week.

Alexey Grigorev

Oh no, I hope he gets better soon!

Alexey Grigorev

What are the most important SRE principles that ML projects should adopt?

Niall Murphy

Gosh. There’s a lot to choose from. I think… well, measuring things is important, and is an SRE principle, but I think model development understands that quite well.
(Sometimes there are question marks around continuing that approach to measurement in production as opposed to the development cycle, but broadly speaking I think we’re covered here.)
Maybe eliminating toil? https://sre.google/sre-book/eliminating-toil/ I suspect there’s still quite a bit of toil around a lot of what ML folks do.
Defense in depth is another good one. Our serving and training systems can be quite brittle; putting some work into having algorithmic fallback can be really useful.
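What Niall calls “algorithmic fallback” might look something like this in serving code. The names and the popularity heuristic here are invented for illustration; the point is only that a cheap rule keeps the request alive when the model path fails.

```python
def heuristic_score(item: dict) -> float:
    """Cheap rule-based fallback: score by recent popularity."""
    return item.get("recent_views", 0) / 1000.0

def score_with_fallback(model, item: dict) -> tuple[float, str]:
    """Try the ML model first; on any failure, degrade to the heuristic
    instead of failing the request (defense in depth)."""
    try:
        return model.predict(item), "model"
    except Exception:
        return heuristic_score(item), "fallback"

class BrokenModel:
    """Simulates a brittle serving dependency."""
    def predict(self, item):
        raise TimeoutError("model server unreachable")

score, source = score_with_fallback(BrokenModel(), {"recent_views": 420})
print(score, source)
```

In production you would also count how often the fallback fires, since a rising fallback rate is itself a reliability signal worth alerting on.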

Ixchel Garcia

hi everyone, I wanted to ask for recommendations on books about data engineering, ML engineering, and data infrastructure. I would deeply appreciate it if you could give me your top 3 choices please. I want to specialize as a DS in the engineering behind data and ML.

Niall Murphy

Hi Ixchel - I’m not sure, but I think this channel is for questions related to the book of the week, rather than books in general?

Ixchel Garcia

Oh sorry, I guessed this was the most appropriate channel of all hahaha :face_with_peeking_eye:

Niall Murphy

I don’t know for sure, but hopefully someone who knows the Slack best will comment.

Alexey Grigorev

We have invited authors of books on these topics in the past, you can check them in the archives
https://datatalks.club/books.html
And for most of the books we also have the answers from the authors there

LOIC EMO SIANI

Hello. Are there any prerequisites we must have to read the book if we already have an MLOps background?

Niall Murphy

I can’t think of any, no. (And it was written to be readable by a very wide audience.)

Cathy Chen (she/her)

Definitely no prerequisites.

LOIC EMO SIANI

ok thanks.

Ataliba Miguel

At what stage of the project should one plan/think to start implementing MLRE?

Niall Murphy

Some signals to think about:

  • do you believe you’re going to rapidly scale, in which case “getting out in front of it” might be a good idea
  • are you currently experiencing reliability problems
  • how much time do model developers spend fighting with infra
    In general, by the time the argument is unassailably clear for doing reliability work, you would have saved some time/money/effort by starting about six months beforehand 😉

To take part in the book of the week event:

  • Register in our Slack
  • Join the #book-of-the-week channel
  • Ask as many questions as you'd like
  • The book authors answer questions from Monday till Thursday
  • On Friday, the authors decide who wins free copies of their book

To see other books, check the book of the week page.


