
Fundamentals of Data Observability

by Andy Petrella

The book of the week from 29 Apr 2024 to 03 May 2024

Quickly detect, troubleshoot, and prevent a wide range of data issues through data observability, a set of best practices that enables data teams to gain greater visibility of data and its usage. If you’re a data engineer, data architect, or machine learning engineer who depends on the quality of your data, this book shows you how to focus on the practical aspects of introducing data observability in your everyday work.

Author Andy Petrella helps you build the right habits to identify and solve data issues, such as data drifts and poor quality, so you can stop their propagation in data applications, pipelines, and analytics. You’ll learn ways to introduce data observability, including setting up a framework for generating and collecting all the information you need.

  • Learn the core principles and benefits of data observability
  • Use data observability to detect, troubleshoot, and prevent data issues
  • Follow the book’s recipes to implement observability in your data projects
  • Use data observability to create a trustworthy communication framework with data consumers
  • Learn how to educate your peers about the benefits of data observability

Questions and Answers

Romuald Tcheutchoua

Thanks for the book!

Romuald Tcheutchoua

Does the book cover ALL the KEY observability needs for the ENTIRE machine learning lifecycle? If yes, can you share a glimpse of how the book can help at the specific stages of the ML lifecycle?

Andy Petrella

Hello Romuald Tcheutchoua, good one.
So data observability has a big role to play in analytics observability (see chapters 1 and 2). They work hand in hand to produce the “features” needed to represent the system in place (an ML product or the like).
An ML lifecycle likely includes a data lifecycle, and this is where you want DO to back you up.
Note that the concepts introduced in the book mainly translate concepts from application and infrastructure observability… which can therefore easily be translated into analytics/ML ones.
(I did the exercise, but it must be continued further… perhaps a future book, but happy if someone writes it for me ahaha)

Romuald Tcheutchoua

Also, there are a couple of ML observability platforms (such as Arize, etc.) available out there. What are the innovative aspects of the book that make it better compared to other available observability resources?

Andy Petrella

A question I would love to hear your own POV on.
An observability platform is complete when you have added all the observability components, each taking care of its relative facet of the system. This is also explained in chapters 1 and 2.
There is not yet one observability platform (OSS or not) that does everything… it’ll probably happen one day though 🙂

Nirupama Valluru

Have you advocated for the need for SLAs in the book? If yes, how effective would they be in agile systems?
If no, what is the alternative, and how can data reliability be ensured across teams?

Andy Petrella

Hello Nirupama Valluru, nice one!
Advocating is a strong statement, but I am covering it, yup.
It comes after covering many aspects that lead naturally to this concept, as well as SLOs and SLIs.
The best I can recommend is to read up to that point rather than me trying to summarize it too briefly here, TBH.
It is already in chapter 2, just before the discussion of garbage in/garbage out, so there is not much to wait before getting to it.
However, I am not covering the “agile” part; IMHO SLAs and agile practices are not BFFs 😅. Agile is meant to allow quick deliveries and to fail fast and safely, concepts that are also covered in the book (page 151, chapter 6).
Don’t get me wrong, I don’t mean an SLA doesn’t fit an agile team; I mean that relying “only” on SLAs is not possible in an agile system (defining an SLA is a long process that postpones delivery).
I am in fact advocating more for SLOs and SLIs, which are closer to data observability than SLAs are to data quality.
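To make the distinction concrete, here is a minimal Python sketch (names and thresholds are invented, it is not from the book): the SLI is a measurement, in this case freshness, and the SLO is a target on that measurement.

import datetime as dt

FRESHNESS_SLO_MINUTES = 60  # hypothetical SLO: data at most 60 minutes stale

def freshness_sli(last_refresh: dt.datetime) -> float:
    """SLI: minutes elapsed since the dataset was last refreshed."""
    now = dt.datetime.now(dt.timezone.utc)
    return (now - last_refresh).total_seconds() / 60

def slo_met(last_refresh: dt.datetime) -> bool:
    """SLO check: is the freshness SLI within its target?"""
    return freshness_sli(last_refresh) <= FRESHNESS_SLO_MINUTES

# Example: a table last refreshed 45 minutes ago still meets the SLO.
last = dt.datetime.now(dt.timezone.utc) - dt.timedelta(minutes=45)
print(freshness_sli(last), slo_met(last))  # ~45.0 True

An SLA would then be the contractual commitment layered on top of such SLOs.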

Nirupama Valluru

Okay
Thank you for the detailed response

Edwin Chuy (edchuy)

How does data observability fit within a data governance framework?

Andy Petrella

Hello Edwin Chuy (edchuy), neat question 🙂.
Some other members asked me corollary questions, and I can complement my previous answers with additional info that might be close to your specific question.
Actually, data observability works hand in hand with data governance: while one describes the policies and laws to be known and embraced, the other helps automate the associated controls and ensure they are implemented.
In a way, DG is the government, DO provides the sensors (e.g., cameras, …) to monitor, and the judges and cops are the engineers, stewards, etc. 🙂
Attached is a screenshot showing where DO fits in the DAMA Wheel.

Edwin Chuy (edchuy)

Andy Petrella Thanks for the answer. I somehow knew that the answer would fit within the DAMA framework.

Manoj

How different is data observability for different systems (like analytics, databases, ML models, LLMs, etc.)? Are there any common things/frameworks we can apply to all of them?

Andy Petrella

Hello Manoj, thanks for the question. I would recommend you check my answer to Romuald Tcheutchoua’s questions, as it is very similar.
However, that gives me another opportunity to emphasize again that there are several observability areas: data and analytics (ML) are two of them; others are application and infrastructure.
All areas share some common concepts and roles, but each also comes with its own set of capabilities that distinguish it from the others.
A system is, in fact, observable if it has all areas of observability enabled (see chapters 1 and 2).
You don’t need either one (ML obs) or the other (data obs), but all of them, due to the heterogeneity of data systems (which cannot be avoided, as it is intrinsic and vital).

Manoj

Thanks Andy Petrella

Arjun

Hi all, good day! Can anyone help me figure out how to get a copy of the book Fundamentals of Data Observability?
Thanks

Andy Petrella

Hey Arjun, you can get a free e-copy on the Kensu website: https://www.kensu.io/oreilly-all-chapters
Otherwise, it is of course available on the O’Reilly platform and on Amazon.

Andy Petrella

Hey @everyone! I’m happy to start the week interacting with this great community about my book - big up to Alexey Grigorev for having me.
FYI, you can download FODO for free on this page: https://www.kensu.io/oreilly-all-chapters.
Hopefully this can save you some bucks 😅.
Looking forward to your questions and to chatting with you.
Have a great week ahead!

H.Sajid

Hello everyone! Andy Petrella, does the book address how to seamlessly integrate data observability into existing data workflows and processes?

Andy Petrella

Hello H.Sajid! Yes, definitely: there is a chapter that integrates all the concepts and methods, chapter 7, “Integrating Data Observability in Your Data Stack” (50 pages).
Chapter 8 provides additional keys for other systems.

H.Sajid

Thank you for your response and the book.

Luis Oliveira

Hello everyone.
Thank you for this book Andy Petrella 🙂
My question is regarding data observability in the big world of data governance.
How far can a team go in implementing data observability processes if the company has no data governance structure in place?
Can a data team start with data observability processes on its own?

Andy Petrella

Hello Luis Oliveira, good ones!
Data observability can be initiated without data governance in place, at all.
Of course, I am not recommending going without DG 😄 (see my other report on the topic: What Is Data Governance? Understanding the Business Impact).
Data observability is a practice dedicated to giving the processes (applications, pipelines, ML, …) that manipulate data the ability to expose information about their behaviour (usage, purpose, patterns, …).
Using this information (data observations) doesn’t necessarily rely on data governance, because the observations are used primarily by all data users to gain visibility into what’s happening with data at any time.
However, having DG in place allows DO to be leveraged in more standard ways and the data observations to be used to “control” that data governance policies are followed, and, conversely, allows DG policies to evolve (given that data observations are facts from the field, DG can adapt to them over time).
You can make a parallel with other kinds of observability: using Prometheus, for example, doesn’t require IT governance to be in place (think about startups). That doesn’t prevent it from being useful to the whole team/company (by transitivity), but its value benefits from governance practices - ultimately, support teams, SREs, …
Alternatively, we can think of it like this:
How much do we need to be convinced, or forced, to produce logs about data schemas, data metrics, and even some data lineage?
Then, how often do we think about it?
Then, how much will governance policies make us do it?
Then, how much will enforced policies enable them?
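As a rough sketch of what “exposing information about their behaviour” can look like, here is a minimal hypothetical Python example (not Kensu’s or the book’s API) of a pipeline step logging its own data observations, i.e. schema, basic metrics, and lineage, as structured JSON:

import json
import pandas as pd

def emit_observations(step: str, inputs: list, output: str, df: pd.DataFrame) -> None:
    """Log schema, basic metrics, and lineage for one pipeline step as one JSON record."""
    observation = {
        "step": step,
        "lineage": {"inputs": inputs, "output": output},
        "schema": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "metrics": {
            "row_count": int(len(df)),
            "null_counts": {col: int(n) for col, n in df.isna().sum().items()},
        },
    }
    print(json.dumps(observation))  # in real life: ship to a log collector or platform

# Example: a step producing a small report table from two sources.
df = pd.DataFrame({"user_id": [1, 2, None], "amount": [9.5, 3.0, 7.25]})
emit_observations("build_report", ["raw.users", "raw.orders"], "mart.report", df)

Nothing here requires governance to be in place; DG later standardizes what must be emitted and how it is controlled.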

Tim Makhambetov

Thanks for the book, Andy Petrella! A couple of questions:

  1. What are the approaches to getting buy-in from management and stakeholders on implementing robust data observability in the organization?

Andy Petrella

Hey Tim Makhambetov, cool, thanks for the qs!
Getting buy-in for data observability is similar to getting buy-in for other types of observability (e.g., Prometheus, Datadog, and other folks of the kind).
It used to be hard, because the “value” is hard to demonstrate (esp. for non-tech-savvy ppl) - and also, the value is hard to track or measure (or there is no baseline yet):

  • the number of errors anticipated,
  • the number of processes trusted to automation vs. team work,
  • the satisfaction of users (and engineers…),
  • the management of unknown-unknowns.

And the list goes on.
A mistake often made is to use “quality” as one of the goals to be reached with observability.
In fact, observability (data observability in this case) is there to stop “quality” issues from propagating and becoming huge problems, given the amount of time and effort needed to understand each issue.
Avoiding all quality issues can be attempted with tests, prerequisites, contracts (prepping my next answer 😅), and the like; those can take a lot of time, perhaps an infinite amount (if you want to avoid “all” issues). So the best is to find a limit (or just abandon it and go to prod aha) and then… what?
Then data observability. To support all the other cases.
But also the cases that were predefined above, because data observability will provide information about their occurrences.
So how to get buy-in from stakeholders?
In my experience, it is sold as an insurance, a backup, because the team cannot humanly (I mean in people and time, hence 💸) support all data products/pipelines/… efficiently at maximum satisfaction. And because negotiating quality in data is a matter of pushback and throwing the ball back and forth between producers and users.
Teams know that problems *will* happen, but users do not accept them, because they have not been told problems *will* happen. So it is unacceptable when any little thing happens.
Teams don’t tell users anything about this fact; they are scared to, because they can’t say “yes, there will be problems, but we’re ready, so don’t worry”.
Data observability allows teams to tell their users they can be trusted, because they are prepared to help them out in case of problems and, over time, to avoid them.
PS: remember how hard it was to sell/convince about application/infrastructure observability back in the 2010s :face_with_spiral_eyes:
Tim Makhambetov

  2. What is your view or recommendation on data contracts for the purpose of data observability?

Andy Petrella

Data contracts are a great way to align on many things, especially before going to production. Esp. if they can be enforced.
Like any contract, they can go as deep as any human can think, and take as much time to do so, and thus delay the value generated by the transaction being contractualized.
A contract has its limits, because compromises have limits, and also contracts do not cover every single possible event that can happen (in fact they can, but how long will it take to “discuss” them all -> especially the associated responsibility and possible “fines”?).
Data contracts govern the transactions.
Data observability tracks the executions of the transactions.
Sometimes (often!) things don’t go the way they were expected to, but we have to deal with that; we need to be prepared; we need to find new compromises to allow the relationship to keep going and the transactions to hold (with trust).
That is what makes data contracts evolve.
Hence, in data contracts you fix rules, and with data observability you:

  • validate them
  • update them (yes, rules are not necessarily written in stone…)
  • help create new ones

So in a data contract, you specify that the processes in place must at least produce some foundational information (as described in ch. 2 ^^). Contracts can also specify which metrics are mandatory, but I don’t recommend that, because it’s not worth it.
Like most contracts, there is a request for both parties (esp. the service provider in such a case) to have an insurance, whose terms are relatively vague (and often absurd at first - like a 1B dollar coverage 😅). That’s data observability.
Note: SLAs are not observability per se, even though they are related.
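To ground the “fix rules, then validate and update them” point, here is a minimal hypothetical sketch (not a real contract spec) of a data contract as a set of rules validated against the data observations produced by a run:

# Hypothetical sketch: a contract fixes rules; observations from a run validate them.
contract = {
    "dataset": "mart.report",
    "schema": {"user_id": "int64", "amount": "float64"},
    "rules": {"min_row_count": 100},
}

# Data observations collected from the latest execution (e.g., by an agent).
observed = {
    "dataset": "mart.report",
    "schema": {"user_id": "int64", "amount": "float64"},
    "metrics": {"row_count": 42},
}

def validate(contract: dict, observed: dict) -> list:
    """Return the list of contract violations found in the observations."""
    violations = []
    for col, dtype in contract["schema"].items():
        if observed["schema"].get(col) != dtype:
            violations.append(f"schema: {col} expected {dtype}, got {observed['schema'].get(col)}")
    if observed["metrics"]["row_count"] < contract["rules"]["min_row_count"]:
        violations.append(f"rule: row_count {observed['metrics']['row_count']} < min_row_count {contract['rules']['min_row_count']}")
    return violations

print(validate(contract, observed))  # ['rule: row_count 42 < min_row_count 100']

A circuit breaker would then simply stop the pipeline whenever the returned list is non-empty.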
Luis Oliveira

I have one more question to add 😃
Which are, in your opinion, the best data observability tools and why?
I use dbt a lot, so I use Elementary, but IMO the package is full of bugs and has lots of room to grow.
I also work with Databricks for data from the raw to the silver layer.
I don’t know which tool we should implement, but I am thinking of doing a POC with Great Expectations.

Andy Petrella

Ahem, hehe, tricky one, as I am also the founder of Kensu…
I won’t give a direct answer to this, as I don’t want it to become a product plug, but I’ll give everyone some insights about “tooling” for data observability.
First of all, there are two parts to data observability:

  1. the data tools must be data observable
  2. the data observations must be leveraged in a data observability platform

For 1., Elementary is a good example of what is called in the book an “agent”, that is, a component attached to a tool that turns it “data observable”. In this case, dbt-core is already partially data observable, as it produces the manifest and run-results files; however, they are not easily consumable. Plus, the tests are nice, but the tracking is tedious, as devs must configure them all. Elementary solves the first part by exposing the data observations (in the JSON files) in a table, and partially the second, as it generates tests.
However, Elementary does not produce metrics on the data, only test results, which address a minor part of what needs to be covered to leverage data observability appropriately.
On Spark, same thing: an agent (a JAR + a py module) can be attached to the applications to generate the information necessary to add the data observability capability to all jobs. There are components offering lineage exposure and some schema as well, but there is only one I have seen that turns Spark jobs into fully data observable ones (guess who publishes it 😅).

For 2., a platform is needed to aggregate and consolidate the information coming from various agents (Elementary, a Spark agent, Azure Data Factory, Python scripts, stored procs, …) to create a holistic view of the processes, and also to offer capabilities leveraging the “history” of the behavior (a database storing time series and a graph, at least!).
This is something that must be kept separate from the tools (e.g., dbt, Spark, Databricks, Snowflake) and the agents (e.g., Elementary, Spline, …) to allow interop and the holistic view mentioned above.
Such a platform’s primary goals are to detect anomalies and categorize which are issues and which are not, to provide notification systems, and to provide troubleshooting helpers/companions. These must rely as much as possible on the information available in the platform, without constantly relying on people to provide inputs (e.g., quality rules); hence statistical analysis (or AI if you want) is table stakes.
Another important feature is the circuit breaker, which allows the platform to act as a control tower for data processes -> breaking pipelines that involve unreliable data, avoiding GIGO, etc.
HTH
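To make the agent idea tangible on the dbt side, here is a toy sketch of the first step an Elementary-style agent performs: reading the run_results.json artifact dbt writes after a run and flattening its test results into observation records. It assumes the standard artifact layout (a top-level results list whose entries carry unique_id, status, and execution_time) and is an illustration, not Elementary’s actual code.

import json
from pathlib import Path

def collect_test_observations(run_results_path: str) -> list:
    """Flatten dbt's run_results.json into records an agent could ship to a platform."""
    artifact = json.loads(Path(run_results_path).read_text())
    return [
        {
            "node": r.get("unique_id"),          # e.g. a test or model unique id
            "status": r.get("status"),           # pass / fail / error / skipped
            "execution_time": r.get("execution_time"),
        }
        for r in artifact.get("results", [])
    ]

# Typical location after `dbt test`; adjust the path to your project.
for obs in collect_test_observations("target/run_results.json"):
    print(obs)

As noted above, these are only test results; a fuller agent would also emit schema and metric observations, so the platform has more than pass/fail to reason about.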
Luis Oliveira

Oh sorry, I didn’t notice you were the founder of Kensu 😁
I think I understood what you said.
In a very simple way:

  • Both dbt and Databricks produce information that can be used to observe the data.
  • The DO tool should have a platform to organize and present all the results.

So Elementary is a good agent, because it uses dbt-core information to organize and present the results.
Regarding Great Expectations, I don’t know exactly how it handles organizing and presenting the results.
As far as I noticed, Kensu is free for connecting to Databricks, but if we want to present the data from a reporting point of view, then it is paid. Am I correct?
There is also the first one… Monte Carlo. I have no idea how it works or what the pricing is.
Thank you for the great information 🙂
Andy Petrella

Yes, your summary is really close to what I had in mind 😄.
I won’t discuss any of the (non-free/OSS) solutions here, but I’m happy to have side chats privately, as I don’t want to puzzle folks here 😛

Kane Williams

Andy Petrella, is there any type of data you find particularly tricky to apply some of the practices in your book to? E.g., time series, …

Andy Petrella

Kane Williams, not necessarily, because the nature of the data is not the main driver. Of course, when logging metrics, the engineer may want to produce specific metrics, such as the cardinality of a categorical variable or things like that.
The practice in the book emphasises that, to have control of a data system and automate its operations, the information about the data must come from somewhere other than the datastores themselves.
However, chapter 8 is dedicated to systems that are less easy to integrate data observability into - not necessarily legacy, but opaque (like a cloud/SaaS service, for example).
So graph, time series, unstructured, deep, flat, and other data types are totally OK with the practice introduced in the book 😄
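A tiny hypothetical sketch of the “specific metrics per data type” remark: pick which metrics to log based on each column’s dtype, e.g. cardinality for categoricals and summary stats for numerics.

import pandas as pd

def column_metrics(df: pd.DataFrame) -> dict:
    """Pick observation metrics per column according to its type."""
    metrics = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            metrics[col] = {"mean": float(df[col].mean()), "std": float(df[col].std())}
        else:  # categorical/object columns: cardinality is usually more telling
            metrics[col] = {"cardinality": int(df[col].nunique())}
    return metrics

df = pd.DataFrame({"country": ["BE", "FR", "BE"], "amount": [10.0, 12.5, 9.0]})
print(column_metrics(df))
# {'country': {'cardinality': 2}, 'amount': {'mean': 10.5, 'std': 1.8027756377319946}}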

Nirupama Valluru

Thank you for the book, Andy Petrella.
Are there any common signs that would indicate it’s the right time for a company to invest in data observability?

Andy Petrella

Hello Nirupama Valluru, thanks for this! It is a good corollary to Luis Oliveira’s question.
And, unusually 😅, the answer is quite short and straightforward.
Yes, there is a common sign: it is when your data pipeline goes to production and creates outcomes that will be used by others (people or systems).

Andy Petrella

This is the moment that triggers doubts, stress, and anxiety for the people who take on this action or responsibility. This is when you want to have maximum visibility.
Especially at the beginning. Over time, the team can reduce the visibility as confidence grows, etc.

Andy Petrella

It’s like when you go to prod for the first time: you have logs at the DEBUG or INFO level, then move to WARNING and ERROR later on.
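In code terms, a tiny illustrative sketch with plain Python logging (purely the analogy, not a recommendation from the book):

import logging

logging.basicConfig(format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("pipeline")

# First weeks in production: maximum visibility.
log.setLevel(logging.DEBUG)
log.debug("row_count=1042, schema unchanged, freshness=12min")

# Later, once confidence has grown, keep only the signal.
log.setLevel(logging.WARNING)
log.debug("this is no longer emitted")
log.warning("row_count dropped 40% vs yesterday")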

Nirupama Valluru

Thank you for answering!

Andy Petrella

My pleasure Nirupama Valluru

Mitch Edmunds

Andy Petrella I’m interested to know, do authors get to select the bird/animal on the front of their O’Reilly books and if so, what is the bird you selected and why?

Andy Petrella

Mitch Edmunds, aha, well, you, as an author, have no power in this process.
There is a strict, secret (no kiddin’) process followed internally to decide this (it takes a lot of time).
My ornate hawk-eagle was selected because it has freakin’ good eyesight (8x a human’s) :D

Mitch Edmunds

Oh right, ha! I guess they are a very important part of the brand, and the illustrations are always beautiful. You seem to have landed a particularly cool one; I’d never heard of this bird. And obviously very apt.

Andy Petrella

Ahahah, yeah. I loved it right away when they unveiled it to me 😄

Andy Petrella

Thanks @everyone for your questions, and big up to the ones in the running for the book. I’ll turn the ebook prizes into hard-copy ones, because you can still get a free e-copy on https://www.kensu.io/oreilly-all-chapters for a few months.
I am available for further questions, obviously! Please connect with me on LinkedIn and let’s rock it all 😄
https://www.linkedin.com/in/andypetrella/

Konrad

Thank you kindly for the book

To take part in the book of the week event:

  • Register in our Slack
  • Join the #book-of-the-week channel
  • Ask as many questions as you'd like
  • The book authors answer questions from Monday till Thursday
  • On Friday, the authors decide who wins free copies of their book

To see other books, check the book of the week page.


