

Data Observability: The Next Frontier of Data Engineering

Season 3, episode 3 of the DataTalks.Club podcast with Barr Moses


Transcript

Alexey: This week we will talk about data observability, and we have a special guest today, Barr. Barr is the CEO and co-founder of Monte Carlo, a data reliability company. She has experience building data and analytics teams, working as a management consultant, doing research as a research assistant, and even serving in the Israeli Air Force. Welcome. (1:48)

Barr: Hi! Thanks for having me. It is great to be here. (2:18)

Barr’s background

Alexey: Thanks for coming. Before we go into our main topic of data observability, maybe we can talk a bit about your background. Can you tell us about your career journey so far? (2:20)

Barr: As you mentioned, I was actually born and raised in Israel. I started my career in the Israeli Air Force, where I was a commander of a data analyst unit. I moved to the Bay Area about a decade ago, so I am located in San Francisco, California. I studied math and statistics — that is my background in data as well. Working as a management consultant, I worked a lot with data science teams at Fortune 500 companies on strategy and operations. (2:33)

Barr: Most recently I joined a company called Gainsight — a customer data platform, which created the customer success category. At Gainsight I built and led the team that was responsible for our customer data. We called it “Gainsight on Gainsight”, or “GONG” for short — that was our nickname for it. We helped our customers use data to grow their businesses with their customers. Throughout that experience, leading the team responsible for customer data and analytics, I realized how big of an issue some of the very fundamentals around data are. As companies try to become data driven and rely on data, they actually run into what I call “data downtime”. We will talk a little bit about that later, but that is when I first encountered it.

Barr: That got me thinking: how is it that we are so advanced in data, and yet there are some things that we have not figured out? So I started Monte Carlo with the goal of helping organizations use data, adopt data, and become data-driven by minimizing what I think is the biggest problem — the biggest hurdle — which is data downtime. It is a big pleasure for us to work with some amazing companies, helping them solve this problem and achieve data reliability.

Market gaps in data reliability

Alexey: You were leading analytics teams, you were working closely with data, and you noticed, “okay, we have some ideas how to process this data, but when something breaks, things go wrong”. That led you to realize there is a gap in the market that you could fill. This is why you created the company. Right? (4:35)

Barr: Let me describe a scenario that I am sure is familiar to anyone in data, and potentially in engineering as well. It is always on a Friday evening, at 6:00 p.m., five minutes before you are just about to log out, something hits you. Like a customer reaches out and says “hey, the data here looks really wrong, what is going on”. You are literally just leaving the office, just about to sign off, and then suddenly something blows up. Or five minutes before a really important board meeting, the CEO pings you and says “hey, the graph here, something with the numbers that I am showing just looks off, what is going on”. Then the scramble starts: what went wrong? Was the report refreshed? Did all the data arrive? Did someone make a schema change somewhere that messed everything up downstream? (5:00)

Barr: It starts this guessing game of what is going on. So there are a few problems. One — data teams are often the last to know about these issues. They often find out about these problems from consumers of data: executives, business units, or actual users of your product. Second, it often takes ages to understand what the problem is and identify the root cause. Our systems are so complex today that the ability to pinpoint root causes is extremely complicated, especially when done in a manual way.

Barr: Seeing this problem come up again and again, both for myself and for data teams at other organizations, I was like, “Are we crazy? Am I crazy? How is it that this problem exists? How can it be that we do not have a better way to do this?” I was inspired by the realization that there is a better way to do it — and the better way is actually based on best practices from engineering.

Observability in engineering

Alexey: Speaking about best practices. I did a bit of googling before our talk today. Data observability is based on “observability”, which is a concept from the devops world, right? So, before we go into data observability, can you tell us what it means in the devops world? What are the best practices you are talking about? (6:56)

Barr: In general, the data space has been evolving very quickly, but we are still quite behind in terms of methodologies and frameworks compared to engineering. It is actually worthwhile spending time understanding concepts like DevOps and others from software engineering that can help us navigate the data space and accomplish what we want in a better way. The idea of DevOps emerged in the last couple of decades as the underlying tech stack became way more complicated — similar to what is happening in data. For example, take an organization that is moving from a monolith to a microservice architecture — something that almost every organization is doing. As a result, there has been the rise of DevOps — teams that keep a constant pulse on the health of their systems and make sure that all the applications and infrastructure are up and running. (7:20)

Barr: Part of DevOps is the idea of observability. Observability is a holistic view that includes monitoring, tracking, and triaging of incidents to prevent downtime of those systems. At its core, observability in engineering is broken into three major pillars: metrics, logs, and traces. All of these together help us understand the health of the system based on its outputs — and when things are wrong, why. So, answering basic questions like: is the system healthy? If not, what happened? When did it happen? Are there other correlated events that could help us understand what is happening here? You have systems and software to help address the need for observability and monitoring. Every engineering team today that respects itself has something to manage that.

Barr: A solution like NewRelic or DataDog or AppDynamics or PagerDuty — all very familiar solutions that help us answer these questions when it comes to applications and infrastructure. So it is a very important concept in software engineering and one that has been relied on for many years now.

Data downtime

Alexey: You said teams were moving from monoliths to microservices. Usually microservices are some sort of web services. Tools like DataDog, NewRelic, or even open source tools like Prometheus and Grafana are tailored to these kinds of applications — to web services, something that is always running. While in the data world we often have something different: we more often have batch processes rather than something that is up and running all the time. That is why we need a bit of a different approach, right? (9:49)

Barr: Exactly. It is a different approach, but it is very important. Let us talk about why. I will draw an analogy between what we call “data downtime” and application downtime. Let us take a specific example: a company — say an e-commerce company — has a website. A couple of decades ago, if your website was down, no one noticed, because you probably had a real shop where people actually purchased things. (10:28)

Barr: So the website was something minor that nobody cared about. But today if your website is down, it is your product. You have to manage that very carefully. You have a commitment to like 99.99 percent uptime. Today you have all these solutions and many others like you mentioned to actually make sure that your website is always up and running.

Barr: If you think about the corollary of that for data, as you mentioned: maybe five or ten years ago, who was using your data? Only a small handful of people, and they were using it very infrequently — maybe only once a quarter to report numbers to the street. But in today's world, there are way more data engineers, data analysts, and data scientists. There are way more people in the organization who are using data to make decisions and to power your product. If the data is down, that is a big deal. Maybe 10 years ago it was not a big deal because no one used it. But today it is a big deal. (The stream goes down here.)

Alexey: To be honest, I do not remember what you were talking about. You were saying that data downtime is a big deal. While previously nobody cared about this, today there are data analysts, data scientists, and everyone else who is using data. We rely on this data. And if the data is not there — I know, I work as a data analyst [correction: data scientist], we build machine learning things on top of data. When the data is not there, when our model stops working, then we start asking, “okay, what happened?” Then we see the data is not there. Or there should be 1 million records, but today we only have 10,000. Where is the rest? (12:11)

Alexey: These failures are often silent failures. If the data did not arrive in full — maybe just a fraction of it appeared — and we do not have monitoring on our machine learning pipeline, it looks okay. There is some data, let us apply the model to this data, and I am done. But it is silent; we did not notice that something is wrong. I think you were talking about something like that?

Barr: That is exactly right. The job completed; it is all green, everything passed. But you only got a small fraction of the rows that you were hoping to get. Or the job completed and you got all the data, but now all the values are suddenly NULL. Or you got all the data, but it is credit card data, and suddenly you have values that you do not expect — like letters or something that should not be there. But you never knew, because the job completed and everything looked fine. You might just not know about it. If you are lucky, you might find out about it the same day. But oftentimes it can take weeks or even months until you realize: yeah, my model was operating on completely wrong data, or I was using data to make decisions that were totally incomplete or actually wrong. (13:40)

Data quality problems and the five pillars of data observability

Alexey: Let us talk about the problems we have with data — about data quality problems. First, there could be some things that are not supposed to be there, like letters in numeric fields. Another problem is that the data is simply incomplete: we do not have all the rows that were supposed to be there — instead of one million, we have only 100,000, for example, or even less than that. What are the other problems that we can have? (14:38)

Barr: When we first started to determine what data observability is and what it should mean, one of the common things that we heard was, “every data team is different”. Every data team is unique. My data can break for millions of different reasons. There might not be even a pattern for all this. I actually disagree with that. Before we started the company, we spoke with hundreds of data teams to come up with a database of all the times and all the reasons when your data goes wrong — and why? What is the root cause? What is the symptom? And how did you identify it? (15:09)

Barr: What we have seen is that there actually are patterns. There is a coherent set of things that you can work off of, that you can instrument and monitor for, that will help you gain observability. It is similar to observability for DevOps: even though your applications and your infrastructure can break for a million different reasons, there are still those three core pillars that we talked about that help engineers identify when their systems are down. So for data teams — what are those? What is that framework? What are those metrics? We define five different pillars for this.

Alexey: Those are the three pillars you mentioned: metrics, logs, and traces. And for data, you said there are five, right? (16:29)

Barr: Yes, precisely. There are five that we define. We believe that if you monitor all of them — instrument and track them — you will get the same level of confidence in your data. Let us talk about what those five are. (16:38)

Barr: The first one is freshness. Let us say we have a table that gets updated three times an hour regularly. And then it has not been updated for a full hour. That might be a freshness problem. There are many different ways to think about freshness. Basically it answers the question of “how up-to-date is my data”.

Barr: The second pillar is volume. For volume, you shared the example — “I expect one million rows and I am getting only 10,000 rows”. Volume tells us about the completeness of our data tables.

Barr: The third pillar is distribution. Distribution looks at the range of the data. Let’s say I expect a certain field to be between 5 and 15, and then suddenly I am getting values in the hundreds or 200s, for example. Or a credit card field suddenly gets letters instead of numbers. All of these examples would fall under the distribution pillar.

Barr: The fourth pillar is schema. It looks at who makes changes to our data and when — both at the table and at the field level: if a table is added, removed, or deprecated, or if a field changes type.

Barr: The fifth pillar is lineage. Lineage is basically a map — an auto-discovered or auto-reconstructed map of all the dependencies, both upstream and downstream, of your data assets. Lineage helps us answer the question, “if I have a freshness problem in a given table, what downstream assets are impacted by that?” Maybe no one is using that table, so I do not need to care about that freshness problem. But maybe it actually feeds an important model that someone is using? Or maybe it goes into a report that gets sent to a customer regularly? What are the dependencies? And then, similarly, what are the upstream root causes for these problems? What may have contributed?
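
To make the first four pillars concrete, here is a minimal sketch of what such checks might look like in plain Python over a pandas DataFrame. The table, column names, and thresholds are hypothetical, and this is not how Monte Carlo implements them — just an illustration of the kind of signal each pillar monitors.

```python
from datetime import datetime, timedelta, timezone
import pandas as pd

# Hypothetical expectations for an "orders" table; in practice these would be inferred from history.
EXPECTED_MAX_AGE = timedelta(minutes=20)      # freshness: the table is normally updated ~3 times per hour
EXPECTED_MIN_ROWS = 1_000_000                 # volume: roughly a million rows per load
EXPECTED_AMOUNT_RANGE = (5, 15)               # distribution: plausible value range for "amount"
EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64", "updated_at": "datetime64[ns, UTC]"}

def check_pillars(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable warnings, one per violated pillar."""
    warnings = []

    # Freshness: how long since the table last received data?
    age = datetime.now(timezone.utc) - df["updated_at"].max()
    if age > EXPECTED_MAX_AGE:
        warnings.append(f"freshness: last update was {age} ago")

    # Volume: did we get roughly the number of rows we expect?
    if len(df) < EXPECTED_MIN_ROWS:
        warnings.append(f"volume: only {len(df)} rows, expected >= {EXPECTED_MIN_ROWS}")

    # Distribution: are values inside the range we consider normal?
    lo, hi = EXPECTED_AMOUNT_RANGE
    out_of_range = ((df["amount"] < lo) | (df["amount"] > hi)).mean()
    if out_of_range > 0.01:
        warnings.append(f"distribution: {out_of_range:.1%} of 'amount' values outside [{lo}, {hi}]")

    # Schema: did columns appear, disappear, or change type?
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    if actual != EXPECTED_SCHEMA:
        warnings.append(f"schema: expected {EXPECTED_SCHEMA}, got {actual}")

    return warnings
```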

Example: job failing because of a schema change

Alexey: I recalled an example — something we had recently at work. The data changed, the schema changed. It was announced, of course, in a Slack channel. But of course, I have so many Slack channels open, and I do not read all of them. So I simply missed it. Then, two weeks later, my job stopped working. Then: “okay, what happened? Why is the data now in the wrong format?” But it was announced! If there was something like this — a map with downstream dependencies — then it would have been possible to know, “okay, this data source is used by this team”, and I would be one of the users of this data. Then maybe I would get a personalized alert saying “the data is about to change, you have to take action now, or else in two weeks your jobs will fail”. This is, I guess, a good illustration of the lineage pillar, right? (19:10)

Barr: Yeah, that is a perfect illustration of a few of them. It is the schema pillar, because there was a change in schema. And it sounds like someone manually notified you on Slack — which is very thoughtful — but maybe that should be automated as well. Then yes, lineage — you are spot on with that. It is a great example of how that could have helped. You also mentioned that it caused the data to just stop arriving. (20:25)

Alexey: It was still arriving, just not in the format my jobs were expecting. (20:52)

Barr: Got it. In that case, that might be an example of a distribution-type problem — the data arriving in a different format. The interesting thing with data downtime is that oftentimes it includes problems from multiple pillars, and each pillar can have multiple different problems. So when you are thinking about observability and monitoring — all that good stuff — what you need is a system that can detect all of these and help you automatically draw insight from them. Your example is spot on. I hope it was resolved quickly. (20:56)

Alexey: Yeah it was. I had to stop working on other things and fix that. At least my code complained that “hey something is wrong, this field is not there”... (20:30)

Barr: That is how you found out about it? (21:40)

Alexey: Yeah. It broke — it did not work. But it could have been worse if the script, the job kept running. Then I do not know when I would have noticed that. Maybe one month after that. (21:42)

Three pillars of observability (good pipelines and bad data)

Alexey: You also mentioned other things from the DevOps Observability world, like metrics, logs and traces. Do we still care about these things in data observability? (21:57)

Barr: Yeah, it is a great question. I would say we definitely still need to think about them — they are just two different parts of the system. You probably cannot run a very healthy system with one and not the other. Oftentimes we call it the “good pipelines, bad data” problem. You might have really reliable, great pipelines or great systems — you are still tracking observability from an engineering perspective — but the data itself that is running through your pipelines is inaccurate. That is why you need data observability. (22:19)

Alexey: Okay. The idea here is — we still care about DevOps. We have both observability and data observability. It does not mean that we stop caring about all the other things and care only about these five pillars. We still need to be on top of good engineering and DevOps practices. We need the three DevOps observability pillars to make sure that our systems have observability from the engineering point of view. But we also need to add the five pillars of data observability, so we have both good pipelines and good data health? (22:57)

Barr: Yes, that was spot on! We often find that data engineering teams are really busy — they have a lot of things to do. And we see a significant reduction in time when data observability is used in practice: things like 120 hours per week on average for a five-person team, and a reduction of almost 90 percent in data incidents. So when you think about teams practicing data observability, you are spot on. We cannot use only one; we have to use both, for different parts of our tech stack. (23:42)

Observability vs monitoring

Alexey: I noticed that you often say “observability” and “monitoring”. Is there any difference between these two things or are they synonyms? (24:31)

Barr: Monitoring is a subset of observability. Observability basically tells us, “based on the output of a system, what is the health of that system?” Observability helps us answer questions about the health of a system based on the results that you get from monitoring. Monitoring will tell us “this system is operating as expected and it is healthy”, or “hey, there is an outlier or a problem with the system here” — like an indication of a freshness problem. Observability, on the other hand, helps us answer the question of why this is all happening. With monitoring we see that there is a problem; with observability we can identify what that problem is and find its root cause — and also how it can be solved, by answering what downstream tables rely on that data. It can also tell you whether it is an important table to fix or not a priority. So you actually need both. I see monitoring as the part of observability that helps us answer these questions. (24:45)

Finding the root cause

Alexey: I am just trying to think of an example. Let us say there is a job that is producing data and it stopped working. We have monitoring in place that says, “We expect the data in this table to appear three times per hour, but it has been one hour since it last appeared and nothing is there.” We would get an alert — you get called by PagerDuty: “Hey, something is wrong with the data!” It does not tell us what went wrong — it just tells us that something is wrong. Then, using other tools and other pillars like lineage, how do we figure out what actually went wrong? How does it work? (26:04)

Barr: The other pillars can give us clues and help us understand why this happened and what we can do about it. Let us say we have a strong monitoring system and we get an alert about a freshness problem. Then we look at that table and we see that three other tables downstream also had a problem. We see that those are correlated — they happened sequentially. That gives us clues about the impact of this. To your point, using lineage, we can see that there are a bunch of other tables further downstream that rely on this. They are going to be impacted later today. We need to stop the data from going to those tables, or we need to fix this immediately. That gives us clues as to how we can actually solve this. (27:10)

Barr: Another example of understanding why this is happening. Let’s assume we see that there is a freshness problem. We want to look at the query logs, we want to understand who is making updates to this table. I can actually ping the right person or the user of this table to better understand why they are using this table, is it important or not, when are they using it. Actually using metadata about this data, getting data-driven about our data, can help us answer that. The way that I think about observability is — “getting data-driven about data”. That includes knowing when things are wrong like monitoring but also answering a bunch of other questions too.
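
As a toy illustration of the lineage workflow described above, here is a minimal sketch — with hypothetical table names and a hand-declared graph, not Monte Carlo's implementation — of walking a dependency map to find every downstream asset impacted by a freshness incident.

```python
from collections import deque

# Hypothetical, manually declared lineage: table -> tables that read directly from it.
# A real platform would reconstruct this automatically, for example from query logs.
DOWNSTREAM = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["marts.daily_revenue", "ml.churn_features"],
    "marts.daily_revenue": ["bi.revenue_dashboard"],
    "ml.churn_features": [],
    "bi.revenue_dashboard": [],
}

def impacted_assets(source: str) -> list[str]:
    """Breadth-first walk of the lineage graph starting from the broken table."""
    seen, queue, result = {source}, deque([source]), []
    while queue:
        current = queue.popleft()
        for child in DOWNSTREAM.get(current, []):
            if child not in seen:
                seen.add(child)
                result.append(child)
                queue.append(child)
    return result

# A freshness alert on raw.orders tells us which reports and models to warn about.
print(impacted_assets("raw.orders"))
# ['staging.orders_clean', 'marts.daily_revenue', 'ml.churn_features', 'bi.revenue_dashboard']
```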

Who is accountable for data quality? (the RACI framework)

Alexey: This makes me think about who should be responsible for it. I imagine this setup — there is some sort of central data platform. There are teams who publish to this platform, and there are teams who consume from it. I am wondering whose responsibility it is to make sure that the data is there. The platform can probably help us with freshness. But if the producers of the data start publishing data with errors, there should be some process that lets them know, “something is probably wrong”. Should it be the team that implements these checks? Or should the platform tell them, “hey, something is wrong”? How is it usually implemented? (29:00)

Barr: That very much depends on the maturity and the size of the data organization. When you are a very small company and you have maybe one data engineer — the lone data engineer — he or she is typically responsible for everything. They are going to be setting up the system, receiving the alerts, troubleshooting it, and letting everyone know about it. (29:59)

Barr: In a large company, you might have 30,000 people consuming data and a data team of several thousand people. We see that people are moving towards a decentralized model of ownership. You have sub-teams within the data organization, and those people have ownership of their data. In those cases, organizations actually make data observability and data monitoring self-serve. There is typically a centralized group responsible for saying, “this is the platform that we should be using for data observability”. Then each sub-team defines how they engage or interact with that platform, and that sub-team receives personalized alerts for their data assets only.

Barr: Oftentimes, as companies move from being very small to being very large, we see different models along that path. I think one of the main questions that people ask us is “who is actually responsible for this?” and “how do we set accountability?”

Barr: One of the ways we see companies deal with this is with a framework called “RACI”. It helps determine accountability in organizations. RACI stands for: “R” is “responsible” — the person who is executing the specific task at hand. “A” is “accountable” — the person whose neck is on the line; it is their job to make sure the task gets done, and they are the one held accountable. “C” is “consulted”, meaning their opinion counts, or you would want to seek their opinion about something. Then “I” is “informed”, meaning someone who needs to know about it.

Barr: For each part in the data lifecycle, you can define who is responsible, accountable, consulted and informed. You can use that to determine what is right for your organization. For example, for specific data observability or data quality problems, we can say that the chief data officer or the CTO is the person who is accountable (“A”). But the person who is actually solving that is the data engineering team. The person or the organization that needs to be informed (meaning “I”) is the data analyst team. They need to know about problems, they are not responsible for the jobs in the pipeline specifically. You can use this framework to help allocate who needs to do what and when, and clarify that. So you are not in a position where there is one person doing all that.

Alexey: I am thinking about what I am doing. I am a data scientist and I build machine learning models using data that other teams produce. There is a team that produces this data source, and there is a team that produces that other data source. What I do is join these two data sources and build something on top of that. In this case, I would be informed. If something goes wrong, somebody from these teams will reach out and say “sorry, there is some problem with the data”. They are responsible. They are reaching out to me. The accountable person is maybe a product manager or a manager in that team. This person would be doing the communication, saying “hey, sorry, I know you use this data, I know you rely on this data, but something happened: a server died, there was a bug, and the data is not there, sorry”. The data engineers in the meantime are trying to fix it. It also puts me in the consulted category. I could be a stakeholder; I can say “okay, you promised this data with a delay of one hour, but this is not enough. Can you make it a bit faster, with a smaller delay?” If they care about my opinion, it means I am consulted, right? (33:42)

Barr: That is right, you got it. (35:20)

Service level agreements

Alexey: So what do we do with this? Responsible and accountable — this is the team that actually publishes the data. If I am a data scientist and I want to build a model based on some data, I go to this team and ask: what are your SLAs — “service level agreements”? Can you promise me that the data will appear there five minutes after the user made an action? We make an agreement between our team and the team of data engineers. Then, you said, in small organizations there would be one data engineer who does everything. But once the company becomes bigger, we could have this centralized platform where teams define these SLAs: “we promise that the data will not be delayed by more than five minutes”, so the freshness requirement is five minutes. So there is an agreement between us. Then they start pushing the data to the platform, and when something goes wrong, the system alerts them and the data engineers fix it. (35:24)

Barr: That is exactly right. (36:49)

Alexey: Do you want to add something there? (36:52)

Barr: I think you did a great job of explaining that. I think that is perfect. (36:55)

Alexey: Yeah. I think you are the podcast guest right now. (37:00)

Barr: I know. I much prefer it when you do such a great job of explaining this. Yeah, that is spot on. In the same way that we have adopted observability as a concept, SLAs are something super common in engineering. We have not adopted them in data yet, but we can, and it will be helpful for us to have that communication agreement. It is important because it will help your data engineering counterparts know what to focus on and what to solve for. (37:04)

Barr: Imagine that they have hundreds of tables with freshness issues, but there are only 10 of them with an SLA — a particular SLA for freshness and timeliness, like this five-minute window. Then they will know to prioritize those, and the rest can wait for later. It provides an agreement between the two of you about what actually matters to you: you can tell them that these tables matter more and this data set does not matter at all. So it actually allows us to have better communication — and to not waste our time on things that do not matter.

Inferring the SLAs from the historical data

Alexey: There are a few crucial components here. The first — we need to have this platform that allows us to define these requirements: we have these expectations for freshness, we have these expectations for volume… Maybe for volume we do not even need to define that? Maybe it should be like “something is wrong because yesterday it was that much, but today it’s less”. (38:14)

Barr: Same for freshness, actually. Here is the cool thing — for each of the pillars, there is a component that can be automated to begin with. I believe we have under-invested in automation, which is ironic. Teams are used to saying “okay, I need to define that the volume here needs to be this” or “the freshness needs to be that”, to your point. But you have historical data about that. You can infer the volume that you expect. You can also infer “okay, this table is being updated three times an hour, so it should keep being updated three times an hour”. Obviously you can add customization on top of that, so you can say “No, actually I want this data to arrive faster. Can you make sure that happens?” I definitely think that we need to start with a layer of automation and then add customization on top of that. But I would start with what we already know about the data. (38:43)

Alexey: I imagine that we are not starting from scratch. There are already some processes that push data somewhere, and there are processes that read data from somewhere. It is not like we have a blank page, right? There is already something. We can just see, “okay, usually something appears here within five minutes”. Let’s just use this as an SLA, so we can infer it from the past. But in some cases, we should have a way to override this and say “let us make sure it appears earlier”. Either way, we need to have this place where it is possible to define this. Then we also need people who take responsibility and say, “We are going to stick to these SLAs, and if something is wrong we are going to make sure that we recover as fast as possible”. (39:48)

Barr: Yeah, you can even have a monitor in your (virtual) office that shows: here are the SLAs, here is how well we are doing — and we are crushing it. (40:43)
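
A minimal sketch of what this inference could look like, assuming we already have a history of load timestamps for a table: the expected update interval is learned from the past, and a freshness SLA is derived from it unless someone overrides it. The safety factor and the override value are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from statistics import median

def infer_freshness_sla(load_times: list[datetime], safety_factor: float = 2.0) -> timedelta:
    """Infer an expected maximum delay between loads from historical load timestamps."""
    gaps = [later - earlier for earlier, later in zip(load_times, load_times[1:])]
    # Use the median gap so a few outliers do not skew the expectation.
    return median(gaps) * safety_factor

def is_fresh(last_load: datetime, sla: timedelta, now: datetime) -> bool:
    """Check the current state of the table against the inferred (or overridden) SLA."""
    return now - last_load <= sla

# Example: a table historically updated roughly every 20 minutes.
history = [datetime(2021, 5, 7, 12, 0, tzinfo=timezone.utc) + timedelta(minutes=20 * i)
           for i in range(10)]
sla = infer_freshness_sla(history)        # ~40 minutes with the default safety factor
sla = min(sla, timedelta(minutes=30))     # manual override: we want it tighter
print(is_fresh(history[-1], sla, now=history[-1] + timedelta(minutes=45)))  # False -> alert
```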

Implementing data observability

Alexey: We have these two things: the platform itself, the tool that lets us define all this. Then we have this framework, RACI, to identify who is responsible, so we have this people aspect. Do we need something else to make it work, to have data observability in place? (41:03)

Barr: I would say that is a good start if you have both of those things. Maybe the additional thing that folks do is start defining playbooks and runbooks for what happens in these instances — basically workflows. Let’s say there was this table that you expected to get updated three times per hour and it stopped getting updated. Then there is a whole set of things that happens. First, you get informed, because your model relies on that data. Then the data engineers need to take some actions in order to resolve it. What are those actions? What exactly do they do? What systems do they look at to solve it? Who needs to know? How do they resolve it? All of that goes into the runbook. (41:27)

Alexey: Let’s talk about how we can implement this. There is actually a question about this, and I wanted to ask it as well. The question is: what are some of the good tools in the marketplace that do a good job with data observability? And I think I know what you are going to say. (42:34)

Barr: You have been doing a great job of that so you can go ahead and answer. (42:53)

Alexey: Monte Carlo? (42:57)

Data downtime maturity curve

Barr: I can provide an answer about what folks are doing, if that is helpful. I talked a little bit about the maturity curve — how things change from a small company, where there is one person doing everything, to a large company, where you might have a decentralized model with different ownership. As I mentioned, we talked to hundreds of data teams, and there is a maturity curve for how you deal with data downtime. (43:00)

Barr: In the very early stage you might be in a “reactive phase”, where you do not have anything in place and you have disasters all the time. I remember a CEO who told me that — back in the day when there were offices — they would walk around and put sticky notes on reports saying “this is wrong”, “this is wrong”, “this is wrong”. That is the first stage — a very reactive state.

Barr: The second stage is when people start thinking about how to solve this in a more proactive way. It’s called the “proactive stage” — when people put some basic checks in place. It can be just counts: I am going to manually select a bunch of tables and make sure that they get a million rows every day, because that is what I expect. Those teams spend a lot of time in retros and post-mortems, figuring out what went wrong.

Barr: The third stage is “automated”. They recognize that a manual approach is no longer scalable or effective. They start implementing some solutions — and we can talk about what those are.

Barr: Then the fourth stage is “scalable”. Here you have companies that have really invested in scalable and automated solutions — some of the best in class out there. You can take a look at Netflix: they have written a lot about what they have done for monitoring, observability, and anomaly detection.

Barr: Solutions range from things that you can hack together on your own, basically using SQL, Python, or Jupyter notebooks. We actually put together some tutorials on this, and I am happy to share the links afterwards if that is helpful for creating your own monitors. The other thing that you can do is look at specific areas in your pipeline and define specific tests in those areas, like in Airflow or something like that. What we are finding is that as data organizations start using their data more and getting more serious about data downtime, they need a holistic approach. Whether it is an open source tool or a bespoke solution that is easy to get up and running with, it needs to be something more holistic than a point solution.

Alexey: You mentioned that at the beginning you can get away with a bunch of counters — just count how many rows appeared each hour. At the same time, you can check the timestamp of the last inserted row — you can just look at the max value of this column, which will give you the freshness. I guess you can get away with a bunch of things just using plain SQL and a bunch of Python scripts, like you said. Would that put you in the proactive stage of this maturity curve, or already in the automated one? (46:05)

Barr: I would say somewhere around the proactive stage. That is still pretty ad hoc. (46:54)
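
For reference, the kind of ad hoc checks Alexey describes might look something like this: plain SQL run from a small Python script. The table and column names are hypothetical, and an in-memory SQLite database stands in for whatever warehouse connection you actually use.

```python
import sqlite3

# Stand-in: an in-memory SQLite database with a tiny "orders" table.
# In practice this would be a connection to your actual warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, inserted_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, datetime('now', ?))",
    [(1, "-3 hours"), (2, "-2 hours"), (3, "-70 minutes"), (4, "-65 minutes")],
)

# Volume: how many rows arrived in each of the last 24 hours?
hourly_counts = conn.execute("""
    SELECT strftime('%Y-%m-%d %H:00', inserted_at) AS hour, COUNT(*) AS n_rows
    FROM orders
    WHERE inserted_at >= datetime('now', '-24 hours')
    GROUP BY hour
    ORDER BY hour
""").fetchall()

# Freshness: the timestamp of the most recently inserted row.
(last_insert,) = conn.execute("SELECT MAX(inserted_at) FROM orders").fetchone()

print(hourly_counts)               # e.g. [('2021-05-07 09:00', 1), ('2021-05-07 10:00', 1), ...]
print("last insert:", last_insert)
```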

Monte carlo: data observability solution

Alexey: We need some sort of holistic picture. How do we get it, if all we have is a bunch of ad-hoc stuff put together with duct tape? It does some sort of alerting already — maybe there is an email to some of the people, hopefully, or a Slack message from a bot. That is already quite good. But how do we go even further — how do we get to the automated phase? (47:00)

Barr: In that case you will need an observability solution. Full disclosure — at Monte Carlo we have built a strong observability platform; that is the core of what we do. There are some characteristics that are important for a strong observability solution. One — it needs to give you end-to-end visibility. It needs to connect to whatever your data stack is, including your data lake, your data warehouse, your ETL, and your BI or your machine learning models. If you are just doing row counts on only a handful of tables in your data warehouse, that is probably insufficient: you are relying on data arriving on time in other areas, and a lot of the data is moving between different systems. If I were to choose an observability solution, it would be one that can actually connect to my existing stack end-to-end. It is also important to choose a solution that automatically learns your environment and your data. (47:33)

Barr: We talked earlier about whether you manually define the thresholds or rely on automation — we are not starting from scratch. Solutions that can do the instrumentation and start the monitoring for you, using machine learning models based on historical data — I think that puts you at an advantage.

Barr: Another key point is minimizing false positives. Data teams often have alert fatigue. A system that can take into account not only the data but also the metadata, and think holistically about this — about all five pillars, each of which is important, including things like lineage — can help minimize false positives and give you rich context about each incident. Then you know whether you should take action on it. There are certain criteria that you should look for when you are thinking about a data observability solution. That is the way to improve health overall and move up the maturity curve.

Open source tools

Alexey: Are there some open source tools for that? (49:52)

Barr: There are open source tools for specific pieces — for each of the pillars. But there is not one that is comprehensive across all five pillars. (49:56)

Alexey: I think the most difficult thing in my opinion is lineage — defining all these dependencies. Is there a good tool for that? (50:08)

Barr: There are different levels of lineage. For example, Airflow provides job-level lineage. But for table-level and field-level lineage — something that automatically reconstructs it — I actually have not seen a very strong one. I am curious to hear if there is anything that I haven’t seen. (50:24)

Test-driven development for data

Alexey: We have a couple of questions from the audience. Maybe we can go through them. RK is asking if there are any good approaches to test-driven development in the data space. And does it [TDD] have anything to do with data observability? (50:52)

Barr: Can you say it again? I don’t think I got the first part. (51:13)

Alexey: There is this test-driven development — a way to develop things in engineering. Are there similar approaches in the data space? And how do we go about testing in data? (51:15)

Barr: I think the ultimate question here is: what is the difference between testing and monitoring, or testing and observability, and what does that mean for us? Going back to software engineering and the importance of testing — in software engineering it is critical. You would be crazy to release something to production without testing it thoroughly. Somehow in data we actually do that — we do not always have strong testing in place. So putting strong measures for testing in place is important too. I think you need both; you cannot get away with just testing. One of the common pitfalls that we see is teams that think that testing alone is sufficient. (51:33)

Barr: The problem with that is that you do not know the unknown unknowns. In testing, you need to specify the things that might happen, but there are always going to be things that you did not pick up on. Monitoring helps make sure that when something happens, you will know about it regardless. I am a strong advocate of both. You can define tests like... For example, if you have a solution like dbt, you can define data quality tests in dbt to help make sure that you are doing testing properly. I think that is important, and it is another great area that we can adopt from software engineering.

Alexey: These data quality tests, we define some ranges? Or for this input, this is the sort of output we expect? Or how does this work? (53:16)

Barr: Yeah, exactly! Like manually defining what you expect: making sure that specific values are in a specific range, or checking specific things that you know are often broken or incorrect in the data that you want to test. Some of the common pitfalls that we see: one, as I mentioned, there are unknown unknowns — you do not always know what to test for. The other thing is that it is quite time consuming. There are folks who might define thousands of tests, and for a data engineer to go through all of that is actually quite laborious. So I would think of a strong strategy that incorporates both testing and monitoring to mitigate those issues. (53:26)
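
As a concrete illustration of the manually defined checks Barr describes (range checks, accepted values), here is a minimal sketch in plain Python. The column names and thresholds are hypothetical; in practice you might express the same expectations as dbt tests or in a testing framework instead.

```python
import pandas as pd

def test_order_amount_range(df: pd.DataFrame) -> None:
    # Distribution-style expectation: amounts should fall in a known range.
    assert df["amount"].between(5, 15).all(), "amount outside expected range [5, 15]"

def test_status_accepted_values(df: pd.DataFrame) -> None:
    # Categorical expectation: only known statuses are allowed.
    allowed = {"placed", "shipped", "returned"}
    assert set(df["status"].unique()) <= allowed, f"unexpected status values: {set(df['status']) - allowed}"

def test_no_null_ids(df: pd.DataFrame) -> None:
    # Completeness expectation: the primary key is never NULL.
    assert df["order_id"].notna().all(), "NULL order_id values found"

# Example run against a small, well-behaved batch.
batch = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [5.5, 9.0, 14.9],
    "status": ["placed", "shipped", "returned"],
})
for test in (test_order_amount_range, test_status_accepted_values, test_no_null_ids):
    test(batch)
print("all data quality tests passed")
```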

Is data observability cloud agnostic?

Alexey: Thank you. Do you know if the big vendors, the big clouds, have already moved into this space of data observability? Or are they still not really focused on that? (54:23)

Barr: At Monte Carlo, we have partnered with Looker, which was acquired by Google — so with GCP. We have partnered with PagerDuty, and we have partnered with Snowflake. I think the large vendors and the large cloud providers have noticed that this is important. We hear a lot from them that this is something that comes up a lot for their customers, and something that they want to provide a strong solution for. This is definitely something that is becoming more and more important. (54:37)

Alexey: But they should be agnostic of the cloud vendor. I imagine if a company is on AWS, there is quite a vendor lock-in. If you want to solve a problem for somebody who is on AWS, can you really be cloud agnostic? There are so many things that are specific to AWS, like S3, Athena, all these services. The question is: can it be cloud agnostic, or is that difficult? (55:16)

Barr: For us... for our observability platform, we integrate with all cloud data lakes, data warehouses, and BI solutions. We integrate with everything that you mentioned: GCP, AWS, and all the others as well. I think that is important, because a good observability solution will actually connect to your existing stack; you do not have to change to a different provider for it. So yes, it has to be agnostic, in the same way that it is in engineering. For example, Prometheus, Grafana, NewRelic, DataDog… they connect to whatever systems you have. I think it has to be a requirement in the same way. (55:56)

Alexey: And you can also use Kubernetes for your jobs and then it is also cloud agnostic. (56:49)

Barr: That is right. (56:55)

Centralizing data observability

Alexey: Another question from RK. What are your thoughts on centralizing the observability in a distributed environment where we have multiple different data warehouses and data pipelines? (56:57)

Barr: Thank you for the question, RK. I’d love to dig deeper and better understand your environment to see how we can help. It probably requires a deeper dive into your system and your infrastructure to better understand. But even in an environment where you have distributed ownership, it is important to find a centralized way to define SLAs as an organization. Does the data overlap? Is data reliability important to us? Do we care about having trust in the data? That needs to happen in some centralized fashion. Then each team can decide that “it is important to us” to make sure that we are delivering reliable data, and each team can define the SLAs for their own part of the organization. (57:13)

Barr: So observability matters regardless of whether you are distributed or centralized. Providing trust in data is important for everyone, regardless of what your structure is and what kind of data you are dealing with. If you do not have trust in your data, that is the worst thing that can happen to you. If you are producing data that people cannot use and cannot trust, that is the biggest threat to us as an industry. So regardless of the structure, there are different ways to adopt it and implement it. Either way, data observability should be core to your strategy.

Detecting downstream and upstream data usage

Alexey: How does Monte Carlo detect upstream and downstream usage of data? (58:51)

Barr: Again, happy to go into more detail if folks want to reach out to me. Feel free to email me, my email is barr@montecarlodata.com, or go to our website montecarlodata.com. I am happy to share the link afterwards, get into more detail, and talk about this. But at a super high level, what Monte Carlo does is this: we have a data observability platform built around these five pillars. As part of our lineage pillar, we reconstruct both the upstream and downstream dependencies, and we reconstruct the lineage for a particular system, whether it is your data lake, your data warehouse, or your BI. We do that across your systems as well, and we do it automatically; there is no manual input. Happy to go into more detail if you want to reach out directly to me. (59:04)

Alexey: I noticed that you also joined our Slack. That’s another way of contacting you. (1:00:05)

Barr: Absolutely! That is a great point. I am a big fan of the community that you are building and I am available on Slack. Happy to take any questions, feel free to send over. (1:00:15)

Bad data vs unusual data

Alexey: Maybe the last one for today. How do you differentiate between getting bad data and getting uncommon data, which might be interesting but not wrong? (1:00:27)

Barr: If I rephrase that a little bit, I think the question is: I might get notified about something, but it is intentional, it is not bad. It’s just different. It’s unexpected, but maybe it’s a good thing. And you are right, I don’t think we can actually discern that. Also, I am not sure that actually matters. I think people want to know about changes in their data, even if they are simply uncommon. Let’s say you had a crazy spike, or instead of one million rows you got 10,000 rows. Maybe that was intentional, because someone made a schema change upstream. Maybe that is still good data, it is not bad. But you still want to know about it, because it has implications for your machine learning model. (1:00:39)

Barr: I think a good observability solution will let you know about both instances. It will provide enough context about each event so that you can make the decision or someone can make the decision whether this is uncommon or bad. But we need to know about both. And the more context you have about that event, the easier it is to make that discernment and know what are the action items to take based on that.

Alexey: Should we actually get alerts every time there is a suspicious row? Let’s say we have a volume of one million rows. Do we want to get an alert for every unusual one? (1:01:52)

Barr: Probably not. You would probably get alert fatigue. You want a system that is a little bit more sophisticated than that. At Monte Carlo we have invested a lot in making sure that we send alerts for events that matter: for example, if this is a table that is highly used, highly queried, or has many downstream dependencies for machine learning models. There could be other instances or other ways to identify whether something is really important. But I would definitely say you want to be very thoughtful and make sure that you are being alerted on, and taking action on, the things that truly matter to your system. (1:02:06)

Alexey: I think we should be wrapping up. Do you have any last words before we do that? (1:02:50)

Barr: I would just say thank you for the time. I really appreciate it. This is a topic that is near and dear to my heart. If anyone wants to continue the conversation I will be on slack. (1:02:56)

Alexey: Thank you. I will put some contact details, Twitter and LinkedIn in the description. Thanks a lot for joining us today, for sharing your knowledge and experience with us. Thanks everyone on the stream for listening, for tuning in. I wish everyone a great weekend. (1:03:07)

Barr: Have a great weekend. (1:03:31)

Alexey: Nice talking to you. (1:03:33)

Barr: Likewise. (1:03:36)

Alexey: Goodbye. (1:03:37)

Barr: Bye. (1:03:38)
