
DataTalks.Club

Driving Data Quality with Data Contracts

by Andrew Jones

The book of the week from 07 Aug 2023 to 11 Aug 2023

Despite the passage of time and the evolution of technology and architecture, the challenges we face in building data platforms persist. Our data often remains unreliable, lacks trust, and fails to deliver the promised value. With Driving Data Quality with Data Contracts, you’ll discover the potential of data contracts to transform how you build your data platforms, finally overcoming these enduring problems. You’ll learn how establishing contracts as the interface allows you to explicitly assign responsibility and accountability for the data to those who know it best—the data generators—and give them the autonomy to generate and manage data as required. The book will show you how data contracts ensure that consumers get quality data with clearly defined expectations, enabling them to build on that data with confidence to deliver valuable analytics, performant ML models, and trusted data-driven products. By the end of this book, you’ll have gained a comprehensive understanding of how data contracts can revolutionize your organization’s data culture and provide a competitive advantage by unlocking the real value within your data.

Questions and Answers

Joe Edgerton

What would you say is the low-hanging fruit that companies/orgs could do to more easily adopt data contracts? I’m interested in learning more about data governance because we don’t have any data experts on my small team and we constantly run into issues that you’ve described. It sounds like chpt. 3 addresses this, and part 2 overall seems to get at the “why/what” for contracts. Thank you!

Andrew Jones

Hey Joe Edgerton, great question!
I’d say the low-hanging fruit and first step is to get people talking to each other! When there is an upstream change that breaks a data process, set up a meeting to discuss it. Make the engineering team aware of the issues the data team is facing, and the impact it has (or could have, for example if you have data processes that are critical to the business).
Those conversations will naturally start leading to discussions around how to prevent these issues. That’s when you can start to introduce data contracts as an idea to consider.
That’s really all it takes! People want to do the right thing for the organisation, but too often data teams suffer in silence, so other parts of the business are just not aware. Data teams also attempt to work around these issues, but that just increases complexity and makes the pipelines even less reliable, as well as more expensive.
A similar approach can be taken to any data governance initiative. Get the right people together, clearly articulate the problems you’re having and the impact or cost it has on the business, and get everyone bought in to finding a solution. That gives everyone a sense of ownership of the problem and incentivises them to help out on a solution.

Joe Edgerton

Thanks!

Hrithik Kumar

Could you elaborate on how establishing data contracts as interfaces can help assign responsibility and accountability for data to its generators? How does this autonomy benefit the data management process?

Andrew Jones

Hey Hrithik Kumar,
Great question, as this is a lot of what data contracts are all about!
There are a few ways data contracts as interfaces help assign responsibility and accountability for data to its generators.
Simply the act of calling this an interface helps a lot, as software engineers work with interfaces all the time. It automatically implies something that they are providing for a consumer, and that they are committing to maintaining. What we’re saying with data contracts is that it’s no different to an API, a library interface, etc, and you have the same responsibilities for it.
But, if you’re responsible for something, you need to have the autonomy to fully manage it, otherwise you won’t feel like an owner. I’m sure many of us here have worked in a team where a service was given to us and we were expected to own it, despite having no say in how it was developed and no confidence in making changes to it. It’s not a nice feeling! We don’t feel like an owner, and we’re not motivated to support it.
That’s why data generators have to own the contract. They need to be comfortable with the data they are providing, its structure, its SLOs. They need to have the ability to change those things as requirements or technical limitations change over time.
To support this autonomy we need to provide them with tooling that enables them to create and modify data contracts without review/gatekeeping/approval from a central team slowing them down. Ideally, that tooling will be as good as the tooling they use to build an API, or spin up a database on a cloud, etc. Most of us aren’t there yet, but in a few years this will be the norm.
Hope this helps! I go into this in a lot of detail in the book, and the word autonomy might be the most common word in the book! But do let me know if you have any follow-up questions and I’ll be happy to answer 🙂

Hrithik Kumar

Thank you for the detailed response. I have a follow up question on that.
Could you share insights on how organizations can transition towards embracing data contracts and the potential challenges associated with this transformation?

Andrew Jones

Sure!
The answer I gave to the question above on low hanging fruit has good advice on how to get started: https://datatalks-club.slack.com/archives/C01H403LKG8/p1691593111422599?thread_ts=1691581909.592479&cid=C01H403LKG8
Basically, identify where they add most value and start there. Build the minimal required tooling to support that. Once you’ve provided the value, roll it out some more and keep iterating on the tooling.
Alternatively, you could attempt a large-scale migration to data contracts, but there are a lot of potential challenges there (speaking from experience 😅). You’d need strong exec buy-in, which might not be so difficult if your organisation has a strategy that depends on quality data, and if you can articulate how data contracts help with that. But even with that, you’re going to need many teams to prioritise a lot of work, and there’s going to be push back. They have other things to do, and their incentives may favour prioritising that work instead.
So I’d recommend the first, gradual approach. You’re not just rolling out some new tooling - you’re changing your data culture to one where data generators take more responsibility and provide quality data to consumers in order to drive business value from data. That’s going to take some time! So be patient, target high-value use cases and projects, and keep at it 🙂

Hrithik Kumar

Really appreciate your response.

Hrithik Kumar

How do you see the concept of data contracts adapting to emerging trends in data privacy and regulations, such as GDPR(General Data Protection Regulation) and data sovereignty? Are there specific considerations that organizations need to keep in mind when implementing data contracts in such a regulatory landscape?

Andrew Jones

That’s a very important question. Regulation in tech in general, and data in particular, is only going to get stronger, so it’s something we should all be thinking about more!
With data contracts, we have the perfect place to categorise and define other metadata about the data. For example, in our implementation at GoCardless each field definition contains metadata telling us the type of data, whether it is personal data, an identifier, and how to anonymise it:
field(
  'bank_account_id',
  'Unique identifier for a specific bank account, following the standard GC ID format.',
  data_types.string,
  field_category.gocardless_internal,
  is_personal_data.yes,
  personal_data_identifier.indirect,
  field_anonymisation_strategy.none,
  required=true,
),
That then makes it very easy to build tooling that can automate your compliance to the regulations. In our case, we have a data handling service that takes care of anonymisation and deletion as required by GDPR. We also use this metadata to automate access controls. In future, we could extend this metadata to help us build tooling to ensure data sovereignty, define how we process the data if it’s used for AI training, etc.
The great thing about this tooling is it doesn’t care how the data is structured. The schema could be wide with lots of fields, or deep with lots of nesting - it doesn’t matter. The data generator has full autonomy over that, and as long as they categorise the data correctly, our tooling will work with that data and do the right thing to ensure compliance, without the data generator becoming an expert in GDPR!
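To make the idea concrete, here is a minimal sketch of metadata-driven tooling like the service described above. It is purely illustrative - the field names, strategy values, and dictionary layout are hypothetical, not the GoCardless implementation - but it shows how a contract's per-field metadata can drive anonymisation without the tooling knowing anything about the schema's shape:

```python
# Hypothetical sketch: dispatch anonymisation from contract metadata.
# Field names and strategy values are illustrative, not a real schema.
import hashlib

def anonymise_record(record, contract_fields):
    """Apply each field's declared anonymisation strategy to a record."""
    out = {}
    for name, meta in contract_fields.items():
        value = record.get(name)
        if not meta.get("is_personal_data"):
            out[name] = value  # not personal data: pass through unchanged
        elif meta.get("anonymisation_strategy") == "hash":
            out[name] = hashlib.sha256(str(value).encode()).hexdigest()
        elif meta.get("anonymisation_strategy") == "redact":
            out[name] = None  # drop the value entirely
        else:
            out[name] = value  # strategy "none", e.g. indirect identifiers
    return out

# Example contract metadata, mirroring the kind of categorisation above
contract_fields = {
    "bank_account_id": {"is_personal_data": True, "anonymisation_strategy": "none"},
    "account_holder_name": {"is_personal_data": True, "anonymisation_strategy": "redact"},
    "currency": {"is_personal_data": False},
}

print(anonymise_record(
    {"bank_account_id": "BA123", "account_holder_name": "Jane Doe", "currency": "GBP"},
    contract_fields,
))
```

The key point is that the function only consults the metadata, so any schema the data generator chooses works, as long as the fields are categorised.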

Hrithik Kumar

Thanks Andrew Jones for sharing your thoughts on this topic.

Toxicafunk

What are the most common mistakes you’ve seen even defining data contracts? What would be the top 3 or 5 things to avoid?

Andrew Jones

Hey, good question!
I’d say:

  1. Focussing on the tech but not the people/culture side. It’s really all about the people and changing how we do data, and the tech is there to support that.
  2. Trying to apply data contracts to everything, all at once! You need to manage that migration, focussing on where the most value is first.
  3. Making it too complicated. It’s a very simple idea really, so don’t over complicate it! Build the appropriate amount of tooling, automate where you can, put in place just enough process, but no more.

Hope that helps 🙂

Toxicafunk

Really helpful yes, thanks for being here!

Sven

Thanks!
What are good tools or a good tech stack to implement data contracts? And do we want the ownership on the IT or biz side? For example, a yml file in a git repo vs. a nice UI to change a schema.

Andrew Jones

Hey Sven, thanks for the question!
I like implementing data contracts at the infrastructure layer. That allows you to guarantee that (say) your BigQuery table will always match the contract, as it is generated from it. There’s a sample implementation in my book showing data contracts implemented using Pulumi, an open source infrastructure as code tool.
The tooling is best owned by a dedicated infrastructure or data infrastructure team, and should be designed so it allows the contracts to be owned by whichever team is generating the data.
Often that is a product engineering team, so the ability to define the contract in yml in a git repo is a good option, but more generally, it should be where they would most expect it to be. For example, at GoCardless where I work we used Jsonnet rather than yml, as all our other infrastructure is defined in Jsonnet, so it just made sense to be consistent. But there is nothing about Jsonnet that makes it better than yml for defining data contracts, so I wouldn’t recommend it to an organisation that doesn’t already use it.
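As a rough illustration of what such a contract file might look like in a service’s repo - every key and value here is hypothetical, not a standard format or the book’s exact schema:

```yaml
# Hypothetical data contract, versioned alongside the owning service's code
name: payments.bank_accounts
owner: payments-team
version: 1.0.0
slos:
  freshness: 1h
  availability: 99.9%
schema:
  fields:
    - name: bank_account_id
      type: string
      required: true
      personal_data: true
      anonymisation: none
    - name: created_at
      type: timestamp
      required: true
```

Because it lives in the repo, changes to the contract go through the same review and CI process as any other code change.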
If you’re targeting non-engineering teams, then a nice UI might make more sense. I haven’t heard of anyone building those internally in their organisations as yet, but some products that do offer a nice UI are https://datamesh-manager.com and https://docs.soda.io/soda-cloud/agreements.html.
Hope that makes sense 🙂 Let me know if you have any follow up questions or would like something clarified!

Sven

Wow, thanks for the detailed answer! I’ll have a look into these tools!

Sandeep

Does the book talk about the local development loop for the engineer?
Also, how do separate teams test out changes before pushing to prod?
Thanks for your time on these questions :)

Andrew Jones

Hey Sandeep, good questions!
I do a little, but at a higher level and not in loads of detail.
I talk about how CI checks can help, how to use tools like JSON Schema, and how to make use of existing open source libraries. I also talk about managing schema evolution in quite a bit of detail.
But mostly, I’m talking about the patterns that help you implement data contracts in your organisation, rather than exactly how to do so in any particular language/service.
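The CI-check pattern mentioned above can be sketched in a few lines. In practice you would use JSON Schema with an existing open source validator; this is a minimal stand-in (hypothetical schema layout, checking only required fields and types) just to show the shape of a check that fails the build when an example event drifts from the contract:

```python
# Minimal stand-in for a schema CI check: fail the build if an example
# event no longer matches the contract's declared fields and types.
TYPE_MAP = {"string": str, "integer": int, "boolean": bool}

def validate_event(event, schema):
    """Return a list of violations; an empty list means the event conforms."""
    errors = []
    for field, spec in schema["fields"].items():
        if field not in event:
            if spec.get("required"):
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(event[field], TYPE_MAP[spec["type"]]):
            errors.append(f"{field}: expected {spec['type']}")
    return errors

# Hypothetical contract schema for an example event
schema = {
    "fields": {
        "bank_account_id": {"type": "string", "required": True},
        "amount": {"type": "integer", "required": True},
    }
}

assert validate_event({"bank_account_id": "BA1", "amount": 100}, schema) == []
assert validate_event({"amount": "100"}, schema) == [
    "missing required field: bank_account_id",
    "amount: expected integer",
]
```

Run against a directory of example payloads in CI, a check like this gives producing teams fast feedback on breaking changes before anything reaches prod.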

Sandeep

Thank you
The schema evolution and the patterns seem super cool

Sandeep

More general question: do these contracts stay in a common repository or go in each pipeline repo?
Ignore, if this is too specific

Andrew Jones

I think data contracts should be in the place your data generators most expect them to be.
If they’re software/product engineers, then that should probably be where they manage related things like their REST APIs, infrastructure, etc.
For some organisations that will be a common monorepo, for others that will be in the repo specific for their service. A data contract most likely belongs to a service, so should live there. (Service should be a broad definition, so would include pipelines)
Having said that, having a common place to query each data contract is very useful for tooling, discovery, etc. That could be a schema registry, a data catalog (with a good API), etc.
Hope that helps! Let me know if you have any follow-up questions 🙂

Toxicafunk

In the free chapter (ch. 1) you mention the lack of expectations, reliability and autonomy as the main problems a data contract should address.

  1. It seems most of these problems are at least partially addressed by a data mesh architecture. Does your book discuss Data Contracts as a part of a Data Mesh initiative, or do you consider contracts by themselves enough to address them?
  2. Should a data contract also address why we care about exposing a given field, or should the why be addressed at the scope of pipelines instead of fields?

Andrew Jones

Hey Toxicafunk,


  1. Yes, there’s a section in chapter 2 titled Data contracts and the data mesh that aims to answer that 🙂 In short, they are very much complementary, and data mesh is one of the inspirations behind data contracts. I believe it would be very hard to implement a data mesh without data contracts. On the flip side, you can certainly use data contracts without committing to a full data mesh architecture.
  2. That’s a great question! I think the why should be captured in the contract. There is a reason that field was exposed, and documenting that would help others know what that field is useful for. If not in the data contract, it might instead be in a requirements document or similar, but they tend to go out of date. The data contract will be maintained for as long as the data is. I love this question though. Always start with why!

Kathryn Armitstead

Are data contracts the sort of concept that one could introduce to an organisation gradually, e.g. across 2 teams and then aim to get buy-in from other teams, or is it something that needs general buy-in to be effective? I like what I’ve heard about them, but finding how to introduce them into a very busy organisation with lots of other priorities would be a big challenge.

Andrew Jones

Kathryn Armitstead great question!
I think you can do so gradually. Focus on where they add most value, start introducing the language and showing how they help. Build the minimal tooling to support that. Then go from there.
I talk a lot about this in chapter 3, How to Get Adoption in Your Organization.
As you say, getting this prioritised in an organization will be a big challenge. You need to be able to articulate the value of it, and the value of data in general. Then once you’ve done that, you need to align the adoption of data contracts with the incentives of the teams you’re asking to prioritise.
But this shouldn’t put you off! Often, I see people say “there’s no way this will be prioritised”, but there is! Software engineers, product managers, directors, etc will care about data quality if they understand the issues being caused, the cost of those issues, and the value from being better 🙂

Kathryn Armitstead

Great answer - thanks Andrew Jones. I love the option of being able to progress with minimal tooling to demonstrate value and get buy-in.

Sven

What are the big differences between a Data Contract and a Data Catalog? I just saw different definitions. Is it possible to combine them?

Andrew Jones

Hey Sven, I’d say they’re quite different, but related.
A data catalog is a great tool for discovering the data you have, but it doesn’t do much more than that.
A data contract is how that data is being generated and managed.
A data contract will include the schema, some documentation, the owner, SLOs, etc. These are all things you would want in your data catalog, and we populate our data catalog from the data contract.
I talk a bit about data catalogs and data contracts in chapter 9.

Sven

Perfect, thanks!

To take part in the book of the week event:

  • Register in our Slack
  • Join the #book-of-the-week channel
  • Ask as many questions as you'd like
  • The book authors answer questions from Monday till Thursday
  • On Friday, the authors decide who wins free copies of their book

To see other books, check the book of the week page.

Subscribe to our weekly newsletter and join our Slack.
We'll keep you informed about our events, articles, courses, and everything else happening in the Club.

