A quick and not-so-dirty introduction to data contracts:
Data Contract: A data contract acts as an agreement between multiple parties, specifically, a data producer and its consumer(s).
Ok, that's it, job complete! You know everything about data contracts. Not quite, I guess…
In this article, I dig a little deeper. You'll learn about use cases, why data contracts build trust, and how our industry got here. We'll also cover open standards and conclude with some examples.
Imagine that Beth, a data scientist in your organization, wants to access some applicants' data from the company's HR package. Because she needs to run models, she knows that accessing the data through an API would be too slow and too resource-hungry. She needs a data pipeline that will extract the data from the HR solution, transform it into something usable, and load it into, let's say, a lakehouse.
So far, nothing surprising, right?
After looking at the data, however, Beth sees only various emails and phone numbers, since the transformation process anonymized the applicants' names, as expected. So where does she find information about the datasets she needs: which fields are anonymized, when the data is available, how often it is updated, and so on?
She could resort to finding the information in a wiki, Confluence, SharePoint, or another system. Still, we all know that documentation is a pain to maintain and usually lags behind, and when it comes to service levels, it is almost nonexistent.
Here's a great example of where the data contract comes in. And as you will see in the rest of this article, it is much more than a documentation tool.
As you saw from Beth's story, the data contract:
Let's dive in deeper!
Data consumers like Beth are skeptical about the stability of the data they find. And, very often, the first thing they do is make a copy of the data for themselves. One of their reasons is that they don't know if the data will be there in the future.
To create trust, the data producer (or the data owner) needs to make a promise and guarantee it.
While the term "data contract" may be relatively new (although it creates much confusion with cell phone contracts), the concept behind it and its usage are anything but. Let's check out the history:
Fast forward to the present, and more and more companies are switching to a federated (and not decentralized) organizational model. Consumer teams on the ground (on the factory floor) are no longer passive participants. They share feedback, define local policies, and make better decisions based on better metadata. Is it working? It's too soon to tell, but I don't see the flaws thus far.
Why use data contracts? They empower teams, enhance organizational governance, and decrease the time to market for new applications.
I have seen and heard about many forms and shapes a data contract can take. While some companies make a formal contract with signatures at the bottom, others use free-flow Excel sheets, Word documents, and Python. (Honestly, I don't know which one of those formats is the worst.)
For me, a data contract should be:
YAML, a standard file format used in software engineering, fits those criteria.
Of course, having a file format is not enough. There's also the content of the file and its structure to consider. If three separate vendors defined three different versions of HTML, for instance, would the web have been such a success?
With data contracts, the need for standardization is even more vital. As you will see, a data contract has multiple parts fulfilling different needs. For example, data quality can be evaluated by one vendor while another focuses on the history of stakeholders, and a third analyzes service-level objectives (SLOs). Human creativity becomes the limit.
If we had disparate standards, tons of information would be lost and time wasted. That's why PayPal originally open-sourced its data contract template, and the AIDA User Group, a non-profit, user-driven organization, assumed the responsibility to continue to develop, nurture, and foster the standard.
Companies like ProfitOptics understand the benefits and are helping customers implement and deploy data contracts. Contact us via GitHub to join the open standard working group.
Data contracts follow a standard called Open Data Contract Standard (ODCS). It currently operates under version v2.2, with much work going into v2.3, which will be upward compatible. ODCS embraces other standards and best practices like semver (Semantic Versioning), Kubernetes naming convention for YAML, and even idempotency.
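To make the versioning idea concrete, here is a minimal Python sketch of a semver-style compatibility check. It is not part of ODCS: the helper names `parse_version` and `can_read` are hypothetical, and the rule it encodes (same major version, contract minor version not newer than the tool's) is my reading of what "upward compatible" implies under Semantic Versioning.

```python
import re

# Hypothetical helper: accepts versions such as "v2.2", "2.2.2" or "v2.3.0".
_VERSION = re.compile(r"^v?(\d+)(?:\.(\d+))?(?:\.(\d+))?$")


def parse_version(text):
    """Return (major, minor, patch); missing parts default to 0."""
    match = _VERSION.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognizable version: {text!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def can_read(contract_version, tool_version):
    """Assumption: a tool built for a newer minor release can still read a
    contract written for an older one, as long as the major version matches.
    The patch number is ignored on purpose."""
    c_major, c_minor, _ = parse_version(contract_version)
    t_major, t_minor, _ = parse_version(tool_version)
    return c_major == t_major and c_minor <= t_minor


print(can_read("v2.2", "v2.3"))  # a v2.2 contract read by v2.3 tooling
print(can_read("v2.3", "v2.2"))  # the reverse direction is not guaranteed
```

Under these assumptions, a contract written for v2.2 stays usable once tooling moves to v2.3, while a change of major version would call for an explicit migration.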
The contract covers eight categories:
This section contains general information about the contract, like name, domain, version, and much room for information.
This section describes the dataset and the schema of the data contract. It is the foundation for data quality (detailed in the next section). A data contract focuses on a single dataset with several tables (and, obviously, columns).
This category describes data quality rules & parameters and is tightly linked to the dataset & schema section above. For more details, check out my 2018 piece on Data Quality.
This section, currently experimental, explains pricing if/when you bill your customer for using this data product.
This critical data contract section lists all stakeholders and the history of their relation with this data contract.
This section outlines the roles that a consumer may need to access the dataset depending on their required access.
This section describes the service-level agreements (SLAs) in the data contract. Unfortunately, data and SLAs are not documented enough just yet. Stay tuned for more!
This section covers custom & other properties in a data contract using a list of key/value pairs. It offers flexibility without requiring you to create a new template version whenever someone needs additional properties.
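As a rough illustration of the key/value idea, the Python sketch below turns a list of custom properties into a simple lookup. The dict stands for what a YAML parser would return; the field names `customProperties`, `property`, and `value` reflect my understanding of ODCS v2.2 and should be checked against the published standard, and the two properties themselves (`costCenter`, `retentionDays`) are invented for the example.

```python
# A contract fragment as it might look once the YAML file has been parsed.
# Field names are assumed from ODCS v2.2; the properties are hypothetical.
contract = {
    "customProperties": [
        {"property": "costCenter", "value": "hr-analytics"},
        {"property": "retentionDays", "value": 90},
    ],
}


def custom_properties(contract):
    """Flatten the list of key/value pairs into a plain dictionary."""
    return {
        item["property"]: item["value"]
        for item in contract.get("customProperties", [])
    }


props = custom_properties(contract)
print(props.get("retentionDays"))
print(props.get("notDefined", "no such property"))
```

Because the properties live in a list rather than in fixed fields, a team can add a new one without waiting for a new version of the template.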
Figure 1 below shows the eight categories, as well as the stakeholders. I will focus on the stakeholders, tools, and integration in a future article.
I know it's been a lot of theory so far, but it looks like you've stuck with me! Let's apply these concepts and look at a few examples.
In this example, the data contract defines a column called "txn_ref_dt", which comes from a table called "tbl".
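Here is a minimal sketch of what such a definition could look like once the YAML contract is parsed into Python. The field names (`dataset`, `table`, `columns`, `column`, `logicalType`, `physicalType`) follow my reading of ODCS v2.2 and should be verified against the standard; the `date` type is an assumption suggested by the column name, and `column_types` is a hypothetical helper.

```python
# A contract fragment as it might look after parsing the YAML file.
# Assumptions: ODCS v2.2 field names, and "date" as the type of the column.
contract = {
    "dataset": [
        {
            "table": "tbl",
            "columns": [
                {
                    "column": "txn_ref_dt",
                    "logicalType": "date",
                    "physicalType": "date",
                },
            ],
        },
    ],
}


def column_types(contract, table_name, column_name):
    """Return (logicalType, physicalType) for one column, or None if absent."""
    for table in contract.get("dataset", []):
        if table.get("table") != table_name:
            continue
        for column in table.get("columns", []):
            if column.get("column") == column_name:
                return column.get("logicalType"), column.get("physicalType")
    return None


print(column_types(contract, "tbl", "txn_ref_dt"))
```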
As you can see, the contract details the column's logical and physical types. They are the same in this example, but this will not always be the case.
Another key feature of the data contract is its ability to play well with others. The notion of authoritative definition plays a critical role.
In the following example, the column "rcvr_cntry_code" is defined in Collibra as a specific asset. As this column results from a transformation, the reference implementation is in GitHub, and the contract user can find out all about it. Knowing the authorities is one of the keys to computational or data governance.
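A sketch of how a tool might read those authorities from a parsed contract follows. The `authoritativeDefinitions` list with `url` and `type` entries reflects my understanding of ODCS v2.2; the URLs are placeholders, the type labels are illustrative, and `find_authority` is a hypothetical helper.

```python
# One column of a parsed contract. URLs are placeholders, not real assets.
column = {
    "column": "rcvr_cntry_code",
    "authoritativeDefinitions": [
        {
            "url": "https://collibra.example.com/asset/<asset-id>",
            "type": "business definition",
        },
        {
            "url": "https://github.example.com/<org>/<repo>",
            "type": "reference implementation",
        },
    ],
}


def find_authority(column, definition_type):
    """Return the URL of the first authority of the given type, or None."""
    wanted = definition_type.lower()
    for definition in column.get("authoritativeDefinitions", []):
        if definition.get("type", "").lower() == wanted:
            return definition.get("url")
    return None


print(find_authority(column, "reference implementation"))
```

The contract does not copy the definition itself; it points to the system that owns it, so the catalog and the code repository remain the single sources of truth.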
I hope I convinced you of the importance of data contracts. They quantify trust, offer excellent documentation, provide flexibility, and do much more.
What's next for you? Experiment, build your own data contract on a simple dataset, understand its benefits, and always remember that ProfitOptics can help.