In this article, I introduce the notion of Data Quality of Service (Data QoS), the result of combining Data Quality (DQ) with Service-Level Agreements (SLAs). I'll start with an overview of the concept, then dive into the elements composing Data QoS, focusing on data quality dimensions and service-level indicators and explaining how you can group them together.
As your need for observing data grows with the maturity of your business, you realize that the growing number of attributes you want to measure brings more complexity than clarity.
That's why, back in 2021, taking inspiration from Mendeleev's work on classifying the chemical elements, I came up with the idea of combining data quality and service levels into a single table.
Quality of Service (QoS) is a well-established concept in network engineering. It measures the overall performance of a service, such as a telephony, computer network, or cloud computing service, particularly the quality seen by the network users. In networking, you must consider several criteria to quantitatively measure the QoS, such as packet loss, bit rate, throughput, transmission delay, and availability.
Regarding data, the industry standard for trust has often been limited to data quality — something I've agreed with for a long time.
At the 2017 Spark Summit, I even introduced CACTAR (Consistency, Accuracy, Completeness, Timeliness, Accessibility, and Reliability), an acronym for six data quality dimensions relayed in this Medium article. And although there is no official standard, the EDM Council added a 7th one.
Let's break down the seven data quality dimensions:
The measurement of the veracity of data against its authoritative source. Data might be provided, but that doesn't mean it's correct.
Accuracy refers to how precise data is. It can be assessed by comparing it to the original documents and trusted sources or confirming it against business rules.
Examples: a customer's recorded birth date matches the one on their ID; a product price matches the supplier's catalog.
Fun fact: Many accuracy problems come from the data input. If you have data entry people on your team, reward them for accuracy, not only speed!
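If you want to see what such a check looks like in practice, here is a minimal sketch in Python. Everything in it is hypothetical (the records, the trusted reference set), and it assumes pandas; a real check would compare against your actual authoritative source.

```python
import pandas as pd

# Hypothetical records and a made-up authoritative reference set.
records = pd.DataFrame({"id": [1, 2, 3], "country": ["FR", "DE", "XX"]})
trusted_countries = {"FR", "DE", "US"}  # values confirmed by the source of truth

# Accuracy: share of values that match the authoritative source.
accurate = records["country"].isin(trusted_countries)
print(f"Accuracy: {accurate.mean():.0%}")  # -> Accuracy: 67%
```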
Data is required to be populated with a value (aka not null, not nullable). Completeness checks if all necessary data attributes are present in the dataset.
Examples: every order row carries a non-null customer ID; no customer record is missing its email address.
Fun fact: A primary key is always a required field.
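A completeness check is easy to sketch. The following is a minimal, hypothetical example assuming pandas; `customer_id` and `email` stand in for whatever attributes your dataset requires.

```python
import pandas as pd

# Hypothetical dataset with required attributes that may be missing.
df = pd.DataFrame({
    "customer_id": [1, 2, None],
    "email": ["a@example.com", None, "c@example.com"],
})

# Completeness per required column: share of non-null values.
required = ["customer_id", "email"]
print(df[required].notna().mean())
# customer_id    0.666667
# email          0.666667
```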
Data content must align with required standards, syntax (format, type, range), or permissible domain values. Conformity assesses how closely data adheres to internal, external, or industry-wide standards.
Examples: a French postal code contains exactly five digits; a country code is drawn from the ISO 3166 list.
Fun fact: ISO country codes are two or three letters (like FR and FRA for France). If you mix the two up in the same dataset, it's not a conformity problem; it's a consistency problem.
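Here is a minimal conformity sketch in Python, assuming pandas; the two-uppercase-letter pattern for ISO 3166-1 alpha-2 codes is just one possible syntax rule.

```python
import pandas as pd

# Hypothetical column that should contain ISO 3166-1 alpha-2 codes only.
df = pd.DataFrame({"country_code": ["FR", "FRA", "fr", "DE"]})

# Conformity: does each value match the expected syntax (two uppercase letters)?
conforming = df["country_code"].str.fullmatch(r"[A-Z]{2}")
print(f"Conformity: {conforming.mean():.0%}")  # -> Conformity: 50%
```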
Data should retain consistent content across data stores. Consistency ensures that data values, formats, and definitions in one group match those in another.
Examples: a customer's birth date is identical in the CRM and the billing system; amounts are expressed in the same currency across tables.
Fun fact: I was born in France on May 10th, 1971, but I am a Libra (October). When expressed as strings, dates go through a localization filter: my birth date is written 10/05/1971 in Europe but 05/10/1971 in the U.S., so an American reading the European form sees October 5th, which makes me a Libra.
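A consistency check compares matching records across stores. Below is a hypothetical sketch assuming pandas, with two made-up stores (`crm` and `billing`) holding the same customers; notice how the date-format story above surfaces as a mismatch.

```python
import pandas as pd

# Hypothetical copies of the same customers in two data stores.
crm = pd.DataFrame({"id": [1, 2], "birth_date": ["1971-05-10", "1980-01-01"]})
billing = pd.DataFrame({"id": [1, 2], "birth_date": ["1971-10-05", "1980-01-01"]})

# Consistency: do matching records carry the same value in both stores?
merged = crm.merge(billing, on="id", suffixes=("_crm", "_billing"))
consistent = merged["birth_date_crm"] == merged["birth_date_billing"]
print(f"Consistency: {consistent.mean():.0%}")  # -> Consistency: 50%
```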
All records are contained in a data store or data source. Coverage relates to the extent to which all the data that should be in a dataset is actually present.
Examples: the warehouse contains orders from every region, not just Europe; all twelve months of transactions were loaded, not eleven.
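In its simplest form, coverage is a ratio of what landed versus what exists at the source. A hypothetical sketch (the counts are made up):

```python
# Hypothetical counts: rows in the system of record vs. rows that landed.
source_count = 1_000_000
loaded_count = 998_500

coverage = loaded_count / source_count
print(f"Coverage: {coverage:.2%}")  # -> Coverage: 99.85%
```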
The data must represent current conditions and be available and accessible when needed. Timeliness gauges how well data reflects current market and business conditions and whether it is available at the moment of consumption.
Examples: stock quotes are refreshed within seconds of a trade; the address table reflects last month's moves.
Fun fact: Forty-five million Americans change addresses every year.
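A basic timeliness (freshness) check compares a dataset's last-update watermark against a staleness objective. A minimal sketch with hypothetical timestamps, using only the Python standard library:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical watermark: when the dataset was last refreshed.
last_updated = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
max_staleness = timedelta(hours=24)  # freshness objective

age = datetime.now(timezone.utc) - last_updated
print(f"Data age: {age}, timely: {age <= max_staleness}")
```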
How much of the data is duplicated? Uniqueness supports the idea that no record or attribute is recorded more than once: each record and attribute should be one-of-a-kind, aiming for a single, unique data entry (yeah, one can dream, right?).
Examples: each customer appears only once in the customer table; a retried job does not load the same transaction twice.
Fun fact: data replication is not bad per se; involuntary data replication is!
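Measuring uniqueness boils down to counting duplicates on the key that is supposed to be unique. A hypothetical sketch assuming pandas:

```python
import pandas as pd

# Hypothetical table where email is supposed to be unique.
df = pd.DataFrame({"email": ["a@example.com", "b@example.com", "a@example.com"]})

duplicates = df["email"].duplicated().sum()  # rows beyond the first occurrence
uniqueness = 1 - duplicates / len(df)
print(f"Uniqueness: {uniqueness:.0%}")  # -> Uniqueness: 67%
```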
Those seven dimensions are pretty well-rounded. As an industry, it's probably time to say, "Good enough." Of course, it completely ruins my CACTAR acronym (and its great backstory).
But I still feel it is not enough. Data quality does not answer questions about end-of-life, retention periods, or time to repair when something breaks.
So now let's look at service levels.
While data quality describes the condition of the data itself, service levels give you precious information about the expectations around its availability, delivery, and more.
Here is a list of service-level indicators (SLIs) you can apply to your data and its delivery. You will have to set objectives for your production systems (service-level objectives, or SLOs) and agree with your users on their expectations (service-level agreements, or SLAs).
In simple terms, is my database accessible? A data source may become inaccessible for various reasons, such as server issues or network interruptions. The fundamental requirement is for the database to respond affirmatively when you call JDBC's connect() method.
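In Python, such a probe could look like the sketch below. I'm using sqlite3 purely as a stand-in for whatever driver you actually use; with JDBC, the equivalent is obtaining a connection through DriverManager.

```python
import sqlite3

def is_available(dsn: str, timeout_s: float = 5.0) -> bool:
    """Return True if the database accepts a connection and answers a trivial query."""
    try:
        conn = sqlite3.connect(dsn, timeout=timeout_s)
        try:
            conn.execute("SELECT 1")
            return True
        finally:
            conn.close()
    except sqlite3.Error:
        return False

print(is_available("example.db"))
```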
Throughput is about how fast I can access the data. It can be measured in bytes or records per unit of time.
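A throughput probe can be as simple as timing how fast you can consume the rows. A minimal, hypothetical Python sketch:

```python
import time

def records_per_second(rows) -> float:
    """Measure throughput as records consumed per second of wall-clock time."""
    start = time.perf_counter()
    count = sum(1 for _ in rows)  # stand-in for real read/processing work
    elapsed = time.perf_counter() - start
    return count / elapsed

print(f"{records_per_second(range(1_000_000)):,.0f} records/s")
```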
How often will your data have errors, and over what period? What is your tolerance for those errors?
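Computing the indicator itself is trivial once you can count bad records over a window; the hard part is defining "bad." A hypothetical sketch with made-up figures:

```python
# Hypothetical figures for a daily window.
bad_records = 42
total_records = 100_000
slo = 0.001  # objective: fewer than 0.1% erroneous records per day

error_rate = bad_records / total_records
print(f"Error rate: {error_rate:.3%}, within SLO: {error_rate <= slo}")
# -> Error rate: 0.042%, within SLO: True
```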
General availability in software and product management means the product is now ready for public use, fully functional, stable, and supported. Here, it applies to when the data will be available for consumption. If your consumers require it, it can be a date associated with a specific version (alpha, beta, v1.0.0, v1.2.0, etc.).
The date after which your product will no longer be supported.
For data, it means that the data may still be available after this date, but if you have an issue with it, you won't be offered a break-fix solution. It also means that you, as a consumer, will likely have to adopt a replacement version.
Fun fact: Windows 10 is supported until October 14, 2025.
The date at which your product will not be available anymore. No support, no access. Rien. Nothing. Nada. Nichts.
For data, this means that the connection will fail or the file will not be available. It can also be that the contract with an external data provider has ended.
Fun fact: Google Plus was shut down in April 2019. You can't access anything from Google's social network after this date.
How long are we keeping the records and documents? There is nothing extraordinary here. Like most service-level indicators, the retention period can vary by use case and legal requirements.
How often is your data updated? Daily? Weekly? Monthly? An indicator linked to this frequency is the time of availability, which applies well to daily batch updates.
Latency measures the time between the production of the data and its availability for consumption.
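If the data carries both a production timestamp and an availability timestamp (hypothetical metadata here), latency is a simple difference:

```python
from datetime import datetime

# Hypothetical timestamps carried as metadata with the dataset.
produced_at = datetime(2024, 1, 1, 0, 5)    # when the data was created upstream
available_at = datetime(2024, 1, 1, 6, 30)  # when consumers could first read it

latency = available_at - produced_at
print(f"Latency: {latency}")  # -> Latency: 6:25:00
```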
How fast can you detect a problem? A problem can either be a complete break, like your car not starting on a cold morning, or something slow and insidious, like the data feeding your SEC (U.S. Securities and Exchange Commission) filings being wrong for several months.
How fast do you guarantee the detection of the problem? You can also see this service-level indicator called "failure detection time."
Fun fact: squirrels (or some similar creature) ate the gas line on my wife's car. We detected the problem quickly, as the gauge dropped visibly even over a few miles. But could we even drive the car to the mechanic?
Once you see a problem, how much time do you need to notify your users? This is, of course, assuming you know your users.
How long do you need to fix the issue once it is detected? This is a prevalent metric for network operators running backbone-level fiber networks.
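These three indicators fall out of an incident timeline. The sketch below uses an entirely hypothetical timeline to show how time to detect, time to notify, and time to repair relate:

```python
from datetime import datetime

# Hypothetical incident timeline.
failure_at  = datetime(2024, 1, 1, 2, 0)   # the pipeline actually broke
detected_at = datetime(2024, 1, 1, 2, 20)  # monitoring raised an alert
notified_at = datetime(2024, 1, 1, 2, 35)  # consumers were informed
repaired_at = datetime(2024, 1, 1, 5, 0)   # correct data flowed again

print("Time to detect:", detected_at - failure_at)   # 0:20:00
print("Time to notify:", notified_at - detected_at)  # 0:15:00
print("Time to repair:", repaired_at - detected_at)  # 2:40:00
```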
Of course, more service-level indicators will emerge over time. Agreements follow indicators and can include penalties. You can see how the description of a service can become very complex.
To represent the elements, I needed to identify each element precisely on two axes: when it applies in the life cycle (its period) and how it relates to the other elements (its group).
Each element received additional attributes, as shown in the following illustration.
The periods are time-sensitive elements. Some orderings are pretty obvious: "end of life" definitely comes after "general availability."
Classification of some elements, however, is more subtle. For example, when data arrives in your new data store, you will check accuracy before consistency, and you can check uniqueness only once you have a significant amount of data. The elements have no strict chronological link, but they tend to happen in sequence.
The second classification to find was about grouping. How can we group those elements? Is there a logical relation between them that would make sense?
Here's what I came up with:
There are a lot of benefits to classifying and defining the elements forming Data QoS along the service-level indicators and the data quality dimensions.
In this article, I shared my strong feeling, developed over the years, that data quality alone is insufficient. Although data quality is becoming increasingly standardized, it still lacks service levels.
Service levels can have a profusion of indicators and are open-ended by nature. Combining data quality and service levels creates a higher-level set of dimensions and indicators, grouped together as Data QoS.
Data QoS can be represented as a Mendeleev-like periodic table, featuring each element in the context of its neighbors.
Do not hesitate to start a conversation and see how Data QoS can help your organization.