What is data quality and how it works (2024)

Today, making decisions based on data is a must for successful organizations. In this context, the importance of data quality cannot be overstated.

Organizations rely on data to make critical decisions, and the quality of this data directly impacts the accuracy and effectiveness of these decisions.

At LoopStudio, we have seen the real importance of data quality firsthand while building out our data engineering services, and now we want to share what we have learned with you.

Data quality is a key concept for any Data Engineer to understand, so in this post we explore the essence of data quality, why it is crucial, and the various types that organizations need to consider.

Why do we need data quality?

Data is the lifeblood of modern enterprises, powering everything from strategic decision-making to daily operations. Ensuring high data quality is vital for several reasons:

1. Informed Decision-Making

High-quality data enables organizations to make well-informed decisions. Accurate, timely, and relevant data allows leaders to develop strategies that are grounded in reality, reducing risks and improving outcomes.

2. Operational Efficiency

Poor data quality can lead to operational inefficiencies. For example, incorrect customer information can result in failed deliveries, wasted resources, and dissatisfied customers. Ensuring data quality helps streamline operations and reduces costs.

3. Regulatory Compliance

Many industries are subject to stringent regulations that require accurate and complete data. High data quality helps organizations comply with these regulations, avoiding potential fines and legal issues.

4. Customer Trust

Inaccurate data can erode customer trust. Ensuring that data is correct and up-to-date is essential for maintaining strong relationships with customers and safeguarding the organization’s reputation.

But what is Data Quality?

Data quality refers to the condition of data based on factors such as accuracy, completeness, reliability, and relevance. High-quality data accurately represents real-world conditions and is fit for its intended use in operations, decision-making, and planning.

We must remember that data is nothing more than a representation of transactions or events that happened in the real world. So, as the quality of our data improves, we get a clearer view of the reality we are trying to understand.

Here are some key dimensions of data quality:

  1. Accuracy: Data must be correct and free from errors. Inaccurate data can lead to faulty analyses and misguided decisions.
  2. Completeness: All necessary data should be present. Missing data can result in incomplete analyses and overlooked insights.
  3. Consistency: Data should be consistent across different systems and datasets. Inconsistencies can cause confusion and misinterpretations.
  4. Timeliness: Data must be up-to-date and available when needed. Outdated data can be irrelevant and misleading.
  5. Validity: Data should conform to defined formats and standards. Invalid data can cause system errors and unreliable results.
  6. Uniqueness: Each record should be unique, without duplication. Duplicate data can inflate figures and distort analyses.
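
To make these dimensions concrete, here is a minimal sketch (the column names and data are hypothetical, and the checks are only one possible approach) of how a few of them can be verified on a pandas DataFrame:

import pandas as pd

# Hypothetical orders data with a few deliberate quality issues
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 4],
        "customer_email": ["a@example.com", None, "b@example", "c@example.com"],
        "amount": [100.0, 250.5, 250.5, -10.0],
    }
)

# Completeness: how many required values are missing?
missing_emails = orders["customer_email"].isna().sum()

# Uniqueness: are any primary keys duplicated?
duplicate_ids = orders["order_id"].duplicated().sum()

# Validity: do values conform to the expected format or range?
invalid_emails = (~orders["customer_email"].str.contains(r".+@.+\..+", na=False)).sum()
negative_amounts = (orders["amount"] < 0).sum()

print(missing_emails, duplicate_ids, invalid_emails, negative_amounts)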

Types of Data Quality

Understanding the types of data quality issues can help organizations address and prevent them effectively. Here are some common types:

1. Structural Data Quality

This involves the format and structure of data. Ensuring data is correctly formatted and adheres to defined schemas is crucial. Examples include proper date formats, consistent use of units of measure, and standardized coding systems.

2. Content Data Quality

This focuses on the accuracy and reliability of the data content. Ensuring that data values are correct and reflect the real world is essential. Examples include correct spelling of names, accurate numerical values, and valid categorical entries.

3. Contextual Data Quality

This type relates to the relevance and appropriateness of data within a given context. Ensuring that data is suitable for its intended use is important. Examples include relevance of data for specific analyses and appropriateness of historical data for trend analysis.

4. Temporal Data Quality

This involves the timeliness and currency of data. Ensuring that data is up-to-date and reflects the most current information is critical. Examples include real-time data updates and synchronization across systems.

5. Referential Data Quality

This focuses on the integrity of relationships within data. Ensuring that data maintains valid references and linkages is key. Examples include maintaining foreign key constraints and ensuring referential integrity in relational databases.

Data Quality Hands-on

We have discussed the theory and concepts of data quality, so let’s delve into how these concepts apply in a real scenario. Assume we work at a company with the following data warehousing setup:

Using Airflow as an orchestrator, we move data from several systems and endpoints into the data warehouse instance of Snowflake, where we clean and transform the data using dbt, to feed several dashboards and charts on Metabase.

At LoopStudio, we have strong experience using this kind of architecture, which has several advantages in terms of consistency and data quality. That’s why we are using it as an example.
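
To make the setup concrete, here is a minimal sketch of what the orchestration layer could look like, assuming Airflow 2.x; the DAG, task, and function names are illustrative, and the extract and load logic is only stubbed out:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_from_source(**context):
    # Pull data from an upstream system or endpoint (placeholder)
    ...


def load_to_snowflake(**context):
    # Copy the extracted data into the raw schema in Snowflake (placeholder)
    ...


with DAG(
    dag_id="elt_pipeline_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_source)
    load = PythonOperator(task_id="load", python_callable=load_to_snowflake)
    # dbt transforms the raw data inside Snowflake (assumes dbt is available on the worker);
    # Metabase reads the resulting tables directly from the warehouse
    transform = BashOperator(task_id="dbt_run", bash_command="dbt run")

    extract >> load >> transform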

Error handling

The first step to ensure data quality in our process is the most intuitive one: error handling in the pipelines. In every DAG we create, errors and the actions that follow them should be handled according to best practices in DAG writing.

Alerting on failure is a key element of any pipeline or ETL process. A message on Slack or an email is a great way to stay aware of any failure that may arise. It’s imperative to know as soon as possible when a pipeline is failing and, as a consequence, downstream processes are being affected.

To share a possible implementation, see this code in which we set up a callback in the default arguments of our Airflow DAG.

# days_ago comes from Airflow's utilities; RETRIES is a project-level constant
from airflow.utils.dates import days_ago

default_args = {
    "start_date": days_ago(1),
    "retries": RETRIES,
    "on_failure_callback": task_failure_alert,
}

And this callback calls a function that publishes a message to an SNS topic on AWS:

def task_failure_alert(context):
    alert_message = (
        f"Task has failed, task_instance_key_str: {context['task_instance_key_str']}, "
        f"exception: {context['exception']}"
    )
    print(alert_message)
    # sns is a project-specific helper that publishes the alert to an AWS SNS topic
    sns.alert(alert_message)

This way, failures can be quickly noticed and tackled by the development team.

Expected output

The fact that our pipeline runs does not mean the data is arriving successfully. We could be building data products on top of missing or incorrect data. It is important that the data landing in the warehouse, and the data we transform, meets expectations.

That is why it is important to add data quality tests. These tests can be written natively in dbt or added through external packages. One of our favorites is Great Expectations, whose checks can be used in dbt through the dbt_expectations package.

models:
  - name: orders                 # hypothetical model name
    columns:
      - name: created_at         # hypothetical timestamp column under test
        tests:
          - dbt_expectations.expect_row_values_to_have_recent_data:
              datepart: day
              interval: 1
              row_condition: 'id is not null'

In this example, we can see that the test looks for recent data in a given column. If the column has no records for the last day, a warning will be raised. This way, even if the model builds successfully, we can verify that the update is working as expected.

This package brings many useful checks for our dbt jobs. Verifying that a newly built model has rows, that columns have the correct data types, or that a time column has records for the last couple of days can be incredibly beneficial for the reliability of our data products.
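
For instance, a schema.yml sketch along these lines (the model and column names are hypothetical, and the dbt_expectations package is assumed to be installed) can check both row counts and column types:

models:
  - name: orders                    # hypothetical model name
    tests:
      # The rebuilt model should never be empty
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1
    columns:
      - name: amount
        tests:
          # The column should keep the data type we expect
          - dbt_expectations.expect_column_values_to_be_of_type:
              column_type: number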

With these tests in place, we can be confident that if our pipelines do not fail, they produce the expected results, and that otherwise we will be alerted. It’s crucial that these failures surface during the pipeline run, not during the decision-making process.

Anomaly detection

We are now confident that our pipelines will handle errors accordingly and that the tables we build will have the format and output we expect. But is this enough to guarantee sufficient data quality?

Imagine we have a dashboard with critical records that is not consumed frequently, and for some periods the row count for certain values drops by 50% because they are not being loaded. A decision-maker looking at the numbers might spot the error, but the pipeline will not.

Airflow won’t find any error while processing the data, and Great Expectations will confirm that rows are being generated with the correct data types. Yet we still have an error, and none of these methods will make us aware of it.

That’s why we should also integrate anomaly detection tests into our tables. These tests look at historical data and check whether the row count is consistent or varies unexpectedly. This way, we can be alerted when something odd happens and take action if required.

A good example of a tool that provides this kind of test is Elementary, a dbt-native package that lets you track table volume anomalies, among other kinds of tests such as schema changes and column anomalies.
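
As an illustration, assuming the elementary dbt package is installed (the model and column names below are hypothetical), these anomaly tests can be attached in schema.yml much like any other dbt test:

models:
  - name: orders                        # hypothetical model name
    tests:
      # Compare current row counts against the learned historical baseline
      - elementary.volume_anomalies:
          timestamp_column: created_at  # hypothetical timestamp column
      # Alert when columns are added, removed, or change their type
      - elementary.schema_changes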

Not every detected anomaly should trigger an action. Real anomalies happen, and when they do, we can expect the data to look different. But sometimes an anomaly in the data is not a reflection of reality but an error in the process, and that is when we need to take action in response.

Conclusion

In conclusion, data quality is a critical component of any organization’s data strategy. High-quality data is essential for accurate decision-making, operational efficiency, regulatory compliance, and customer trust.

At LoopStudio, we have seen firsthand how vital data quality is in our projects, and we are committed to sharing our insights and best practices. By understanding and addressing the various types of data quality issues, organizations can ensure their data remains a valuable and reliable asset.

Ensuring data quality is an ongoing process that requires continuous monitoring, validation, and improvement. As data continues to grow in volume and complexity, the need for robust data quality practices will only become more important.

Implementing comprehensive error handling, expected output verification, and anomaly detection will help maintain high data quality standards, enabling organizations to leverage their data effectively and confidently.