Your Data Quality Checks Are Worth Less (Than You Think)
How to deliver outsized value on your data quality program
Photo by Wolfgang Weiser on Unsplash
Over the last several years, data quality and observability have become hot topics. There is a huge array of solutions in the space (in no particular order, and certainly not exhaustive):
dbt tests
SQLMesh audits
Monte Carlo
Great Expectations
Soda
Sifflet
Regardless of their specific features, all of these tools share a similar goal: improve visibility into data quality issues, reduce the number of data incidents, and increase trust. Despite the lower barrier to entry these tools provide, however, data quality programs remain difficult to implement successfully. I believe there are three pieces of low-hanging fruit that can improve your outcomes. Let’s dive in!
Hint 1: Focus on process failures, not bad records (when you can)
For engineering-minded folks, it can be a hard pill to swallow that some number of “bad” records will flow not only into your system but through it, and that may be OK! Consider the following:
1. Will the bad records flush out when corrected in the source system? If so, you may go to extraordinary lengths in your warehouse or lakehouse to correct data that is trivial for a source system operator to fix, only for your reporting to come out correct on the next refresh anyway.
2. Is the dataset useful if it’s “directionally correct” in aggregate? CRM data is a classic example, since many fields must be populated manually and the error rate is relatively high compared to automated processes. Even if these errors aren’t corrected, as long as they aren’t systemic, the dataset may still be useful.
3. Is the accuracy of individual records extremely important? Financial reporting, operational reporting on sensor data from expensive machinery, and other “spot-critical” use cases deserve the time and effort needed to identify (and possibly isolate, remove, or remediate) bad records.
If your data product can tolerate Type 1 or Type 2 issues (the first two scenarios above), fantastic! You can save a lot of effort by focusing on detecting and alerting on process failures rather than one-off or limited anomalies. You can monitor high-level metrics derived from metadata, such as record counts, unique counts of key columns, and min/max values. A rogue process in your application or SaaS systems can generate too many or too few records, or a new enumerated value may appear in a column unexpectedly. Depending on your specific use cases, you may need to write custom tests (e.g., total revenue by date and market segment or region), so make sure to profile your data and common failure scenarios.
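As a rough illustration, here is what a handful of metadata-level checks might look like in Python with pandas. The table shape, column names (order_id, market_segment, order_date), and thresholds are all hypothetical; in practice you would express the same checks in whichever tool you already use (dbt tests, SQLMesh audits, Great Expectations, Soda, and so on).

```python
# A minimal sketch of process-level checks on a daily extract.
# Column names and thresholds are illustrative, not prescriptive.
import pandas as pd

def check_daily_orders(df: pd.DataFrame) -> list[str]:
    """Return human-readable failures for a (hypothetical) daily orders extract."""
    failures = []

    # Far too few (or too many) records usually signals a broken upstream
    # process, not a handful of bad individual records.
    if not 10_000 <= len(df) <= 1_000_000:
        failures.append(f"record count out of expected range: {len(df)}")

    # The key column should be unique; duplicates often point to a bad join
    # or a re-run upstream job.
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values detected")

    # An unexpected enumerated value suggests a schema or process change.
    known_segments = {"enterprise", "mid_market", "smb"}
    unknown = set(df["market_segment"].dropna().unique()) - known_segments
    if unknown:
        failures.append(f"unexpected market_segment values: {unknown}")

    # Min/max bounds catch rogue backfills or clock issues
    # (assumes order_date is already a datetime column).
    if df["order_date"].min() < pd.Timestamp("2020-01-01"):
        failures.append("order_date earlier than expected history window")

    return failures
```

The point is that every check reads a cheap aggregate rather than inspecting individual rows, which keeps the tests fast and the alerts focused on process failures.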
On the other hand, Type 3 issues require more complex systems and decisions. Do you move bad records to a dead-letter queue and send an alert for manual remediation? Do you build a self-healing process for well-understood data quality issues? Do you simply modify the record in some way to indicate the data quality issue so...
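For completeness, here is a minimal sketch of the dead-letter pattern mentioned above. The row-level rule, output path, and alerting call are placeholders; a production version would write to a proper dead-letter table or queue and notify the owning team through your alerting stack.

```python
# Minimal sketch of a dead-letter split for record-level (Type 3) issues.
# The validation rule, path, and alert are illustrative placeholders.
import pandas as pd

def split_bad_records(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Separate records that fail a row-level rule from those that pass."""
    # Example rule: financial amounts must be present and non-negative.
    is_valid = df["amount"].notna() & (df["amount"] >= 0)
    return df[is_valid], df[~is_valid]

def process_batch(df: pd.DataFrame) -> pd.DataFrame:
    good, bad = split_bad_records(df)
    if not bad.empty:
        # Persist failures for manual (or automated) remediation,
        # then alert whoever owns the source process.
        bad.to_parquet("dead_letter/orders_bad_records.parquet")
        print(f"ALERT: {len(bad)} records routed to dead-letter storage")
    return good
```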