Data quality, the most boring but essential topic in data

At its core, data quality can be broken down into eight dimensions:

𝐀𝐜𝐜𝐮𝐫𝐚𝐜𝐲 – The correctness of the data, ensuring it reflects the real-world scenario it is intended to model and represent

𝐂𝐨𝐧𝐬𝐢𝐬𝐭𝐞𝐧𝐜𝐲 – Uniformity across data sources, ensuring that the data is the same between storage and usage and does not conflict across different datasets

𝐂𝐨𝐦𝐩𝐥𝐞𝐭𝐞𝐧𝐞𝐬𝐬 – All necessary data is present, with no missing elements that could impact analysis or decision-making

𝐈𝐧𝐭𝐞𝐠𝐫𝐢𝐭𝐲 – Data continues to be recorded and relationships maintained as intended

𝐂𝐨𝐧𝐟𝐨𝐫𝐦𝐢𝐭𝐲 – Data always follows standard definitions, set to ensure consistency around type, size and format

𝐓𝐢𝐦𝐞𝐥𝐢𝐧𝐞𝐬𝐬 – Data is up to date and available when needed, ensuring that decisions are based on the most current information

𝐔𝐧𝐢𝐪𝐮𝐞𝐧𝐞𝐬𝐬 – No duplicate records exist within the organisation, helping maintain the integrity and accuracy of the data (creating trust)

𝐕𝐚𝐥𝐢𝐝𝐢𝐭𝐲 – The data remains accurate and consistent throughout its lifecycle and conforms to the agreed structure, data quality standards/rules or list of values

Use these dimensions to help frame, understand and measure the quality of your data.

Check out my past newsletter article on this topic (https://lnkd.in/gF4ETX2y), where I define what data quality really means and the underlying root causes that lead to issues. The week after, we will jump into some approaches to fixing data quality issues, sharing what you should and should not do! If you still haven't signed up (and soaked up all that data ecosystem knowledge), subscribe for more!

#data #dataquality #datastrategy #dataecosystem #DylanDecodes
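A minimal sketch of how a few of these dimensions can be turned into automated checks, here in pandas against a hypothetical customer extract (the column names, email pattern and freshness window are illustrative assumptions, not a standard):

```python
import pandas as pd

# Hypothetical customer extract; column names are assumptions for illustration
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["[email protected]", None, "not-an-email", "[email protected]"],
    "created_at": ["2024-05-01", "2024-05-02", "2024-05-02", "2024-05-03"],
})

# Completeness: share of non-null values per required column
completeness = df[["customer_id", "email", "created_at"]].notna().mean()

# Uniqueness: duplicate records on the business key
duplicates = int(df["customer_id"].duplicated().sum())

# Validity: values conform to an agreed structure (simple email pattern)
validity = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).mean()

# Timeliness: age of the most recent record
age = pd.Timestamp.now() - pd.to_datetime(df["created_at"]).max()

print(completeness, duplicates, validity, age, sep="\n")
```

Each metric is a number you can track over time, which is what turns a dimension from a definition into something measurable.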
These are great, but coming from a software/UX world I would also expect one more: valuable. Why are we putting effort into collecting, cleaning, processing, storing and serving this data? It doesn't matter how spiffy it is if it's the wrong data for our users. Are all tables/columns equally valuable to our users? Do all our users value the different parts of the data in the same way? Etc.
I have twelve data quality dimensions:
Timeliness and Availability
Completeness
Uniqueness
Validity
Conformity
Uniformity
Consistency
Integrity
Accuracy
Reasonability and Anomalies and Outliers
Confidentiality and Security
Clarity and Usefulness

More reading: http://www.joakimdalby.dk/HTM/DimensionalModeling.htm#SectionDQ
Most of my work with data has been related to addressing quality issues. Probably most here have heard the phrase "garbage in, garbage out". The worst thing I have found to date is an internal user who input an emoji character into a phone number field. Another one: for an e-mail field I have repeatedly seen something along the lines of "[email protected]". That sort of behavior should be penalized somehow, considering that the fields can be left blank.

In terms of data validation, fields should accept only certain characters and adhere to a certain format, which can be enforced via a regex rule whenever the end user can input values freely. And when the possible values of a field are known, it is better to implement a drop-down list than to allow free-text input. As with other things, detecting and addressing data quality issues requires additional work and time. That's why, a few years back, when somebody mentioned out of the blue that we were going to implement a data warehouse, I knew it was a pipe dream. With automation and the availability of some new tools, I can finally see that happening in the long run.
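A minimal sketch of the kind of field-level validation described here (the patterns and country list are illustrative assumptions). Note that a regex catches the emoji but not a well-formed fake address like "[email protected]", which is exactly why free-text fields stay hard to police:

```python
import re

PHONE_RE = re.compile(r"^\+?[0-9 ()\-]{7,15}$")  # digits and common separators only
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
COUNTRIES = {"DK", "SE", "DE"}  # when values are known, use a fixed list, not free text

def validate(phone: str, email: str, country: str) -> list[str]:
    errors = []
    if phone and not PHONE_RE.match(phone):  # blank is allowed; an emoji is not
        errors.append("invalid phone")
    if email and not EMAIL_RE.match(email):
        errors.append("invalid email")
    if country and country not in COUNTRIES:
        errors.append("unknown country")
    return errors

print(validate("📞", "[email protected]", "DK"))  # -> ['invalid phone']
```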
The more explicit you can be about exactly how these dimensions are measured, the better your data will be. Another thing I have found is that it's easier to spot data quality problems in the data warehouse (vs in the system of record), but the best place to clean them up is in the business system. Figuring that handshake out is incredibly important.
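One way that handshake could look, sketched below: define the dimension as an explicit function with an agreed threshold, measure it in the warehouse, and route the offending records back towards the business system for the actual fix (the column names, the 98% threshold and the print-based ticketing stub are all assumptions for illustration):

```python
import pandas as pd

def completeness(series: pd.Series) -> float:
    """Explicit, agreed definition: share of non-null values (0.0-1.0)."""
    return float(series.notna().mean())

def flag_for_source_fix(df: pd.DataFrame, mask: pd.Series, rule: str) -> None:
    # Stand-in for raising a ticket against the business system,
    # where the fix should actually be made
    for key in df.loc[mask, "customer_id"]:
        print(f"[{rule}] fix at source: customer_id={key}")

df = pd.DataFrame({"customer_id": [1, 2, 3],
                   "email": ["[email protected]", None, "[email protected]"]})

score = completeness(df["email"])  # measured in the warehouse
if score < 0.98:                   # explicit threshold agreed with the business
    flag_for_source_fix(df, df["email"].isna(), "email_completeness")
```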
Data quality is the foundation you can't ignore. I don't think it's boring. What is boring is having to unpick results to find where the input data was not of quality. I think a few more dimensions need to be added, such as source: where did the data come from? Also think FAIR and ALCOA+.
Also important to understand that many of the data quality issues listed here can be caused by things unrelated to the underlying data quality. For example, a failure to orchestrate data pipelines and to notice/recover from failed runs can lead to incomplete data (even if the underlying source data is complete).
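A minimal sketch of one way to separate the two cases, assuming per-day row counts are available from both the source and the warehouse:

```python
# Sketch: reconcile per-day row counts between source and warehouse to tell
# a failed/partial load apart from genuinely incomplete source data
def reconcile(source_counts: dict[str, int], warehouse_counts: dict[str, int]) -> list[str]:
    problems = []
    for load_date, expected in source_counts.items():
        loaded = warehouse_counts.get(load_date, 0)
        if loaded < expected:
            # The data exists upstream but never arrived: an orchestration
            # problem, not an underlying data quality problem
            problems.append(f"{load_date}: {loaded}/{expected} rows loaded")
    return problems

print(reconcile({"2024-05-01": 1000, "2024-05-02": 1200},
                {"2024-05-01": 1000, "2024-05-02": 800}))
# -> ['2024-05-02: 800/1200 rows loaded']
```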
I think we can add "Relevance" in there somewhere. Sometimes you have the most accurate, clean and error-free data, yet it can still be low quality if the information itself is not relevant to the business.
Implement automated data quality scoring across these dimensions, then tie executive compensation to those metrics. What gets measured AND incentivized gets managed. This approach transformed data governance at multiple Fortune 500s I've advised.
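A minimal sketch of what such a score could look like, assuming per-dimension scores between 0.0 and 1.0; the dimensions and weights below are illustrative assumptions, not a standard weighting:

```python
# Roll per-dimension scores into one weighted quality KPI that can be tracked
WEIGHTS = {"accuracy": 0.25, "completeness": 0.25, "timeliness": 0.20,
           "uniqueness": 0.15, "validity": 0.15}

def quality_score(scores: dict[str, float]) -> float:
    # Missing dimensions count as 0.0, so gaps in measurement hurt the score
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)

print(quality_score({"accuracy": 0.97, "completeness": 0.92, "timeliness": 0.99,
                     "uniqueness": 1.00, "validity": 0.95}))  # ≈ 0.963
```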
Great breakdown of things to consider when looking at data quality. I had countless headaches thanks to missing conformity between data sources 😅
Dylan, I am just going to do AI now, so I don't need to worry about this. Fire the employees.