Data Quality Lec 3
Shakeel Waris
Data helps people and organizations make more informed decisions, significantly increasing the
likelihood of success. That would seem to suggest that large amounts of data are always a good thing.
However, that’s not always the case. Sometimes data is incomplete, incorrect, redundant, or not
applicable to the user’s needs.
Fortunately, we have the concept of data quality to help make the job easier. So let's explore what data quality is, what its characteristics and best practices are, and how we can use it to make data better.
What Is Data Quality, in Practical Terms?
Data quality measures the condition of data based on factors such as how useful it is for a specific purpose, its completeness, accuracy, timeliness (e.g., is it up to date?), consistency, validity, and uniqueness.
Data quality analysts are responsible for conducting data quality assessments, which involve assessing and interpreting each data quality metric. The analyst then combines these into an aggregate score reflecting the data's overall quality and gives the organization a percentage rating that shows how accurate the data is.
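As a rough sketch of how an analyst might combine per-dimension scores into one aggregate rating, the snippet below takes a weighted average. All of the scores and weights here are assumed for illustration; a real assessment would derive them from the organization's own metrics.

```python
# Hypothetical per-dimension scores: the fraction of records passing
# each dimension's checks (illustrative values).
dimension_scores = {
    "completeness": 0.96,
    "accuracy": 0.91,
    "timeliness": 0.88,
    "consistency": 0.94,
    "validity": 0.97,
    "uniqueness": 0.99,
}

# Assumed weights reflecting how much each dimension matters for this
# use case; a real assessment would set these with the data consumers.
weights = {
    "completeness": 0.25,
    "accuracy": 0.25,
    "timeliness": 0.10,
    "consistency": 0.15,
    "validity": 0.15,
    "uniqueness": 0.10,
}

# Weighted average gives the overall percentage rating.
aggregate = sum(dimension_scores[d] * weights[d] for d in dimension_scores)
print(f"Overall data quality: {aggregate:.1%}")  # Overall data quality: 94.1%
```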
To put the definition in more direct terms, data quality indicates how good the data is and how useful it
is for the task at hand. But the term also refers to planning, implementing, and controlling the activities that apply the quality management practices and techniques needed to ensure the data is actionable and valuable to its consumers.
Completeness
Completeness measures whether all the values a dataset is required to contain are actually present.
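A minimal sketch of a completeness check, assuming the data sits in a pandas DataFrame and that the mandatory column names are known (the columns and values below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@x.com", None, "c@x.com", "d@x.com"],
    "phone": [None, None, "555-0101", "555-0102"],
})

# Completeness per mandatory column: the share of non-null values.
mandatory = ["customer_id", "email", "phone"]
completeness = df[mandatory].notna().mean()
print(completeness)         # customer_id 1.00, email 0.75, phone 0.50
print(completeness.mean())  # overall completeness: 0.75
```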
Consistency
Data consistency describes the data's uniformity as it moves across applications and networks and when it comes from multiple sources. Consistency also means that copies of the same dataset stored in different locations should match and not conflict. Note that consistent data can still be wrong.
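One way to check this kind of cross-source consistency is to join the copies on a shared key and flag disagreements, as in this sketch (the system names, fields, and data are hypothetical):

```python
import pandas as pd

# The "same" customer records as stored in two hypothetical systems.
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "email": ["a@x.com", "b@x.com", "c@x.com"]})
billing = pd.DataFrame({"customer_id": [1, 2, 3],
                        "email": ["a@x.com", "b@y.com", "c@x.com"]})

# Join on the shared key and flag rows where the two copies disagree.
merged = crm.merge(billing, on="customer_id", suffixes=("_crm", "_billing"))
conflicts = merged[merged["email_crm"] != merged["email_billing"]]
print(conflicts)  # customer 2 has conflicting emails across the systems
```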
Timeliness
Timely data is information that is readily available whenever it's needed. This dimension also covers keeping the data current: data should undergo real-time updates so that it is always current and accessible.
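A simple freshness check illustrates the idea; the 24-hour threshold below is an assumed requirement, not a general rule:

```python
from datetime import datetime, timedelta, timezone

# Assumed freshness requirement: records must be updated within 24 hours.
MAX_AGE = timedelta(hours=24)
now = datetime.now(timezone.utc)

records = [
    {"id": 1, "last_updated": now - timedelta(hours=2)},
    {"id": 2, "last_updated": now - timedelta(days=3)},
]

# Flag records older than the allowed age as stale (untimely).
stale = [r["id"] for r in records if now - r["last_updated"] > MAX_AGE]
print(f"Stale records: {stale}")  # Stale records: [2]
```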
Uniqueness
Uniqueness means that no duplicate or redundant records overlap across the datasets: no record in a dataset exists more than once. Analysts use data cleansing and deduplication to help address a low uniqueness score.
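A minimal sketch of measuring uniqueness and deduplicating, again assuming pandas and illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
})

# Uniqueness score: the share of rows that are not duplicates of an
# earlier row.
uniqueness = 1 - df.duplicated().mean()
print(f"Uniqueness: {uniqueness:.0%}")  # Uniqueness: 75%

# Deduplication: keep only the first occurrence of each record.
deduped = df.drop_duplicates()
```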
Validity
Data must be collected according to the organization’s defined business rules and parameters. The
information should also conform to the correct, accepted formats, and all dataset values should fall
within the proper range.
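As a sketch, validity checks can be expressed as rules applied to each row; the age range and approved country list below are assumed business rules for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, -5, 51, 130],
    "country": ["US", "PK", "XX", "DE"],
})

# Assumed rules: age must fall in [0, 120], country must come from an
# approved list.
valid_countries = {"US", "PK", "DE", "GB"}
valid = df["age"].between(0, 120) & df["country"].isin(valid_countries)
print(f"Validity: {valid.mean():.0%}")  # Validity: 25%
print(df[~valid])                       # the rows that violate a rule
```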
Enhancing data quality is a critical concern, since data sits at the core of all activities within an organization. Poor data quality leads to inaccurate reporting, which in turn produces inaccurate decisions and, ultimately, economic damage.
1. Training Staff
Before thinking about implementing data quality solutions, we must first minimize the data quality problems that result from human activities within the organization, such as data entry. All developers and database administrators must also have a good knowledge of the business process and refer to a unified schema when developing and designing databases and applications.
2. Implementing Data Quality Solutions
The other way to improve data quality is by implementing data quality solutions. A data quality solution is a set of tools or applications that perform quality tasks such as:
Knowledge base creation: a knowledge base is a machine-readable resource for the dissemination of
information.
Data profiling: the process of examining the data available from an existing information source (e.g., a database or a file) and collecting statistics or informative summaries about that data (see the profiling sketch after this list).
Data matching: efforts to compare two sets of collected data, using techniques such as record linkage and entity resolution (see the matching sketch after this list).
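The profiling sketch mentioned above: it assumes a hypothetical customers.csv file and summarizes each column's type, null rate, and distinct-value count with pandas.

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical source file

# Basic profile: per-column type, null rate, and distinct-value count.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_rate": df.isna().mean(),
    "distinct": df.nunique(),
})
print(profile)
print(df.describe())  # min/max/mean/quartiles for numeric columns
```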
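And the matching sketch: a naive form of record linkage that pairs records whose names are sufficiently similar. The 0.8 cutoff and the data are assumptions; real entity resolution typically also blocks on keys and compares multiple fields.

```python
from difflib import SequenceMatcher

# Two sources describing (possibly) the same customers (illustrative).
source_a = [{"id": "A1", "name": "Jon Smith"}, {"id": "A2", "name": "Mary Ali"}]
source_b = [{"id": "B1", "name": "John Smith"}, {"id": "B2", "name": "Omar Khan"}]

# Pair records whose name similarity exceeds an assumed 0.8 threshold.
for a in source_a:
    for b in source_b:
        score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        if score > 0.8:
            print(f'{a["id"]} matches {b["id"]} (similarity {score:.2f})')
```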