Data Quality Lec 3


Data Science BS CS 6th/ADP. Mr. Shakeel Waris

Data Quality In Data Science:


Data saturates the modern world. Data is information, information is knowledge, and knowledge is
power, so data has become a form of contemporary currency, a valued commodity exchanged between
participating parties.

Data helps people and organizations make more informed decisions, significantly increasing the
likelihood of success. By all accounts, that seems to indicate that large amounts of data are a good thing.
However, that’s not always the case. Sometimes data is incomplete, incorrect, redundant, or not
applicable to the user’s needs.

But fortunately, we have the concept of data quality to help make the job easier. So let's explore what data quality is, including its characteristics and best practices, and how we can use it to make data better.

Definition of Data Quality


In simple terms, data quality tells us how reliable a particular set of data is and whether or not it will be
good enough for a user to employ in decision-making. This quality is often measured by degrees.

But what is data quality in practical terms? Data quality measures the condition of data based on factors such as how useful it is for its specific purpose, completeness, accuracy, timeliness (e.g., is it up to date?), consistency, validity, and uniqueness.

Data quality analysts are responsible for conducting data quality assessments, which involve assessing and interpreting every data quality metric. The analyst then creates an aggregate score reflecting the data's overall quality and gives the organization a percentage rating that shows how accurate the data is.
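As a rough sketch of how such an aggregate score might be computed, the Python snippet below combines per-dimension scores into a single percentage. The scores and weights are hypothetical values chosen only for illustration; a real assessment would derive them from the checks described in the next section.

# Hypothetical per-dimension scores between 0.0 and 1.0, produced by earlier checks.
dimension_scores = {
    "accuracy": 0.92, "completeness": 0.88, "consistency": 0.95,
    "timeliness": 0.80, "uniqueness": 0.99, "validity": 0.91,
}

# Illustrative weights reflecting how important each dimension is to the organization.
weights = {
    "accuracy": 0.25, "completeness": 0.20, "consistency": 0.15,
    "timeliness": 0.10, "uniqueness": 0.15, "validity": 0.15,
}

# Weighted average, reported as an overall percentage rating.
overall = sum(dimension_scores[d] * weights[d] for d in dimension_scores)
print(f"Overall data quality: {overall * 100:.1f}%")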

To put the definition in more direct terms, data quality indicates how good the data is and how useful it
is for the task at hand. But the term also refers to planning, implementing, and controlling the activities
that apply the needed quality management practices and techniques required to ensure the data is
actionable and valuable to the data consumers.

Data Quality Dimensions


There are six primary, or core, dimensions to data quality. These are the metrics analysts use to
determine the data’s viability and its usefulness to the people who need it.
Accuracy
The data must conform to actual, real-world scenarios and reflect real-world objects and events.
Analysts should use verifiable sources to confirm the measure of accuracy, which is determined by how closely the values match the verified, correct information sources.
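For example, a minimal accuracy check might compare stored values against a trusted reference source. The records below are made up purely to illustrate the idea.

# Our stored values vs. a verified reference source, keyed by record id.
our_data = {1: "Lahore", 2: "Karachi", 3: "Quetta"}
verified = {1: "Lahore", 2: "Karachi", 3: "Peshawar"}

# Accuracy = share of values that agree with the verified source.
matches = sum(1 for k in verified if our_data.get(k) == verified[k])
accuracy = matches / len(verified)
print(f"Accuracy: {accuracy:.0%}")  # 2 of 3 values match the source -> 67%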

Completeness
Completeness measures whether the data delivers all the mandatory values that should be available.
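A minimal completeness check, using pandas on a made-up table, simply counts how many mandatory cells are actually populated:

import pandas as pd

# Made-up customer records; None marks a missing mandatory value.
df = pd.DataFrame({
    "name": ["Ali", "Sara", None, "Usman"],
    "email": ["ali@x.com", None, "bilal@x.com", "usman@x.com"],
})

# Completeness = share of non-missing cells across the mandatory columns.
completeness = df.notna().mean().mean()
print(f"Completeness: {completeness:.0%}")  # 6 of 8 cells filled -> 75%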

Consistency

Data consistency describes the data’s uniformity as it moves across applications and networks and when
it comes from multiple sources. Consistency also means that the same datasets stored in different
locations should be the same and not conflict. Note that consistent data can still be wrong.
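A minimal consistency check between two copies of the same records stored in different systems (both copies are invented for illustration):

# The same customer emails as stored in two different systems.
crm_copy = {101: "ali@x.com", 102: "sara@x.com", 103: "bilal@x.com"}
warehouse_copy = {101: "ali@x.com", 102: "sara@y.com", 103: "bilal@x.com"}

# Flag any record whose values conflict between the two locations.
conflicts = [k for k in crm_copy if crm_copy[k] != warehouse_copy.get(k)]
print(f"Conflicting records: {conflicts}")  # [102]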

Timeliness

Timely data is information that is readily available whenever it’s needed. This dimension also covers
keeping the data current; data should undergo real-time updates to ensure that it is always available
and accessible.
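A timeliness check often boils down to measuring how old each record is against an allowed maximum age. The timestamps and the 7-day threshold below are assumptions made only for illustration:

from datetime import datetime, timedelta

# Hypothetical last-update timestamps for three records.
last_updated = {
    "record_a": datetime(2024, 5, 1),
    "record_b": datetime(2024, 5, 30),
    "record_c": datetime(2024, 6, 1),
}

now = datetime(2024, 6, 2)       # fixed "current" time for the example
max_age = timedelta(days=7)      # business rule: data older than a week is stale
stale = [k for k, t in last_updated.items() if now - t > max_age]
print(f"Stale records: {stale}")  # ['record_a']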

Uniqueness

Uniqueness means there is no duplicated or redundant information overlapping across the datasets; no record exists multiple times. Analysts use data cleansing and deduplication to help address a low uniqueness score.
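A minimal deduplication sketch using pandas on made-up records, measuring uniqueness and dropping the redundant rows:

import pandas as pd

# Made-up records; the second and fourth rows describe the same customer.
df = pd.DataFrame({
    "name": ["Ali", "Sara", "Bilal", "Sara"],
    "email": ["ali@x.com", "sara@x.com", "bilal@x.com", "sara@x.com"],
})

uniqueness = 1 - df.duplicated().mean()  # share of non-duplicate rows
deduped = df.drop_duplicates()           # remove the redundant copies
print(f"Uniqueness: {uniqueness:.0%}")   # 3 of 4 rows are unique -> 75%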

Validity
Data must be collected according to the organization’s defined business rules and parameters. The
information should also conform to the correct, accepted formats, and all dataset values should fall
within the proper range.
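A minimal validity sketch, checking that values match an accepted format and fall within the proper range. The email pattern and the 0-120 age range are invented business rules used only for illustration:

import re

# Made-up records with an email and an age field.
records = [
    {"email": "ali@x.com", "age": 29},
    {"email": "not-an-email", "age": 41},
    {"email": "sara@x.com", "age": 230},  # outside the accepted range
]

# Example business rules: a simple email pattern and an age range of 0-120.
def is_valid(r):
    email_ok = re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", r["email"]) is not None
    age_ok = 0 <= r["age"] <= 120
    return email_ok and age_ok

validity = sum(is_valid(r) for r in records) / len(records)
print(f"Validity: {validity:.0%}")  # 1 of 3 records passes -> 33%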

Why Is Data Quality Important?

Enhancing data quality is a critical concern because data is considered the core of all activities within organizations. Poor data quality leads to inaccurate reporting, which results in inaccurate decisions and, ultimately, economic damage.

How to Improve Data Quality?


Data quality improvement is achieved by:

1. Training Staff

2. Implementing data quality solutions

Training Staff

Before thinking about implementing data quality solutions, we must first minimize the data quality problems that result from in-organization human activities such as data entry. All developers and database administrators must also have good knowledge of the business process and must refer to a unified schema when developing and designing databases and applications.

Implementing data quality solutions

The other way to improve data quality is by implementing data quality solutions. A data quality solution is a set of tools or applications that perform quality tasks such as:

Knowledge base creation: a knowledge base is a machine-readable resource for the dissemination of
information.

Data de-duplication: Remove duplicated information based on a set of semantic rules.

Data cleansing: Removing unwanted characters and symbols from values.
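A minimal cleansing sketch that strips unwanted characters and symbols from values (the raw strings are made-up examples):

import re

raw_values = ["  Ali##", "Sara\t", "**Bilal**", "Usman  "]

# Keep letters, digits, and spaces; drop everything else, then trim whitespace.
cleaned = [re.sub(r"[^A-Za-z0-9 ]", "", v).strip() for v in raw_values]
print(cleaned)  # ['Ali', 'Sara', 'Bilal', 'Usman']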

Data profiling: Examining the data available from an existing information source (e.g., a database or a file) and collecting statistics or informative summaries about that data.
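A minimal profiling sketch using pandas, collecting summary statistics for a made-up table:

import pandas as pd

# Made-up sales records to profile.
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "amount": [120.0, 85.5, None, 240.0],
})

print(df.describe(include="all"))   # counts, uniques, mean, min/max, ...
print(df.isna().mean())             # share of missing values per column
print(df["region"].value_counts())  # frequency of each category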

Data matching: Data matching describes efforts to compare two sets of collected data using
technologies such as Record Linkage and Entity resolution.
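A minimal matching sketch using the standard-library difflib to link records from two sources by name similarity. The names and the 0.8 threshold are illustrative assumptions; dedicated record linkage and entity resolution tools use far richer rules.

from difflib import SequenceMatcher

source_a = ["Muhammad Ali", "Sara Khan", "Bilal Ahmed"]
source_b = ["M. Ali", "Sarah Khan", "Usman Tariq"]

def similarity(x, y):
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

# Link each record in source A to its best candidate in source B,
# accepting the pair only above a similarity threshold.
for a in source_a:
    best = max(source_b, key=lambda b: similarity(a, b))
    if similarity(a, best) >= 0.8:
        print(f"Matched: {a!r} <-> {best!r}")  # e.g. 'Sara Khan' <-> 'Sarah Khan'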
