Classification


Data Mining: Data mining, in general terms, means digging deep into data that exists in different forms to discover patterns and to gain knowledge from those patterns. In the process of data mining, large data sets are first sorted, then patterns are identified and relationships are established to perform data analysis and solve problems.

Classification: It is a data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying which of a set of categories (subpopulations) a new observation belongs to, on the basis of a training set of data containing observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case, a classifier is required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the project and approving it further. It is a two-step process (a minimal code sketch follows the list):
1. Learning Step (Training Phase): Construction of the classification model.
Different algorithms are used to build a classifier by making the model learn from the available training set. The model has to be trained for the prediction of accurate results.
2. Classification Step: The model is used to predict class labels for new data, and the constructed model is tested on test data to estimate the accuracy of the classification rules.
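The following is a minimal sketch of these two steps, assuming the scikit-learn library is available (`pip install scikit-learn`); the loan-applicant features and labels are made-up toy values, not a real dataset.

```python
# Hypothetical two-step sketch: learning (fit), then classification (predict/score).
from sklearn.tree import DecisionTreeClassifier

# Training set: each applicant is [income in thousands, years employed],
# labelled 'Safe' or 'Risky' (toy values for illustration).
X_train = [[25, 1], [40, 3], [60, 8], [80, 10], [30, 2], [90, 12]]
y_train = ["Risky", "Risky", "Safe", "Safe", "Risky", "Safe"]

# Step 1 - Learning: build the classifier from the training set.
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 - Classification: predict class labels for unseen applicants and
# estimate the accuracy of the learned rules on held-out test data.
X_test, y_test = [[35, 2], [70, 9]], ["Risky", "Safe"]
print(clf.predict(X_test))        # predicted labels, e.g. ['Risky' 'Safe']
print(clf.score(X_test, y_test))  # fraction of test labels predicted correctly
```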
Training and Testing:
Suppose a person is sitting under a fan and the fan starts falling on him; he should move aside in order not to get hurt. Learning to move away is his training part. During testing, if the person sees any heavy object coming towards him or falling on him and moves aside, the system is tested positively; if the person does not move aside, the system is tested negatively.
The same is the case with data: it should be trained in order to get accurate results. There are certain data types associated with data mining that tell us the format of the data (whether it is in text format or in numerical format).

How Does Classification Work?


With the help of the bank loan application discussed above, let us understand the working of classification. The data classification process includes two steps:
• Building the Classifier or Model
• Using Classifier for Classification

Building the Classifier or Model


• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database tuples and their associated
class labels.
• Each tuple that constitutes the training set belongs to a predefined class, as indicated by its class label. These tuples can also be referred to as samples, objects, or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate the
accuracy of classification rules. The classification rules can be applied to the new data tuples if the
accuracy is considered acceptable.

Classifiers can be categorized into two major types:

1. Discriminative: It is a basic kind of classifier that determines just one class for each row of data. It models the decision directly from the observed data, so it depends heavily on the quality of the data rather than on assumed distributions.
Example: Logistic Regression
2. Generative: It models the distribution of the individual classes and tries to learn the model that generates the data behind the scenes by estimating the assumptions and distributions of the model. It can then be used to predict unseen data.
Example: Naive Bayes Classifier
Detecting spam emails by looking at previous data. Suppose there are 100 emails, divided 1:4 into Class A (25%, spam emails) and Class B (75%, non-spam emails). A user now wants to check whether an email containing the word "cheap" should be classified as spam.
In Class A (the 25% of the data that is spam), 20 out of 25 emails contain the word "cheap"; in Class B (the 75% that is not spam), 5 out of 75 emails contain it and the remaining 70 do not.
So, if an email contains the word "cheap", what is the probability of it being spam? (= 80%; a worked computation follows)
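The spam example can be worked through directly with Bayes' theorem; this plain-Python sketch uses only the counts stated above.

```python
# P(spam | "cheap") via Bayes' theorem, using the counts from the text.
p_spam = 25 / 100                  # Class A: 25% of emails are spam
p_not_spam = 75 / 100              # Class B: 75% are not spam
p_cheap_given_spam = 20 / 25       # 20 of the 25 spam emails contain "cheap"
p_cheap_given_not_spam = 5 / 75    # 5 of the 75 non-spam emails contain "cheap"

# Total probability of seeing the word "cheap" in any email.
p_cheap = p_cheap_given_spam * p_spam + p_cheap_given_not_spam * p_not_spam

# Bayes' theorem: P(spam | cheap) = P(cheap | spam) * P(spam) / P(cheap)
print(p_cheap_given_spam * p_spam / p_cheap)  # 0.8, i.e. 80%
```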
Classifiers Of Machine Learning:
1. Decision Trees
2. Bayesian Classifiers
3. Neural Networks
4. K-Nearest Neighbour
5. Support Vector Machines
6. Linear Regression
7. Logistic Regression
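Each classifier family listed above has a scikit-learn implementation (assumed installed); the following hedged sketch compares several of them on a synthetic dataset. Linear regression is omitted because it predicts numeric values rather than class labels; a regression example appears in the prediction section below.

```python
# Sketch: fit several classifier families on the same synthetic data
# and compare their test-set accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Bayesian (Naive Bayes)": GaussianNB(),
    "Neural Network": MLPClassifier(max_iter=2000),
    "K-Nearest Neighbour": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
    "Logistic Regression": LogisticRegression(),
}
for name, model in models.items():
    accuracy = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {accuracy:.2f}")
```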

Classification and Prediction in Data Mining


There are two forms of data analysis that can be used to extract models describing important classes
or predict future data trends. These two forms are as follows:
1. Classification
2. Prediction
We use classification and prediction to extract a model representing the data classes or to predict future data trends. Classification predicts the categorical labels of data; prediction models predict continuous-valued functions. This analysis provides us with a good understanding of the data at a large scale.
Classification models predict categorical class labels, and prediction models predict continuous-
valued functions. For example, we can build a classification model to categorize bank loan
applications as either safe or risky or a prediction model to predict the expenditures in dollars of
potential customers on computer equipment given their income and occupation.
What is Classification?
Classification identifies the category or the class label of a new observation. First, a set of data is
used as training data. The set of input data and the corresponding outputs are given to the algorithm.
So, the training data set includes the input data and their associated class labels. Using the training
dataset, the algorithm derives a model or the classifier. The derived model can be a decision tree,
mathematical formula, or a neural network. In classification, when unlabeled data is given to the
model, it should find the class to which it belongs. The new data provided to the model is the test
data set.
Classification is the process of classifying a record. One simple example of classification is to check whether it is raining or not. The answer can either be yes or no, so there is a limited number of choices. Sometimes there can be more than two classes to classify into; that is called multiclass classification (a minimal sketch follows).
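A minimal multiclass sketch, assuming scikit-learn: the classic Iris dataset has three class labels rather than a yes/no answer.

```python
# Multiclass classification: three species labels instead of two.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)            # 3 classes: 0, 1, 2
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))                    # one of the three species per sample
```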
The bank needs to analyze whether giving a loan to a particular customer is risky or not. For
example, based on observable data for multiple loan borrowers, a classification model may be
established that forecasts credit risk. The data could track job records, homeownership or leasing,
years of residency, number and type of deposits, historical credit ranking, etc. The goal would be credit
ranking, the predictors would be the other characteristics, and the data would represent a case for
each consumer. In this example, a model is constructed to find the categorical label. The labels are
risky or safe.

How Does Classification Work?


The functioning of classification, with the assistance of the bank loan application, has been mentioned above. There are two stages in the data classification system: creating the classifier or model, and applying the classifier for classification.
1. Developing the Classifier or model creation: This stage is the learning stage or the learning process. The classification algorithms construct the classifier in this stage. A classifier is constructed from a training set composed of database records and their corresponding class names. Each record that makes up the training set belongs to a category or class. We may also refer to these records as samples, objects, or data points.
2. Applying the classifier for classification: The classifier is used for classification at this stage. The test data are used here to estimate the accuracy of the classification rules. If the accuracy is deemed sufficient, the classification rules can be applied to new data records. Applications include:
• Sentiment Analysis: Sentiment analysis is highly helpful in social media monitoring. We can use it to extract social media insights. With advanced machine learning algorithms, we can build sentiment analysis models that read and analyze even misspelled words. Well-trained models provide consistently accurate outcomes in a fraction of the time.
• Document Classification: We can use document classification to organize documents into sections according to their content. Document classification is a form of text classification applied to the words of an entire document, and with the help of machine learning classification algorithms, we can execute it automatically.
• Image Classification: Image classification assigns an image to one of a set of trained categories. These could be the caption of the image, a statistical value, or a theme. You can tag images to train your model on the relevant categories by applying supervised learning algorithms.
• Machine Learning Classification: It uses statistically demonstrable algorithmic rules to execute analytical tasks that would take humans hundreds of hours to perform.
3. Data Classification Process: The data classification process can be categorized into five steps:
• Define the goals, strategy, workflows, and architecture of data classification.
• Identify the confidential details that we store.
• Label the data by tagging it.
• Use the outcomes to improve security and compliance.
• Data is dynamic, and classification is an ongoing process.

What is Data Classification Lifecycle?


The data classification life cycle produces an excellent structure for controlling the flow of data in an enterprise. Businesses need to account for data security and compliance at each level. With the help of data classification, we can do this at every stage, from origin to deletion. The data life-cycle has the following stages:

1. Origin: Sensitive data is produced in various formats, such as emails, Excel, Word, Google documents, social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it based on in-house protection policies and compliance rules.
3. Storage: The collected data is stored, with access controls and encryption applied.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers across various devices and platforms.
5. Archive: Data is eventually archived within an industry's storage systems.
6. Publication: Through publication, data can reach customers, who can then view and download it in the form of dashboards.

What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. As in classification, the training dataset contains the inputs and the corresponding numerical output values. The algorithm derives the model, or predictor, from the training dataset. The model should find a numerical output when new data is given. Unlike classification, this method does not have a class label; the model predicts a continuous-valued function or ordered value.
Regression is generally used for prediction. Predicting the value of a house based on facts such as the number of rooms, the total area, etc., is an example of prediction (a minimal sketch follows).
For example, suppose a marketing manager needs to predict how much a particular customer will spend at his company during a sale. In this case, we are asked to forecast a numerical value, so this data analysis task is an example of numeric prediction: a model or predictor is constructed that forecasts a continuous-valued or ordered function.
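A minimal sketch of numeric prediction with regression, assuming scikit-learn; the house features and prices are made-up toy values.

```python
# Regression: predict a continuous value (price) rather than a class label.
from sklearn.linear_model import LinearRegression

# Each house: [number of rooms, total area in square feet] (toy values).
X_train = [[2, 850], [3, 1200], [4, 1800], [5, 2400]]
y_train = [150_000, 210_000, 320_000, 410_000]  # sale prices in dollars

predictor = LinearRegression().fit(X_train, y_train)

# Predict a continuous-valued output for a new, unseen house.
print(predictor.predict([[3, 1400]]))
```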

Classification and Prediction Issues


The major issue is preparing the data for classification and prediction. Preparing the data involves the following activities:
1. Data Cleaning: Data cleaning involves removing the noise and treating missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis
is used to know whether any two given attributes are related.
3. Data Transformation and Reduction: The data can be transformed by any of the following methods (a normalization code sketch follows the note below).
• Normalization: The data is transformed using normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range. Normalization is used when neural networks or methods involving distance measurements are used in the learning step.
• Generalization: The data can also be transformed by generalizing it to a higher-level concept. For this purpose, we can use concept hierarchies.

NOTE: Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.
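A minimal normalization sketch, assuming scikit-learn; the income values are illustrative only.

```python
# Min-max normalization: scale an attribute into a small specified range.
from sklearn.preprocessing import MinMaxScaler

incomes = [[12_000], [35_000], [58_000], [99_000]]
scaler = MinMaxScaler(feature_range=(0.0, 1.0))
print(scaler.fit_transform(incomes))  # each value rescaled into [0, 1]
```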

Comparison of Classification and Prediction Methods


Here are the criteria for comparing the methods of classification and prediction:
• Accuracy: The accuracy of the classifier can be referred to as the ability of the classifier to
predict the class label correctly, and the accuracy of the predictor can be referred to as how
well a given predictor can estimate the unknown value.
• Speed: The speed of the method depends on the computational cost of generating and using
the classifier or predictor.
• Robustness: Robustness is the ability of the classifier or predictor to make correct predictions or classifications given noisy data or data with missing values.
• Scalability: Scalability refers to the ability to construct the classifier or predictor efficiently as the amount of given data grows.
• Interpretability: Interpretability is how readily we can understand the reasoning behind
predictions or classification made by the predictor or classifier.

Difference between Classification and Prediction


The decision tree, applied to existing data, is a classification model. We can get a class prediction
by applying it to new data for which the class is unknown. The assumption is that the new data
comes from a distribution similar to the data we used to construct our decision tree. In many
instances, this is a correct assumption, so we can use the decision tree to build a predictive model.
Classification or prediction is the process of finding a model that describes the classes or concepts of information. The purpose is to use this model to predict the class of objects whose class label is unknown. Below are some major differences between classification and prediction.

| Classification | Prediction |
| --- | --- |
| Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. | Prediction is the process of identifying the missing or unavailable numerical data for a new observation. |
| In classification, the accuracy depends on finding the class label correctly. | In prediction, the accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data. |
| In classification, the model can be known as the classifier. | In prediction, the model can be known as the predictor. |
| A model or the classifier is constructed to find the categorical labels. | A model or a predictor is constructed that predicts a continuous-valued function or ordered value. |
| For example, the grouping of patients based on their medical records can be considered a classification. | For example, we can think of prediction as predicting the correct treatment for a particular disease for a person. |

Classification of Data Mining Systems


Data mining refers to the process of extracting important information from raw data. It analyses the data patterns in huge sets of data with the help of software tools. Ever since its development, data mining has been adopted by researchers in the research and development field.
With data mining, businesses can gain more profit. It has not only helped in understanding customer demand but also in developing effective strategies to improve overall business turnover. It has helped in determining business objectives and making clear decisions.
Data collection, data warehousing, and computer processing are some of the strongest pillars of
data mining. Data mining utilizes the concept of mathematical algorithms to segment the data and
assess the possibility of occurrence of future events.
To understand the system and meet the desired requirements, data mining can be classified into the
following systems:
• Classification based on the mined Databases
• Classification based on the type of mined knowledge
• Classification based on statistics
• Classification based on Machine Learning
• Classification based on visualization
• Classification based on Information Science
• Classification based on utilized techniques
• Classification based on adapted applications

Classification Based on the mined Databases


A data mining system can be classified based on the types of databases that have been mined. A
database system can be further segmented based on distinct principles, such as data models, types of
data, etc., which further assist in classifying a data mining system.
For example, if we want to classify a database based on the data model, we need to select either
relational, transactional, object-relational or data warehouse mining systems.

Classification Based on the type of Knowledge Mined


A data mining system categorized based on the kind of knowledge mined may have the following
functionalities:
1. Characterization
2. Discrimination
3. Association and Correlation Analysis
4. Classification
5. Prediction
6. Outlier Analysis
7. Evolution Analysis

Classification Based on the Techniques Utilized


A data mining system can also be classified based on the type of techniques that are being incorporated. These techniques can be assessed based on the degree of user interaction involved or the methods of analysis employed.

Classification Based on the Applications Adapted


Data mining systems classified based on the applications adapted are as follows:
1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail

Examples of Classification Task


The following are some of the main examples of classification tasks:
• Classification helps in determining tumor cells as benign or malignant.
• Classification of credit card transactions as fraudulent or legitimate.
• Classification of secondary structures of protein as alpha-helix, beta-sheet, or random coil.
• Classification of news stories into distinct categories such as finance, weather,
entertainment, sports, etc.
Integration schemes of Database and Data warehouse systems

No Coupling
In the no-coupling scheme, the data mining system does not use any database or data warehouse system functions.
Loose Coupling
In loose coupling, data mining utilizes some of the database or data warehouse system
functionalities. It mainly fetches the data from the data repository managed by these systems and
then performs data mining. The results are kept either in the file or any designated place in the
database or data warehouse.
Semi-Tight Coupling
In semi-tight coupling, data mining is linked to either the DB or DW system and provides an
efficient implementation of data mining primitives within the database.

Tight coupling
A data mining system can be effortlessly combined with a database or data warehouse system
in tight coupling.

Data Mining Bayesian Classifiers


In numerous applications, the connection between the attribute set and the class variable is non-deterministic. In other words, the class label of a test record cannot be assumed with certainty even though its attribute set is the same as that of some of the training examples. These circumstances may emerge due to noisy data or the presence of certain confounding factors that influence classification but are not included in the analysis. For example, consider the task of predicting whether an individual is at risk of liver illness based on the individual's eating habits and working efficiency. Although most people who eat healthily and exercise consistently have a lower probability of liver disease, they may still develop it due to other factors, such as the consumption of high-calorie street foods or alcohol abuse. Determining whether an individual's eating routine is healthy or their workout efficiency is sufficient is also subject to interpretation, which in turn may introduce uncertainties into the learning problem.
Bayesian classification uses Bayes' theorem to predict the occurrence of any event. Bayesian classifiers are statistical classifiers based on the Bayesian understanding of probability. The theorem expresses how a level of belief, expressed as a probability, should change to account for evidence.
Bayes' theorem is named after Thomas Bayes, who first utilized conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter.
Bayes' theorem is expressed mathematically by the following equation:

P(X|Y) = P(Y|X) · P(X) / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X|Y) is a conditional probability: the probability of event X occurring given that Y is true.
P(Y|X) is a conditional probability: the probability of event Y occurring given that X is true.
P(X) and P(Y) are the probabilities of observing X and Y independently of each other; each is known as a marginal probability.
Bayesian interpretation:
In the Bayesian interpretation, probability measures a "degree of belief." Bayes' theorem connects the degree of belief in a hypothesis before and after accounting for evidence. For example, let us consider a coin. If we toss a coin, we get either heads or tails, and the probability of each outcome is 50%. If the coin is flipped a number of times and the outcomes are observed, the degree of belief may rise, fall, or remain the same depending on the outcomes.
For proposition X and evidence Y,
• P(X), the prior, is the initial degree of belief in X.
• P(X|Y), the posterior, is the degree of belief having accounted for Y.
• The quotient P(Y|X) / P(Y) represents the support Y provides for X.


Bayes' theorem can be derived from the definition of conditional probability:

P(X|Y) = P(X⋂Y) / P(Y) and P(Y|X) = P(X⋂Y) / P(X),

where P(X⋂Y) is the joint probability of both X and Y being true. Equating the two expressions for P(X⋂Y) gives P(X|Y) · P(Y) = P(Y|X) · P(X), and dividing both sides by P(Y) yields Bayes' theorem.
Bayesian network:
A Bayesian network is a Probabilistic Graphical Model (PGM) that is used to compute uncertainties using the concept of probability. Also known as Belief Networks, Bayesian networks model uncertainties using Directed Acyclic Graphs (DAGs).
A Directed Acyclic Graph is used to represent a Bayesian network and, like any other statistical graph, a DAG consists of a set of nodes and links, where the links signify the connections between the nodes.
The nodes represent random variables, and the edges define the relationships between these variables.
A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of each variable in the network; a minimal sketch follows.
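A minimal sketch of a two-node Bayesian network in plain Python; the DAG (Rain → WetGrass) and the CPT numbers are illustrative assumptions, not from the text.

```python
# DAG: Rain -> WetGrass. Each node carries a (conditional) probability table.
p_rain = {True: 0.2, False: 0.8}            # prior P(Rain)
p_wet_given_rain = {True: 0.9, False: 0.1}  # CPT: P(WetGrass=True | Rain)

def joint(rain: bool, wet: bool) -> float:
    """Chain rule on the DAG: P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain)."""
    p_wet = p_wet_given_rain[rain]
    return p_rain[rain] * (p_wet if wet else 1.0 - p_wet)

# Inference by enumeration: P(Rain=True | WetGrass=True)
numerator = joint(True, True)
evidence = joint(True, True) + joint(False, True)
print(numerator / evidence)  # ~0.69: seeing wet grass raises belief in rain
```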
