Classification
Data Mining: Data mining, in general terms, means mining or digging deep into data in its different forms to gain patterns, and to gain knowledge about those patterns. In the process of data mining, large data sets are first sorted, then patterns are identified, and relationships are established to perform data analysis and solve problems.
Classification: It is a data analysis task, i.e., the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying which of a set of categories (subpopulations) a new observation belongs to, on the basis of a training set of data containing observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case, a classifier is required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the project and further approving it. Classification is a two-step process:
1. Learning Step (Training Phase): Construction of the classification model. Different algorithms are used to build a classifier by making the model learn from the available training set. The model has to be trained to make accurate predictions.
2. Classification Step: The constructed model is used to predict class labels for test data, and the accuracy of the classification rules is estimated from these predictions.
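As a minimal sketch of these two steps, assuming a small, invented project-feasibility dataset (the features budget, team size, and duration, and all values, are hypothetical), one could train and test a decision tree with scikit-learn:

```python
# A minimal sketch of the two-step process on a hypothetical project
# dataset with invented features and 'Safe'/'Risky' class labels.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical features: [budget, team_size, duration_in_months]
X = [[10, 5, 6], [50, 20, 24], [8, 3, 4], [60, 25, 30],
     [12, 6, 8], [55, 22, 26], [9, 4, 5], [65, 30, 36]]
y = ['Safe', 'Risky', 'Safe', 'Risky', 'Safe', 'Risky', 'Safe', 'Risky']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Learning step: construct the classification model from the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Classification step: predict class labels for the test data and
# estimate the accuracy of the learned classification rules
y_pred = model.predict(X_test)
print('Predicted labels:', list(y_pred))
print('Estimated accuracy:', accuracy_score(y_test, y_pred))
```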
Training and Testing:
Suppose a person is sitting under a fan and the fan starts falling on him; he should move aside so as not to get hurt. Learning to move away is the training part. During testing, if the person sees any heavy object coming towards him or falling on him and moves aside, then the system has tested positively; if the person does not move aside, then the system has tested negatively.
The same is the case with data: it must be trained in order to get accurate and reliable results.
Classifiers can be divided into two broad categories, depending on how they model the data:
1. Discriminative: It is a very basic classifier that determines just one class for each row of data. It models the decision directly from the observed data, so it depends heavily on the quality of the data rather than on its underlying distributions.
Example: Logistic Regression
2. Generative: It models the distribution of the individual classes and tries to learn the model that generates the data behind the scenes by estimating the assumptions and distributions of the model. It is then used to predict unseen data.
Example: Naive Bayes Classifier
Consider detecting spam emails by looking at previous data. Suppose there are 100 emails, split so that Class A contains 25% of them (spam emails) and Class B contains 75% (non-spam emails). A user wants to check whether an email containing the word ‘cheap’ should be classified as spam.
In Class A (the 25 spam emails), 20 out of 25 contain the word ‘cheap’; in Class B (the 75 non-spam emails), 70 out of 75 do not contain it, so 5 do.
So, if an email contains the word ‘cheap’, what is the probability of it being spam? By Bayes’ theorem it is (0.25 × 20/25) / (0.25 × 20/25 + 0.75 × 5/75) = 0.20 / 0.25 = 80%.
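The 80% figure can be checked directly with Bayes’ theorem; this short sketch simply plugs in the counts from the example above:

```python
# Verifying the 80% figure with Bayes' theorem, using the counts above.
p_spam = 0.25                    # Class A: 25 of 100 emails are spam
p_not_spam = 0.75                # Class B: 75 of 100 emails are not spam
p_cheap_given_spam = 20 / 25     # 20 of the 25 spam emails contain "cheap"
p_cheap_given_not_spam = 5 / 75  # 5 of the 75 non-spam emails contain "cheap"

# P(spam | cheap) = P(cheap | spam) P(spam) / P(cheap)
p_cheap = p_cheap_given_spam * p_spam + p_cheap_given_not_spam * p_not_spam
p_spam_given_cheap = p_cheap_given_spam * p_spam / p_cheap
print(p_spam_given_cheap)  # 0.8, i.e. 80%
```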
Classifiers in Machine Learning:
1. Decision Trees
2. Bayesian Classifiers
3. Neural Networks
4. K-Nearest Neighbour
5. Support Vector Machines
6. Linear Regression
7. Logistic Regression
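As a quick illustration of one classifier from this list (Logistic Regression, the discriminative example given earlier), here is a minimal scikit-learn sketch on the same kind of hypothetical ‘Safe’/‘Risky’ project data:

```python
# A minimal sketch of a discriminative classifier (Logistic Regression)
# on hypothetical project data; all features and labels are invented.
from sklearn.linear_model import LogisticRegression

X = [[10, 5, 6], [50, 20, 24], [8, 3, 4], [60, 25, 30]]
y = ['Safe', 'Risky', 'Safe', 'Risky']

clf = LogisticRegression().fit(X, y)
print(clf.predict([[15, 7, 9]]))        # predicted class label
print(clf.predict_proba([[15, 7, 9]]))  # class membership probabilities
```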
Sensitive data typically passes through the following life-cycle stages:
1. Origin: Sensitive data is produced in various formats, such as emails, Excel, Word, Google documents, social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it according to in-house protection policies and agreement rules.
3. Storage: The obtained data is stored with access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers across various devices and platforms.
5. Archive: Data is eventually archived within an industry's storage systems.
6. Publication: Through publication, data can reach customers, who can then view and download it in the form of dashboards.
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. As in classification, the training dataset contains the inputs and their corresponding numerical output values. The algorithm derives a model, or predictor, from the training dataset, and the model should produce a numerical output when new data is given. Unlike classification, this method does not have a class label; the model predicts a continuous-valued function or ordered value.
Regression is generally used for prediction. Predicting the value of a house from facts such as the number of rooms, the total area, etc., is an example of prediction.
For example, suppose a marketing manager needs to predict how much a particular customer will spend at his company during a sale. In this case, we want to forecast a numerical value, so the data-analysis task is an example of numeric prediction: a model or predictor is developed that forecasts a continuous-valued or ordered function.
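A minimal regression sketch for this kind of numeric prediction, assuming invented house data (number of rooms and total area as inputs, price as the continuous output):

```python
# A minimal sketch of numeric prediction with linear regression on
# hypothetical house data: [number_of_rooms, total_area] -> price.
from sklearn.linear_model import LinearRegression

X = [[2, 800], [3, 1200], [3, 1500], [4, 2000], [5, 2600]]
y = [100000, 160000, 190000, 250000, 320000]  # hypothetical prices

reg = LinearRegression().fit(X, y)

# Predict a continuous-valued output for a new, unseen house
print(reg.predict([[4, 1800]]))
```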
NOTE: Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.
Classification vs. Prediction:
1. Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. Prediction is the process of identifying the missing or unavailable numerical value for a new observation.
2. In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy depends on how well a given predictor can guess the value of the predicted attribute for new data.
3. In classification, the model is known as the classifier. In prediction, the model is known as the predictor.
4. In classification, a model or classifier is constructed to find categorical labels. In prediction, a model or predictor is constructed that predicts a continuous-valued function or ordered value.
5. For example, grouping patients based on their medical records can be considered classification, while predicting the correct treatment for a particular disease for a person can be considered prediction.
No Coupling
In the no-coupling scheme, the data mining system does not use any database or data warehouse system functions.
Loose Coupling
In loose coupling, data mining utilizes some of the database or data warehouse system functionalities. It mainly fetches the data from the data repository managed by these systems, then performs data mining on it. The results are kept either in a file or in a designated place in the database or data warehouse.
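A minimal sketch of loose coupling, assuming a hypothetical SQLite file warehouse.db with a customers table: the mining system only fetches data from the repository, mines it outside the database, and stores the result back in a designated table:

```python
# A loose-coupling sketch: the database is used only to fetch data and
# to store results; the mining itself happens outside the DB system.
# The file name 'warehouse.db' and the table schemas are hypothetical.
import sqlite3
from sklearn.cluster import KMeans

conn = sqlite3.connect('warehouse.db')  # data repository managed by the DB
rows = conn.execute('SELECT age, income FROM customers').fetchall()

# Perform data mining (here, clustering) outside the database system
labels = KMeans(n_clusters=3, n_init=10).fit_predict(rows)

# Keep the results in a designated place in the database
conn.execute('CREATE TABLE IF NOT EXISTS segments '
             '(age REAL, income REAL, segment INTEGER)')
conn.executemany('INSERT INTO segments VALUES (?, ?, ?)',
                 [(a, i, int(s)) for (a, i), s in zip(rows, labels)])
conn.commit()
conn.close()
```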
Semi-Tight Coupling
In semi-tight coupling, the data mining system is linked to a database or data warehouse system and, in addition, efficient implementations of a few data mining primitives are provided within the database.
Tight Coupling
In tight coupling, the data mining system is smoothly integrated into the database or data warehouse system.
Bayesian Network:
A Bayesian Network falls under the category of Probabilistic Graphical Models (PGMs), a technique used to compute uncertainties using the concept of probability. Also known as Belief Networks, Bayesian Networks model uncertainties using Directed Acyclic Graphs (DAGs).
A Bayesian Network is represented by a Directed Acyclic Graph; like any other statistical graph, a DAG consists of a set of nodes and links, where the links signify the connections between the nodes.
The nodes represent random variables, and the edges define the relationships between these variables.
A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of each variable in the network.
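To make the CPT idea concrete, here is a minimal, library-free sketch of a two-node network Spam → ContainsCheap, reusing the probabilities from the spam example earlier in this section (dedicated packages such as pgmpy support full DAGs):

```python
# A two-node Bayesian network: Spam -> ContainsCheap.
# The probabilities are taken from the spam example in this section.

# Prior distribution P(Spam)
p_spam = {True: 0.25, False: 0.75}

# CPT for P(ContainsCheap = True | Spam)
p_cheap_given_spam = {True: 20 / 25, False: 5 / 75}

# Inference by enumeration: P(Spam = True | ContainsCheap = True)
joint = {s: p_spam[s] * p_cheap_given_spam[s] for s in (True, False)}
posterior = joint[True] / (joint[True] + joint[False])
print(posterior)  # 0.8, matching the 80% computed earlier
```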