Practical # 11
Classification example
Suppose you are a product manager and you want to classify customer reviews into positive and
negative classes. Or, as a loan manager, you want to identify which loan applicants are safe
and which are risky. As a healthcare analyst, you want to predict which patients are likely to
suffer from diabetes. All of these examples pose the same kind of problem: classifying reviews,
loan applicants, and patients.
Naïve Bayes
Naive Bayes is one of the most straightforward and fastest classification algorithms, and it
handles large volumes of data well.
The Naive Bayes classifier is used successfully in various applications such as spam filtering,
text classification, sentiment analysis, and recommender systems. It uses Bayes' theorem of
probability to predict the class of unknown instances.
Classification Workflow
Whenever you perform classification, the first step is to understand the problem and identify
potential features and the label. Features are the characteristics or attributes that affect the
value of the label.
For example, in the case of loan distribution, a bank manager identifies the customer's
occupation, income, age, location, previous loan history, transaction history, and credit score.
These characteristics are known as features and help the model classify customers.
Classification phases
Classification has two phases: a learning phase and an evaluation phase.
In the learning phase, the classifier trains its model on a given dataset; in the evaluation
phase, the classifier's performance is tested. Performance is evaluated on the basis of various
parameters such as accuracy, error, precision, and recall.
Bayes' Theorem
P(h): the probability of hypothesis h being true (regardless of the data). This is known as
the prior probability of h.
P(D): the probability of the data (regardless of the hypothesis). This is known as the
evidence, or the prior probability of D.
P(h|D): the probability of hypothesis h given the data D. This is known as the posterior
probability.
P(D|h): the probability of the data D given that hypothesis h is true. This is known as the
likelihood.
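Bayes' theorem combines these quantities, letting the classifier compute the posterior
probability of each hypothesis (class) from its prior and the likelihood of the data:

P(h|D) = P(D|h) × P(h) / P(D)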
Importing libraries and loading dataset
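A minimal sketch of this step, assuming the Iris dataset from Scikit-learn (its sepal and petal
measurements match the feature names used later in this practical):

# Import the libraries used throughout this practical and load the Iris dataset.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X = iris.data    # features: sepal length, sepal width, petal length, petal width
y = iris.target  # class labels: 0, 1, 2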
1. Data Splitting: The data is split into training and testing sets using the
train_test_split function from Scikit-learn's model_selection module. The
parameters test_size=0.2 and random_state=0 are specified, which means that 20%
of the data is set aside for testing the model, and the split is reproducible (the same split
will occur each time the code is run due to the fixed random state).
2. Data Preparation: The training data (X_train) is then used to create a pandas
DataFrame for easy manipulation or inspection. The target variable (y_train), which
contains the class labels, is added to this DataFrame. This DataFrame isn’t actually used
in the model training but is typically useful for data exploration or checking that the data
has been loaded correctly.
3. Model Training:
o A Gaussian Naive Bayes model instance is created.
o The model is then fitted to the training data using the fit method, which takes
X_train (features) and y_train (target labels) as inputs. This step is where the
actual training happens. The Gaussian Naive Bayes algorithm estimates the
parameters of a Gaussian distribution for each class of the target variable.
o The fit method adjusts the model parameters (mean and variance for each
feature within each class) based on the provided data, effectively "learning" from
the data.
4. Model Output: After training, the model can be used to make predictions on new
data or to evaluate its performance on the test set. The print(model) statement displays
only the model's configuration, showing the parameters with which the model was
initialized. (Steps 1–4 are sketched in the code after this list.)
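A minimal sketch of steps 1–4, continuing from the loading code above; the variable names
(X_train, y_train, model) match those used in the descriptions:

# Step 1: split the data, keeping 20% aside for testing; random_state=0
# makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: assemble the training data into a DataFrame for inspection only;
# it is not used to fit the model.
train_df = pd.DataFrame(X_train, columns=iris.feature_names)
train_df["target"] = y_train

# Step 3: create a Gaussian Naive Bayes instance and fit it to the training
# data; fit estimates a per-class mean and variance for each feature.
model = GaussianNB()
model.fit(X_train, y_train)

# Step 4: printing the model shows only the parameters it was initialized with.
print(model)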
This training process is a typical workflow for supervised learning, where a model learns to map
inputs (features) to outputs (target labels) based on provided data. Once trained, the model can
predict the class label of new, unseen instances based on the learned parameters.
Function: This line uses the trained model (the Gaussian Naive Bayes model fitted in the
previous steps) to predict the labels of the test dataset X_test.
Output: The predicted variable stores the predicted class label for each instance in the
X_test dataset. These predictions are based on the patterns the model learned during
training.
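Assuming the model and X_test from the steps above, the line being described reduces to:

# Predict a class label for every instance in the test set.
predicted = model.predict(X_test)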
Function: This block constructs a pandas DataFrame that includes both the features of
the test instances (sepal length, sepal width, petal length, petal width) and the
actual and predicted labels (Actual Value and Predicted Values). The slicing
X_test[:, i] extracts column i (all rows) from the NumPy array X_test.
Purpose: The DataFrame is useful for visually comparing the predictions against the
actual values to get a quick sense of how well the model is performing. Each row
represents a sample from the test set with its features, actual class, and predicted class.
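A sketch of the block being described, using the column names quoted above; y_test is assumed
to hold the actual test labels:

# Pair each test sample's features with its actual and predicted labels.
comparison = pd.DataFrame({
    "sepal length": X_test[:, 0],
    "sepal width": X_test[:, 1],
    "petal length": X_test[:, 2],
    "petal width": X_test[:, 3],
    "Actual Value": y_test,
    "Predicted Values": predicted,
})
print(comparison.head())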
Predict the outcomes of the test data, visualize the test data, and analyze
performance
Actual and predicted values
Classification metrics
o Accuracy - Accuracy is the most intuitive performance measure: it is simply the
ratio of correctly predicted observations to the total observations. High accuracy
suggests a good model, but on an imbalanced dataset accuracy alone can be
misleading, so read it alongside precision and recall. (A sketch computing these
metrics follows this list.)
Accuracy = (TP + TN) / (TP + FP + FN + TN)
o True Positives (TP) – the value of the actual class is yes and the value of the
predicted class is also yes. E.g. the actual class indicates that this passenger
survived, and the predicted class tells you the same thing.
o True Negatives (TN) – the value of the actual class is no and the value of the
predicted class is also no. E.g. the actual class says this passenger did not survive,
and the predicted class tells you the same thing.
o False Positives (FP) – the actual class is no but the predicted class is yes. E.g. the
actual class says this passenger did not survive, but the predicted class tells you that
this passenger survived.
o False Negatives (FN) – the actual class is yes but the predicted class is no. E.g. the
actual class indicates that this passenger survived, but the predicted class tells you
that the passenger died.
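A sketch of how these metrics can be computed with Scikit-learn's metrics module, assuming
y_test and predicted from the earlier steps:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Overall accuracy: the fraction of test samples classified correctly.
print("Accuracy:", accuracy_score(y_test, predicted))

# Confusion matrix: rows are actual classes, columns are predicted classes;
# the diagonal holds the correctly classified counts for each class.
print(confusion_matrix(y_test, predicted))

# Per-class precision, recall, and F1-score.
print(classification_report(y_test, predicted))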
Class Tasks
Submission Date: --
Take two datasets of your own choice, apply the Naïve Bayes and SVM algorithms to both
datasets, and compare their results.