Practical # 11


Department of Software Engineering

Mehran University of Engineering and Technology, Jamshoro

Course: SWE – Data Analytics and Business Intelligence


Instructor: Ms Sana Faiz        Practical/Lab No.: 11
Date:                           CLOs: 04
Signature:                      Assessment Score:

Topic: To understand the basics of Python


Objectives: To become familiar with supervised learning (Naïve Bayes and Support Vector Machines) using scikit-learn

Lab Discussion: Theoretical concepts and Procedural steps

Classification example
 Suppose you are a product manager and you want to classify customer reviews into positive and negative classes. Or, as a loan manager, you want to identify which loan applicants are safe and which are risky. As a healthcare analyst, you want to predict which patients are likely to suffer from diabetes. All of these examples pose the same kind of problem: classifying reviews, loan applicants, and patients.
Naïve Bayes
 Naive Bayes is one of the most straightforward and fastest classification algorithms, and it is well suited to large volumes of data.
 The Naive Bayes classifier is used successfully in applications such as spam filtering, text classification, sentiment analysis, and recommender systems. It uses Bayes' theorem of probability to predict the class of unknown instances.
Classification Workflow

 Whenever you perform classification, the first step is to understand the problem and identify potential features and the label. Features are the characteristics or attributes that affect the value of the label.
 For example, in the case of loan distribution, a bank manager identifies the customer's occupation, income, age, location, previous loan history, transaction history, and credit score. These characteristics are known as features, and they help the model classify customers.
Classification phases
 Classification has two phases: a learning phase and an evaluation phase.
 In the learning phase, the classifier trains its model on a given dataset; in the evaluation phase, the classifier's performance is tested. Performance is evaluated on the basis of parameters such as accuracy, error, precision, and recall.

What is Naive Bayes Classifier?


 The Naive Bayes classifier assumes that the effect of a particular feature in a class is independent of the other features. For example, whether a loan applicant is desirable or not depends on his/her income, previous loan and transaction history, age, and location. Even if these features are interdependent, they are still considered independently. This assumption simplifies computation, and that is why the classifier is considered naive. The assumption is called class conditional independence.

Naïve Bayes classifier

Bayes' theorem gives the posterior probability of a hypothesis h given observed data D:

P(h|D) = P(D|h) * P(h) / P(D)

P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
P(D): the probability of the data (regardless of the hypothesis). This is known as the prior probability of D, or the evidence.
P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
P(D|h): the probability of data D given that hypothesis h is true. This is known as the likelihood.
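
As a small worked example with hypothetical (made-up) numbers: suppose 20% of emails are spam, so P(h) = 0.2; the word "offer" appears in 60% of spam emails, so P(D|h) = 0.6; and "offer" appears in 15% of all emails, so P(D) = 0.15. Then the probability that an email containing "offer" is spam is:

P(h|D) = (0.6 * 0.2) / 0.15 = 0.12 / 0.15 = 0.8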
Importing libraries and loading dataset
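
The code for this step appears in the handout as a screenshot. Below is a minimal sketch of what it typically contains, assuming the Iris dataset from Scikit-learn (consistent with the sepal/petal features used later in this practical):

# Sketch: import the libraries used throughout this practical
import pandas as pd
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the Iris dataset: 150 samples, 4 features, 3 classes
iris = datasets.load_iris()
X = iris.data    # feature matrix (sepal/petal measurements)
y = iris.target  # class labels (0, 1, 2)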

Visualizing loaded data
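
The screenshot is likewise not reproduced; a sketch of a typical inspection step, reusing the objects loaded above:

# Sketch: put the loaded data into a DataFrame for a quick look
df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y
print(df.head())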

Training and building the classifier

1. Data Splitting: The data is split into training and testing sets using the
train_test_split function from Scikit-learn's model_selection module. The
parameters test_size=0.2 and random_state=0 are specified, which means that 20%
of the data is set aside for testing the model, and the split is reproducible (the same split
will occur each time the code is run due to the fixed random state).
2. Data Preparation: The training data (X_train) is then used to create a pandas
DataFrame for easy manipulation or inspection. The target variable (y_train), which
contains the class labels, is added to this DataFrame. This DataFrame isn’t actually used
in the model training but is typically useful for data exploration or checking that the data
has been loaded correctly.
3. Model Training:
o A Gaussian Naive Bayes model instance is created.
o The model is then fitted to the training data using the fit method, which takes
X_train (features) and y_train (target labels) as inputs. This step is where the
actual training happens. The Gaussian Naive Bayes algorithm estimates the
parameters of a Gaussian distribution for each class of the target variable.
o The fit method adjusts the model parameters (mean and variance for each
feature within each class) based on the provided data, effectively "learning" from
the data.
4. Model Output: After the training, the model can be used to make predictions on new
data or evaluate its performance on the test set. The print(model) statement will only
display the model's configuration, showing the parameters with which the model was
initialized.

This training process is a typical workflow for supervised learning, where a model learns to map
inputs (features) to outputs (target labels) based on provided data. Once trained, the model can
predict the class label of new, unseen instances based on the learned parameters.
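
A minimal sketch of the training code described in steps 1-4 above, assuming the Iris data loaded earlier (the handout's screenshot may differ in detail):

# 1. Split the data: 80% for training, 20% for testing, reproducible split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 2. Wrap the training data in a DataFrame for inspection (not used in training)
train_df = pd.DataFrame(X_train, columns=iris.feature_names)
train_df['target'] = y_train

# 3. Create a Gaussian Naive Bayes instance and fit it to the training data
model = GaussianNB()
model.fit(X_train, y_train)

# 4. Print the model's configuration
print(model)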

Visualizing training data

Predicting the outcomes and performance analysis


1. Making Predictions
predicted = model.predict(X_test)

 Function: This line uses the trained model (presumably a Gaussian Naive Bayes model
from previous context) to predict the labels of the test dataset X_test.
 Output: The predicted variable stores the predicted class labels for each instance in the X_test dataset. These predictions are based on the patterns the model learned during training.

2. Creating a DataFrame for Comparison


x = pd.DataFrame({
    'sepal length': X_test[:, 0],
    'sepal width': X_test[:, 1],
    'petal length': X_test[:, 2],
    'petal width': X_test[:, 3],
    'Actual Value': y_test,
    'Predicted Values': predicted
})
print(x)

 Function: This block constructs a pandas DataFrame that includes both the features of
the test instances (sepal length, sepal width, petal length, petal width) and the
actual and predicted labels (Actual Value and Predicted Values). The slicing
(X_test[:,i]) extracts all rows of the column i from the NumPy array X_test.
 Purpose: The DataFrame is useful for visually comparing the predictions against the
actual values to get a quick sense of how well the model is performing. Each row
represents a sample from the test set with its features, actual class, and predicted class.

3. Summarizing the Model's Performance


print(metrics.classification_report(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))
 Function: These lines use functions from sklearn.metrics to generate a classification
report and a confusion matrix.
o Classification Report: This report includes key metrics for each class, such as precision (the accuracy of positive predictions), recall (the ability of the classifier to find all positive samples), f1-score (the harmonic mean of precision and recall), and support (the number of true instances of each label). It gives a detailed summary of the model's performance per class and overall.
o Confusion Matrix: This is a table used to describe the performance of a
classification model on a set of test data for which the true values are known. It
tabulates the number of instances misclassified or correctly classified into each
class. Each row of the matrix represents the instances in an actual class, while
each column represents the instances in a predicted class.

Actual and predicted values

Performance analysis and confusion matrix

Support vector machine


 SVM often offers high accuracy compared to other classifiers such as logistic regression and decision trees. It is used in a variety of applications such as face detection, intrusion detection, classification of emails, news articles, and web pages, classification of genes, and handwriting recognition.
 SVM is an exciting algorithm, and its concepts are relatively simple: the classifier separates data points using the hyperplane with the largest margin. That is why an SVM classifier is also known as a discriminative classifier. SVM finds an optimal hyperplane which helps classify new data points.
 Although the Support Vector Machine is generally considered a classification approach, it can be employed in both classification and regression problems. It can easily handle multiple continuous and categorical variables.
 SVM constructs a hyperplane in multidimensional space to separate different classes.
 SVM generates the optimal hyperplane in an iterative manner so as to minimize error. The core idea of SVM is to find the maximum marginal hyperplane (MMH) that best divides the dataset into classes.

How does SVM work?


 The main objective is to segregate the given dataset in the best possible way. The distance between the hyperplane and the nearest data points on either side is known as the margin. The objective is to select a hyperplane with the maximum possible margin between the support vectors in the given dataset. SVM searches for the maximum marginal hyperplane in the following steps:
 Generate hyperplanes which segregate the classes in the best way. The left-hand figure shows three hyperplanes: black, blue, and orange. Here, the blue and orange hyperplanes have higher classification error, while the black one separates the two classes correctly.
 Select the hyperplane with the maximum separation from the nearest data points on either side, as shown in the right-hand figure.
Importing libraries and loading dataset

Training the classifier and fitting the dataset to the model
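
The handout shows this code as a screenshot; below is a minimal sketch, assuming the same Iris data and train/test split as in the Naïve Bayes example. The linear kernel is an assumption, not necessarily the handout's choice:

# Sketch: train a support vector classifier (kernel choice is assumed)
from sklearn.svm import SVC

svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)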

Predicting the outcomes of the test data, visualizing the test data, and analyzing performance
Actual and predicted values

Performance analysis and confusion matrix
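
Again the code appears as a screenshot; a sketch of the prediction and evaluation steps, mirroring the Naïve Bayes workflow above:

# Sketch: predict test labels with the trained SVM and summarize performance
svm_predicted = svm_model.predict(X_test)
print(metrics.classification_report(y_test, svm_predicted))
print(metrics.confusion_matrix(y_test, svm_predicted))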

Classification metrics
o Accuracy - Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to total observations. High accuracy does not by itself mean the model is best; it is most meaningful when the classes are balanced.
 Accuracy = (TP + TN) / (TP + TN + FP + FN)
o True Positives (TP) – when the actual class is yes and the predicted class is also yes. E.g. the actual class indicates that this passenger survived, and the predicted class tells you the same thing.
o True Negatives (TN) – when the actual class is no and the predicted class is also no. E.g. the actual class says this passenger did not survive, and the predicted class tells you the same thing.
o False Positives (FP) – when the actual class is no but the predicted class is yes. E.g. the actual class says this passenger did not survive, but the predicted class tells you that this passenger survived.
o False Negatives (FN) – when the actual class is yes but the predicted class is no. E.g. the actual class indicates that this passenger survived, but the predicted class tells you that the passenger did not survive.

o Precision - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all the passengers labeled as survived, how many actually survived? High precision relates to a low false positive rate.
 Precision = TP / (TP + FP)
o Recall (Sensitivity) - Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. The question recall answers is: of all the passengers that truly survived, how many did we label as survived?
 Recall = TP / (TP + FN)
o F1 score - The F1 score is the harmonic mean of Precision and Recall, so it takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy. Accuracy works best if false positives and false negatives have similar cost; if their costs are very different, it is better to look at both Precision and Recall.
 F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
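
These metrics can also be computed directly with Scikit-learn. A minimal sketch, reusing the y_test and predicted arrays from the Naïve Bayes example above (average='macro' is an assumed choice for the multi-class Iris labels):

# Sketch: compute the metrics defined above with sklearn.metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy: ', accuracy_score(y_test, predicted))
print('Precision:', precision_score(y_test, predicted, average='macro'))
print('Recall:   ', recall_score(y_test, predicted, average='macro'))
print('F1 score: ', f1_score(y_test, predicted, average='macro'))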

Class Tasks
Submission Date: --

Take two datasets of your own choice, apply the Naïve Bayes and SVM algorithms to both datasets, and compare their results.
