Data Science
UNIT IV
Machine Learning: Modeling, Overfitting and Underfitting, Correctness, The Bias-Variance
Tradeoff, Feature Extraction and Selection, k-Nearest Neighbors, Naive Bayes, Simple Linear
Regression, Multiple Regression, Digression, Logistic Regression
UNIT V
Clustering: The Idea, The Model, Choosing k, Bottom-Up Hierarchical Clustering. Recommender
Systems: Manual Curation, Recommending What’s Popular, User-Based Collaborative Filtering
UNIT IV
Machine Learning:
Modeling,
A machine learning model is a program that can find patterns or
make decisions from a previously unseen dataset. For example, in
natural language processing, machine learning models can parse
and correctly recognize the intent behind previously unheard
sentences or combinations of words.
Hence, in simple words, we can say that a machine learning model is a simplified
representation of something or a process. In this topic, we will discuss different machine
learning models and their techniques and algorithms.
There are various types of machine learning models available based on different
business goals and data sets.
Classification of Machine Learning Models:
Based on different business goals and data sets, there are three learning models for
algorithms. Each machine learning algorithm settles into one of these three models:
o Supervised Learning
o Unsupervised Learning
o Reinforcement Learning
Supervised learning is further divided into two tasks:
o Classification
o Regression
Unsupervised learning is further divided into:
o Clustering
o Association Rule Mining
o Dimensionality Reduction
Regression
In regression problems, the output is a continuous variable. Some commonly used
Regression models are as follows:
a) Linear Regression
Linear regression is the simplest machine learning model, in which we try to predict
one output variable using one or more input variables. The representation of linear
regression is a linear equation, which combines a set of input values (x) with the predicted
output (y) for those input values. It is represented in the form of a line:
y = bx + c
The main aim of the linear regression model is to find the line that best fits the
data points.
Linear regression is extended to multiple linear regression (find a plane of best fit) and
polynomial regression (find the best fit curve).
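As a quick, hedged illustration of this idea, the sketch below fits a simple linear regression with scikit-learn on a tiny invented dataset (the numbers are made up purely for demonstration):

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: one input feature (x) and one continuous output (y), roughly y = 2x + 1
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 5, 7, 9, 11])

model = LinearRegression()
model.fit(x, y)                          # learns the slope (b) and intercept (c)

print(model.coef_[0], model.intercept_)  # approximately 2.0 and 1.0
print(model.predict([[6]]))              # predicted y for x = 6, approximately 13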
b) Decision Tree
Decision trees are the popular machine learning models that can be used for both
regression and classification problems.
A decision tree uses a tree-like structure of decisions along with their possible
consequences and outcomes. Each internal node represents a test on an attribute, and
each branch represents an outcome of that test. Adding more nodes lets the tree fit the
training data more closely, but very deep trees tend to overfit.
The advantage of decision trees is that they are intuitive and easy to implement and
interpret, but a single tree is often less accurate than ensemble methods.
c) Random Forest
Random Forest is the ensemble learning method, which consists of a large number of
decision trees. Each decision tree in a random forest predicts an outcome, and the
prediction with the majority of votes is considered as the outcome.
A random forest model can be used for both regression and classification problems.
For the classification task, the outcome of the random forest is taken from the majority
of votes. Whereas in the regression task, the outcome is taken from the mean or
average of the predictions generated by each tree.
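A minimal sketch of this voting idea with scikit-learn, on a small invented dataset (the values are chosen only for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy data: two features per sample and a binary class label
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0], [2, 2], [3, 3]])
y = np.array([0, 1, 0, 0, 1, 1])

# 100 decision trees; the forest's prediction is the majority vote of the trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict([[2, 3]]))        # class chosen by the majority of trees
print(forest.predict_proba([[2, 3]]))  # fraction of trees voting for each class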
d) Neural Networks
Neural networks are a subset of machine learning and are also known as artificial
neural networks. Neural networks are made up of artificial neurons and are designed in a
way that resembles the structure and working of the human brain. Each artificial neuron
connects with many other neurons in a neural network, and millions of such connected
neurons create a sophisticated cognitive structure.
Neural networks have a multilayer structure, containing one input layer, one or
more hidden layers, and one output layer. Because each neuron is connected to neurons
in the next layer, data is passed from one layer to the next until it
reaches the last (output) layer of the neural network, which generates the output.
Neural networks depend on training data to learn and improve their accuracy.
However, a perfectly trained & accurate neural network can cluster data quickly and
become a powerful machine learning and AI tool. One of the best-known neural
networks is Google's search algorithm.
Classification
Classification models are the second type of Supervised Learning techniques, which
are used to generate conclusions from observed values in the categorical form. For
example, the classification model can identify if the email is spam or not; a buyer will
purchase the product or not, etc. Classification algorithms are used to predict discrete
classes and categorize the output into different groups.
In classification, a classifier model is designed that classifies the dataset into different
categories, and each category is assigned a label.
a) Logistic Regression
Logistic regression is a supervised classification algorithm that predicts the probability
that an input belongs to a particular class; it is discussed in detail later in this unit.
b) Support Vector Machine (SVM)
Support vector machine or SVM is a popular machine learning algorithm, which is
widely used for classification and regression tasks. However, it is primarily used to
solve classification problems. The main aim of SVM is to find the best decision
boundary in an N-dimensional space that can segregate the data points into classes;
this best decision boundary is known as the hyperplane. SVM selects the extreme
vectors that help in finding the hyperplane, and these vectors are known as support vectors.
c) Naïve Bayes
Naïve Bayes is another popular classification algorithm used in machine learning. It is
called so because it is based on Bayes' theorem and follows the naïve (independence)
assumption between the features, which can be written as:
P(x1, x2, ..., xn | y) = P(x1 | y) * P(x2 | y) * ... * P(xn | y)
Each naïve Bayes classifier assumes that the value of a specific feature is independent
of any other feature. For example, if a fruit needs to be classified based on
colour, shape, and taste, then a fruit that is yellow, oval, and sweet will be recognized
as a mango; here each feature is treated as independent of the other features.
Unsupervised learning models are mainly used to perform three tasks: clustering,
association rule learning, and dimensionality reduction.
o Clustering
Clustering is an unsupervised learning technique that involves grouping
the data points into different clusters based on similarities and differences. The objects
with the most similarities remain in the same group and have few or no
similarities with objects in other groups.
Clustering algorithms are widely used in tasks such as image
segmentation, statistical data analysis, market segmentation, etc.
Some commonly used clustering algorithms are K-means Clustering, Hierarchical
Clustering, DBSCAN, etc.
Below is one popular algorithm that comes under reinforcement learning:
o Q-Learning: It aims to learn a policy that helps the AI agent take the best action for
maximizing the reward in a given situation. It maintains a Q-value for each
state-action pair, indicating the expected reward of following a given action from that
state, and it tries to maximize the Q-value.
Is a machine learning model the same as an algorithm? The answer is no; a machine
learning model is not the same as an algorithm. Put simply, an ML algorithm is a
procedure or method that runs on data to discover patterns and generate the model,
while a machine learning model is the computer program produced by that process,
which generates output or makes predictions. More specifically, when we train an
algorithm with data, it becomes a model.
Overfitting and Underfitting
Before understanding overfitting and underfitting, let's understand some basic
terms that will help in understanding this topic well:
o Signal: The true underlying pattern of the data that helps the machine
learning model learn from the data.
o Noise: Unnecessary and irrelevant data that reduces the performance of the model.
o Bias: A prediction error introduced in the model due to oversimplifying
the machine learning algorithm; it is the difference between the predicted values
and the actual values.
o Variance: Variance occurs when the machine learning model performs well with the
training dataset but does not perform well with the test dataset.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points,
or more data points than required, in the given dataset. Because of this,
the model starts capturing the noise and inaccurate values present in the dataset, and all
these factors reduce the efficiency and accuracy of the model. The overfitted model
has low bias and high variance.
Example: The concept of overfitting can be understood from the below graph of a
linear regression output.
As we can see from the graph, the model tries to cover all the data points
present in the scatter plot. It may look efficient, but in reality it is not, because the
goal of the regression model is to find the best-fit line; since no single line fits every
point exactly, chasing them all produces prediction errors on unseen data.
Ways to avoid overfitting include the following (a small regularization sketch follows this list):
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
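As one concrete illustration of the regularization item in the list above, here is a small, hedged sketch using Ridge regression from scikit-learn; the data and the alpha value are invented purely for demonstration:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# invented noisy data: the output depends mainly on the first of five features
rng = np.random.RandomState(0)
x = rng.rand(20, 5)
y = x[:, 0] * 3.0 + rng.normal(scale=0.1, size=20)

plain = LinearRegression().fit(x, y)
ridge = Ridge(alpha=1.0).fit(x, y)   # alpha penalizes large coefficients

# the penalty shrinks the coefficients, limiting how closely the model can chase noise
print(plain.coef_)
print(ridge.coef_)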
Underfitting
Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting, the feeding of training
data can be stopped at an early stage, but then the model may not learn enough
from the training data. As a result, it may fail to find the best fit for the dominant trend
in the data.
In the case of underfitting, the model is not able to learn enough from the training
data; hence its accuracy is reduced and it produces unreliable predictions.
Example: We can understand underfitting using the below output of the linear
regression model.
As we can see from the diagram, the model is unable to capture the trend of the data
points present in the plot.
Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the machine
learning models to achieve the goodness of fit. In statistics modeling, it defines how
closely the result or predicted values match the true values of the dataset.
The model with a good fit is between the underfitted and overfitted model, and ideally,
it makes predictions with 0 errors, but in practice, it is difficult to achieve it.
As when we train our model for a time, the errors in the training data go down, and
the same happens with test data. But if we train the model for a long duration, then
the performance of the model may decrease due to the overfitting, as the model also
learn the noise present in the dataset. The errors in the test dataset start increasing, so
the point, just before the raising of errors, is the good point, and we can stop here for
achieving a good model.
There are two other methods by which we can get a good point for our model, which are
the resampling method to estimate model accuracy and validation dataset.
Correctness,
Accuracy is a metric that measures how often a machine learning
model correctly predicts the outcome. You can calculate accuracy
by dividing the number of correct predictions by the total number
of predictions. In other words, accuracy answers the question:
how often the model is right?
In real-life situations, the problems we model are rarely as clean as textbook examples.
We often work with datasets that are imbalanced or involve multiclass or multilabel
classification problems. When solving a problem with any machine learning method,
a high accuracy score is not always our main goal; as you solve more problems in ML,
choosing and using the right correctness metric becomes more important and requires
additional thought.
Accuracy:
Accuracy is mainly used in classification problems to express, as a percentage, how
many of the model's predictions are correct. For any model, accuracy tells us what
fraction of all predictions were correct with respect to the total number of predictions
made. We can calculate the accuracy of any model by dividing the number of correctly
predicted examples by the total number of predictions made:
Accuracy = Number of correct predictions / Total number of predictions = (TP + TN) / (TP + TN + FP + FN)
This formula is very useful for calculating the accuracy of any model and gives
a simple understanding of a binary classification problem.
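A minimal sketch of this calculation with scikit-learn, using small made-up label vectors:

from sklearn.metrics import accuracy_score

# invented true labels and model predictions for illustration
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# 6 of the 8 predictions match, so accuracy = 6 / 8 = 0.75
print(accuracy_score(y_true, y_pred))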
Accuracy Paradox:
When we evaluate a model, the default accuracy metric summarizes performance over
the whole dataset.
However, the overall accuracy of a classification model can be misleading when the
distribution of examples across classes is imbalanced, because it then becomes very
difficult to judge how well the model predicts each class. In this situation, the class
with the most occurrences will tend to be predicted correctly and drive a high accuracy
score, whereas the class with few occurrences will be misclassified. When this happens,
the reported accuracy gives a wrong impression, and we cannot judge the performance
of the model correctly from it alone.
For example, in a health-issue prediction model, we cannot afford to misclassify the
rare harmful-disease cases. If those cases are misclassified, the model can still report
a high overall accuracy while the patients who matter most receive the wrong diagnosis.
Now let us take an example that uses a breast-cancer dataset, which is used to
classify breast tumour cases.
The dataset is imbalanced: as the value counts below show, the harmful ('Bad')
tumour cases make up only about 5.5% of the labels.
tumour_data_imbalanced["labels"].value_counts(normalize = True)
Output:
Good 0.854698
Bad 0.0548726
Name: labels, dtype: float64
Output:
0.9854126
Our model achieved a high overall accuracy for the entire dataset (see the output
above). This result looks, on the face of it, strikingly good. However, if we
investigate the class-level predictions using a confusion matrix, we get a totally
different picture.
Our model misdiagnosed practically all of the harmful cases. The outcome is the very
opposite of what we anticipated based on the overall accuracy metric. This situation is
a typical illustration of the accuracy paradox: while you achieve a high accuracy value,
it gives you a false sense of confidence when your dataset is highly imbalanced and
misprediction of the minority class is costly.
It does not matter if your model achieves 99.99% accuracy if missing a single case is
enough to disrupt the whole system. Relying on the accuracy score as calculated above
is not sufficient and can be misleading.
Other useful correctness metrics include:
o Precision: Precision is the proportion of correct positive predictions among all examples
the model predicted as positive, i.e., TP / (TP + FP).
o Recall: Recall is the proportion of correct positive predictions among all examples that
actually belong to the positive class, i.e., TP / (TP + FN).
o F-score: The F-score is a metric that combines precision and recall (their harmonic mean).
o Confusion matrix: A confusion matrix is a tabular summary of the true/false
positive and negative prediction counts (see the sketch below).
o ROC curve: The ROC curve is a diagnostic plot for binary classifiers that shows the true
positive rate against the false positive rate at different decision thresholds.
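A minimal sketch of computing these metrics with scikit-learn, on small invented label vectors:

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# invented true labels and predictions for illustration
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3 / 4
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3 / 4
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))  # rows are actual classes, columns are predicted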
In binary classification there are only two classes, positive and negative, which gives four
possible prediction outcomes:
o TP: TP stands for true positive, i.e., examples of the positive class that are correctly
predicted as positive.
o FP: FP stands for false positive, i.e., examples of the negative class that are incorrectly
predicted as positive.
o TN: TN stands for true negative, i.e., examples of the negative class that are correctly
predicted as negative.
o FN: FN stands for false negative, i.e., examples of the positive class that are incorrectly
predicted as negative.
o Now, we select the number of correct predictions from the table and then
find the accuracy of the model using the formula given above.
o Multiclass Accuracy = (7 + 5 + 3) / 31 = 0.483
o This result says that our model achieved roughly 48% accuracy
on this multiclass classification problem.
o Accuracy in Multilabel Problems:
o In multiclass classification problems, the classes of the dataset are mutually
exclusive: each example belongs to exactly one class. Multilabel classification
is different, because its classes are not mutually exclusive and a single example
can carry several labels at once. In machine learning, we can represent a
multilabel classification task as multiple binary classification problems.
Example:
Output:
0.45827400
If the Hamming score we get after the calculation is closer to one, the
performance of the model is good.
Hamming Loss:
Hamming loss is the fraction of individual labels that are predicted
wrongly. It takes values in the range 0 to 1, where 0 represents
no label errors in the model.
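The code that produced the output below is not reproduced in these notes; a minimal sketch of computing Hamming loss for a small invented multilabel example with scikit-learn could look like this:

import numpy as np
from sklearn.metrics import hamming_loss

# invented multilabel ground truth and predictions (3 samples, 4 labels each)
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 1, 1, 1]])

# 2 of the 12 individual label predictions are wrong, so the loss is 2/12 ≈ 0.167
print(hamming_loss(y_true, y_pred))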
Output:
0.03501485
If the Hamming loss we get after the calculation is closer to zero, the performance of the
model is good.
Besides these metrics, you can use the multilabel versions of the same classification
metrics you have seen in the binary and multiclass cases (e.g., precision,
recall, F-score). You can also apply averaging methods (micro, macro, and sample-
based) or ranking-based metrics.
In machine learning, we often work with a large number of labels and large datasets,
and it can take considerable work to predict all of them correctly. Using the
accuracy techniques above, we can evaluate models fairly easily. The subset accuracy
technique, however, tends to report low performance compared with the other techniques.
This metric gives no credit for partially correct predictions because of the strict
standard it relies on: if our model fails to predict just a
single label out of the 103 but performs well on the rest, subset accuracy still
classifies those predictions as failures.
If you really want to use accuracy metrics in your project or work, there are
easy-to-use packages such as Deepchecks that give you in-depth reports
on the relevant metrics for evaluating your model. This makes it easier for you to
better understand your model's performance.
The Bias-Variance Tradeoff,
Errors in a machine learning model can be grouped into two kinds:
o Reducible errors: These errors can be reduced to improve the model accuracy. They
can further be classified into bias and variance.
o Irreducible errors: These errors will always be present in the model, regardless of which
algorithm has been used. Their cause is unknown variables whose influence on the output
cannot be reduced.
What is Bias?
In general, a machine learning model analyses the data, finds patterns in it and makes
predictions. While training, the model learns these patterns in the dataset and applies
them to test data for prediction. While making predictions, a difference occurs
between the values predicted by the model and the actual/expected
values, and this difference is known as bias error, or error due to bias. It can be
defined as an inability of machine learning algorithms such as Linear Regression to
capture the true relationship between the data points. Each algorithm begins with
some amount of bias because bias occurs from assumptions in the model, which
makes the target function simple to learn. A model has either:
o Low Bias: A low bias model will make fewer assumptions about the form of the target
function.
o High Bias: A model with a high bias makes more assumptions, and the model becomes
unable to capture the important features of our dataset. A high bias model also
cannot perform well on new data.
Generally, linear algorithms have high bias, which makes them learn fast. The simpler
the algorithm, the more bias it is likely to introduce, whereas nonlinear
algorithms often have low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-
Nearest Neighbours and Support Vector Machines. Algorithms with high bias include
Linear Regression, Linear Discriminant Analysis and Logistic Regression.
What is Variance?
Low variance means there is a small variation in the prediction of the target function
with changes in the training dataset, while high variance shows a large
variation in the prediction of the target function with changes in the training dataset.
A model with high variance learns a lot from, and performs well on, the training
dataset, but does not generalize well to unseen data. As a result, such a
model gives good results on the training dataset but shows high error rates on the
test dataset.
Since, with high variance, the model learns too much from the dataset, it leads to
overfitting of the model. A model with high variance has the below problems:
o A high variance model leads to overfitting.
o It increases the model's complexity.
Usually, nonlinear algorithms, which have a lot of flexibility in fitting the model, have high
variance.
Some examples of machine learning algorithms with low variance are Linear
Regression, Logistic Regression, and Linear Discriminant Analysis. Algorithms with
high variance include Decision Trees, Support Vector Machines, and k-Nearest Neighbours.
o A model with high bias (underfitting) shows a high training error, and the test error is
almost the same as the training error.
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias
and variance in order to avoid overfitting and underfitting in the model. If the model
is very simple with fewer parameters, it may have low variance and high bias. Whereas,
if the model has a large number of parameters, it will have high variance and low bias.
So, it is required to make a balance between bias and variance errors, and this balance
between the bias error and variance error is known as the Bias-Variance trade-off.
For an accurate prediction, a model needs both low variance and low bias.
But this is not fully possible, because bias and variance are related to each other:
decreasing variance tends to increase bias, and decreasing bias tends to increase variance.
Hence, the bias-variance trade-off is about finding the sweet spot that balances
bias error and variance error.
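As a rough, hedged illustration of this trade-off (the data, polynomial degrees, and noise level below are all invented for demonstration): a very low-degree polynomial underfits (high bias), while a very high-degree polynomial overfits (high variance), which shows up as a gap between training and test error.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# noisy samples of a smooth curve
rng = np.random.RandomState(0)
x = np.sort(rng.rand(60, 1), axis=0)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=60)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # high bias, balanced, high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(x_train)),   # training error
          mean_squared_error(y_test, model.predict(x_test)))     # test error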
Feature Extraction and Selection,
Feature selection and feature extraction are both techniques used in machine
learning to improve the performance of models. However, they approach this goal in
fundamentally different ways:
Feature Selection:
What it does: Selects a subset of the original features that are most relevant to the
prediction task. It's like picking the most informative ingredients from a recipe.
Benefits: Reduces training time and computational cost by focusing on the most
important features. Improves model interpretability by highlighting the key factors
influencing predictions. Can help to avoid overfitting by reducing the dimensionality
of the data.
Challenges: Choosing the right selection method depends on the data and the
problem. Might discard potentially useful information if the selection process isn't
careful.
Feature Extraction:
What it does: Creates entirely new features from the original data. It's like
combining ingredients in a new way to create a more flavorful dish.
Benefits: Can capture hidden patterns or relationships within the data that might not
be evident in the original features. May lead to more accurate and robust models by
representing the data in a more informative way.
Challenges: Designing effective feature extraction methods requires domain
knowledge and expertise. Can increase the dimensionality of the data, potentially
leading to overfitting if not addressed.
To summarize the key differences: feature selection keeps a subset of the original
features, while feature extraction creates new features from combinations or
transformations of the original ones.
The choice between feature selection and feature extraction depends on the specific
problem and data. It is important to evaluate the impact of both approaches on your
machine learning model's performance to determine the best course of action.
Feature selection is one of the important concepts of machine learning, which highly
impacts the performance of the model. As machine learning works on the concept of
"Garbage In Garbage Out", so we always need to input the most appropriate and
relevant dataset to the model in order to get a better result.
In this topic, we will discuss different feature selection techniques for machine learning.
But before that, let's first understand some basics of feature selection.
So, we can define feature Selection as, "It is a process of automatically or manually
selecting the subset of most appropriate and relevant features to be used in model
building." Feature selection is performed by either including the important features
or excluding the irrelevant features in the dataset without changing them.
Selecting the best features helps the model to perform well. For example, Suppose we
want to create a model that automatically decides which car should be crushed for a
spare part, and to do this, we have a dataset. This dataset contains a Model of the car,
Year, Owner's name, Miles. So, in this dataset, the name of the owner does not
contribute to the model performance as it does not decide if the car should be crushed
or not, so we can remove this column and select the rest of the features(column) for
the model building.
1. Wrapper Methods
In the wrapper methodology, the selection of features is treated as a search
problem, in which different combinations of features are made, evaluated, and compared with
other combinations. The algorithm is trained iteratively using a subset of features;
based on the output of the model, features are added or removed, and the model is
trained again with the new feature set.
2. Filter Methods
In filter methods, features are selected on the basis of statistical measures. This method
does not depend on the learning algorithm and chooses the features as a pre-
processing step.
The filter method filters out irrelevant features and redundant columns from the
model by ranking them with different metrics.
The advantage of filter methods is that they need little computational time and
do not overfit the data. Some common filter techniques are:
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
Information Gain: Information gain determines the reduction in entropy while
transforming the dataset. It can be used as a feature selection technique by calculating
the information gain of each variable with respect to the target variable.
Fisher's Score: Fisher's score ranks each feature by the ratio of between-class variance
to within-class variance; features with a larger score are considered more relevant,
while low-scoring features can be dropped.
Missing Value Ratio: The missing value ratio can be used to evaluate a feature against
a threshold value. It is obtained by dividing the number of missing values in a column
by the total number of observations; variables whose ratio exceeds the threshold can
be dropped.
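A minimal pandas sketch of this ratio, using a small invented DataFrame and a hypothetical threshold of 0.4:

import numpy as np
import pandas as pd

# invented data with some missing values
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40, 35],
    "salary": [np.nan, np.nan, np.nan, 50000, 60000],
    "miles":  [10000, 20000, 15000, 30000, 25000],
})

missing_ratio = df.isnull().sum() / len(df)    # missing values per column / total rows
print(missing_ratio)

threshold = 0.4                                # hypothetical cut-off
keep = missing_ratio[missing_ratio <= threshold].index
df_selected = df[keep]                         # 'salary' (60% missing) is dropped
print(df_selected.columns.tolist())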
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by
considering the interaction of features while keeping the computational cost low. They
are fast, like filter methods, but more accurate.
These methods are iterative: each iteration of model training is evaluated, and the
features that contribute most to that iteration are retained. Common embedded
techniques include regularization (for example, LASSO, which shrinks the coefficients of
unimportant features towards zero) and tree-based feature importance (for example,
Random Forest importance).
To choose a statistical measure, we first need to identify the types of the input and
output variables. In machine learning, variables are of mainly two types: numerical
and categorical.
Below are some univariate statistical measures that can be used for filter-based
feature selection:
o Numerical input with numerical output is the predictive regression modelling case;
the common measure here is the correlation coefficient.
o Categorical input with categorical output commonly uses the Chi-Squared test;
Information Gain can also be used in this case.
We can summarise the above cases with the appropriate measures as follows:
Numerical input, numerical output:
o Pearson's correlation coefficient (for linear correlation).
o Spearman's rank coefficient (for non-linear correlation).
Numerical input, categorical output:
o ANOVA correlation coefficient (linear).
o Kendall's rank coefficient (nonlinear).
Categorical input, numerical output:
o Kendall's rank coefficient (linear).
o ANOVA correlation coefficient (nonlinear).
Categorical input, categorical output:
o Chi-Squared test (contingency tables).
o Mutual Information.
Conclusion
Feature selection is a complicated and vast field of machine learning, and many
studies have already been carried out to discover the best methods. There is no fixed rule
for the best feature selection method; choosing one depends on the machine
learning engineer, who can combine and innovate approaches to find the best method
for a specific problem. One should try a variety of model fits on different subsets of
features selected through different statistical measures.
k-Nearest Neighbors,
K-Nearest Neighbor(KNN) Algorithm for
Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training
set immediately instead it stores the dataset and at the time of classification, it
performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification, we can use the
KNN algorithm, as it works on a similarity measure. Our KNN model will compare the
features of the new image with those of the cat and dog images and, based on the most
similar features, put it in either the cat or the dog category.
Suppose we have a new data point that we need to put into the required category; consider
the below image. The working of K-NN can be explained with the following steps (a small
distance-computation sketch follows this list):
o Firstly, we will choose the number of neighbours; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in
geometry. Between points (x1, y1) and (x2, y2) it is calculated as:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
o By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point
must belong to category A.
o There is no particular way to determine the best value for "K", so we need to try several
values to find the best one. The most commonly preferred value for K is 5.
o A very low value of K, such as K = 1 or K = 2, can be noisy and make the model sensitive
to outliers.
o Larger values of K smooth out noise, but if K is too large the model may miss smaller
local patterns.
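A minimal from-scratch sketch of these steps (Euclidean distance followed by a majority vote among the k nearest points), using a tiny invented 2-D dataset:

import math
from collections import Counter

# invented labelled points: (x, y, category)
training_data = [(1.0, 1.0, "A"), (1.5, 2.0, "A"), (2.0, 1.0, "A"),
                 (6.0, 6.0, "B"), (7.0, 5.5, "B")]

def knn_predict(new_point, data, k=3):
    # Euclidean distance from the new point to every training point
    distances = [(math.dist(new_point, (x, y)), label) for x, y, label in data]
    # take the k closest points and vote on their categories
    k_nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((2.0, 2.0), training_data, k=3))   # prints "A"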
Problem for the K-NN Algorithm: A car manufacturer has produced a new SUV, and the
company wants to show ads to the users who are interested in buying that SUV. For this
problem, we have a dataset that contains information about multiple users collected from
a social network. The dataset contains lots of information, but we will consider
Estimated Salary and Age as the independent variables and Purchased as the dependent
variable. Below is the dataset:
Steps to implement the K-NN algorithm:
The Data Pre-processing step will remain exactly the same as Logistic Regression.
Below is the code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting the independent (Age, Estimated Salary) and dependent (Purchased) variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
By executing the above code, our dataset is imported to our program and well pre-
processed. After feature scaling our test dataset will look like:
From the above output image, we can see that our data is successfully scaled.
o Fitting K-NN classifier to the Training data:
Now we will fit the K-NN classifier to the training data. To do this we will import
the KNeighborsClassifier class of Sklearn Neighbors library. After importing
the class, we will create the Classifier object of the class. The parameters of this
class are:
o n_neighbors: To define the required neighbors of the algorithm. Usually,
it takes 5.
o metric='minkowski': This is the default parameter and it decides the
distance between the points.
o p=2: It is equivalent to the standard Euclidean metric.
And then we will fit the classifier to the training data. Below is the code for it:
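The fitting code itself is not reproduced in these notes; a minimal sketch consistent with the parameters described above would be:

# Fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)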
Output: By executing the above code, we will get the output as:
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will create
a y_pred vector as we did in Logistic Regression. Below is the code for it:
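The prediction code is not shown in the notes; a minimal sketch would be:

# Predicting the test set result
y_pred = classifier.predict(x_test)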
Output:
To check the accuracy of our K-NN model, we create the confusion matrix: we import the
confusion_matrix function and call it, storing the result in the variable cm.
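A minimal sketch of this step (the original code is not reproduced in the notes):

# Creating the confusion matrix from the actual and predicted test labels
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)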
Output: By executing the above code, we will get the matrix as below:
In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7
incorrect predictions, whereas, in Logistic Regression, there were 11 incorrect predictions.
So we can say that the performance of the model is improved by using the K-NN
algorithm.
o Visualizing the Training set result: We visualize the training set result in the same way
as in Logistic Regression.
Output:
The output graph is different from the graph we obtained with Logistic
Regression. It can be understood through the below points:
o As we can see the graph is showing the red point and green points. The
green points are for Purchased(1) and Red Points for not Purchased(0)
variable.
o The graph is showing an irregular boundary instead of showing any
straight line or any curve because it is a K-NN algorithm, i.e., finding the
nearest neighbor.
o The graph has classified users in the correct categories as most of the
users who didn't buy the SUV are in the red region and users who bought
the SUV are in the green region.
o The graph shows a good result, but there are still some green points
in the red region and red points in the green region. This is not a big
issue, as allowing these few misclassifications helps prevent the model
from overfitting.
o Hence our model is well trained.
o Visualizing the Test set result:
After the training of the model, we will now test the result by putting a new
dataset, i.e., Test dataset. Code remains the same except some minor changes:
such as x_train and y_train will be replaced by x_test and y_test.
Below is the code for it:
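The plotting code is omitted in the notes; a hedged sketch of the usual decision-region plot, assuming the variable names used earlier (classifier, x_test, y_test, and the nm/mtp aliases), could look like this:

from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
# build a fine grid covering the (scaled) feature space
x1, x2 = nm.meshgrid(
    nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
    nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
# colour each grid point by the class the classifier predicts for it
mtp.contourf(x1, x2,
             classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# overlay the actual test observations
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=('red', 'green')[i], label=j)
mtp.title('K-NN algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()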
Output:
The above graph is showing the output for the test data set. As we can see in the graph,
the predicted output is well good as most of the red points are in the red region and most
of the green points are in the green region.
However, there are a few green points in the red region and a few red points in the green
region. These are the incorrect observations that we observed in the confusion
matrix (7 incorrect outputs).
Naive Bayes,
Naïve Bayes Classifier Algorithm
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick
predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Suppose we have a dataset of weather conditions with the corresponding target variable
"Play", and we want to decide whether to play on a Sunny day.
Dataset:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the Outlook feature:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table:
Weather No Yes
Overcast 0 5 P(Overcast) = 5/14 = 0.35
Rainy 2 2 P(Rainy) = 4/14 = 0.29
Sunny 2 3 P(Sunny) = 5/14 = 0.35
All P(No) = 4/14 = 0.29 P(Yes) = 10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), we conclude that on a Sunny day the player can play the game.
Types of Naïve Bayes Model: There are three types of Naïve Bayes model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes
that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomial distributed. It is primarily used for document classification problems, it
means a particular document belongs to which category such as Sports, Politics,
education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the
predictor variables are the independent Booleans variables. Such as if a particular word
is present or not in a document. This model is also famous for document classification
tasks.
Steps to implement:
o Data Pre-processing step
o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.
For the data pre-processing step, as in the earlier examples, the dataset is loaded with
dataset = pd.read_csv('user_data.csv'), split into a training set and a test set, and the
feature variables are scaled.
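The fitting code is not reproduced in the notes; a minimal sketch would be:

# Fitting Naïve Bayes (GaussianNB) to the training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)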
In the above code, we have used the GaussianNB classifier to fit it to the training
dataset. We can also use other classifiers as per our requirement.
Output:
Output:
The above output shows the result for prediction vector y_pred and real vector y_test.
We can see that some predications are different from the real values, which are the
incorrect predictions.
Output:
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect
predictions, and 65+25=90 correct predictions.
Output:
In the above output, we can see that the Naïve Bayes classifier has segregated the data
points with a fine boundary. The boundary is a Gaussian curve, as we have
used the GaussianNB classifier in our code.
Output:
The above output is the final output for the test set data. As we can see, the classifier has
created a Gaussian curve to divide the "purchased" and "not purchased" variables.
There are some wrong predictions, which we counted in the confusion matrix, but
it is still a pretty good classifier.
Simple Linear Regression
The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on
continuous or categorical values.
o Model the relationship between the two variables. Such as the relationship between
Income and expenditure, experience and Salary, etc.
o Forecasting new observations. Such as Weather forecasting according to
temperature, Revenue of a company according to the investments in a year, etc.
y= a0+a1x+ ε
Where,
a0= It is the intercept of the Regression line (can be obtained putting x=0)
a1= It is the slope of the regression line, which tells whether the line is increasing
or decreasing.
ε = The error term. (For a good model it will be negligible)
Here we are taking a dataset that has two variables: salary (dependent variable) and
experience (independent variable). The goals of this problem are:
o To find out if there is any correlation between these two variables,
o To find the best-fit line for the dataset, and
o To see how the dependent variable changes with the independent variable.
In this section, we will create a Simple Linear Regression model to find out the best
fitting line for representing the relationship between these two variables.
To implement the Simple Linear regression model in machine learning using Python,
we need to follow the below steps:
The first step for creating the Simple Linear Regression model is data pre-processing.
We have already done it earlier in this tutorial. But there will be some changes, which
are given in the below steps:
o First, we will import the three important libraries, which will help us for loading the
dataset, plotting the graphs, and creating the Simple Linear Regression model.
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Next, we will load the dataset into our code:
data_set = pd.read_csv('Salary_Data.csv')
By executing the above line of code (ctrl+ENTER), we can read the dataset on our
Spyder IDE screen by clicking on the variable explorer option.
The above output shows the dataset, which has two variables: Salary and Experience.
Note: In Spyder IDE, the folder containing the code file must be saved as a working
directory, and the dataset or csv file should be in the same folder.
o After that, we need to extract the dependent and independent variables from the given
dataset. The independent variable is years of experience, and the dependent variable
is salary. Below is code for it:
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 1].values
In the above lines of code, for x variable, we have taken -1 value since we want to
remove the last column from the dataset. For y variable, we have taken 1 value as a
parameter, since we want to extract the second column and indexing starts from the
zero.
By executing the above line of code, we will get the output for X and Y variable as:
In the above output image, we can see the X (independent) variable and Y (dependent)
variable has been extracted from the given dataset.
o Next, we will split both variables into the test set and training set. We have 30
observations, so we will take 20 observations for the training set and 10 observations
for the test set. We are splitting our dataset so that we can train our model using a
training dataset and then test the model using a test dataset. The code for this is given
below:
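The split code is not reproduced in the notes; a minimal sketch matching the 20/10 split described above would be:

# Splitting the dataset: 20 observations for training, 10 for testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)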
By executing the above code, we will get x-test, x-train and y-test, y-train dataset.
Consider the below images:
Test-dataset:
Training Dataset:
o For simple linear Regression, we will not use Feature Scaling. Because Python libraries
take care of it for some cases, so we don't need to perform it here. Now, our dataset is
well prepared to work on it and we are going to start building a Simple Linear
Regression model for the given problem.
Now the second step is to fit our model to the training dataset. To do so, we will import
the LinearRegression class of the linear_model library from the scikit learn. After
importing the class, we are going to create an object of the class named as a regressor.
The code for this is given below:
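The notes omit the code here; a minimal sketch would be:

# Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)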
Output:
In the above code, the fit() method fits our Simple Linear Regression object to the
training set, which contains the dependent variable (Salary) and the independent variable
(Experience). So now our model is ready to predict the output for new observations. In
this step, we will provide the test dataset (new observations) to the model to check
whether it can predict the correct output or not.
We will create a prediction vector y_pred, and x_pred, which will contain predictions
of test dataset, and prediction of training set respectively.
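A minimal sketch of this step (the code is not reproduced in the notes):

# Prediction of the test set and training set results
y_pred = regressor.predict(x_test)     # predicted salaries for the test set
x_pred = regressor.predict(x_train)    # predicted salaries for the training set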
On executing the above lines of code, two variables named y_pred and x_pred will
generate in the variable explorer options that contain salary predictions for the training
set and test set.
Output:
You can check the variable by clicking on the variable explorer option in the IDE, and
also compare the result by comparing values from y_pred and y_test. By comparing
these values, we can check how good our model is performing.
Now in this step, we will visualize the training set result. To do so, we will use the
scatter() function of the pyplot library, which we have already imported in the pre-
processing step. The scatter () function will create a scatter plot of observations.
In the x-axis, we will plot the Years of Experience of employees and on the y-axis, salary
of employees. In the function, we will pass the real values of training set, which means
a year of experience x_train, training set of Salaries y_train, and color of the
observations. Here we are taking a green color for the observation, but it can be any
color as per the choice.
Now, we need to plot the regression line, so for this, we will use the plot() function of
the pyplot library. In this function, we will pass the years of experience for training set,
predicted salary for training set x_pred, and color of the line.
Next, we will give the title for the plot. So here, we will use the title() function of
the pyplot library and pass the name "Salary vs Experience (Training Dataset)".
After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel()
function.
Finally, we will represent all above things in a graph using show(). The code is given
below:
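The plotting code is not reproduced; a minimal sketch matching the description above (green observations, red regression line) would be:

# Visualizing the training set results
mtp.scatter(x_train, y_train, color="green")   # real observations
mtp.plot(x_train, x_pred, color="red")         # regression line (predicted training salaries)
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()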
Output:
By executing the above lines of code, we will get the below graph plot as an output.
In the above plot, we can see the real values observations in green dots and predicted
values are covered by the red regression line. The regression line shows a correlation
between the dependent and independent variable.
The good fit of the line can be observed by calculating the difference between actual
values and predicted values. But as we can see in the above plot, most of the
observations are close to the regression line, hence our model is good for the
training set.
In the previous step, we have visualized the performance of our model on the training
set. Now, we will do the same for the Test set. The complete code will remain the same
as the above code, except in this, we will use x_test, and y_test instead of x_train and
y_train.
Here we are also changing the color of observations and regression line to differentiate
between the two plots, but it is optional.
Output:
By executing the above line of code, we will get the output as:
In the above plot, there are observations given by the blue color, and prediction is
given by the red regression line. As we can see, most of the observations are close to
the regression line, hence we can say our Simple Linear Regression is a good model
and able to make good predictions.
Multiple Regression,
Multiple Linear Regression
In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there
may be various cases in which the response variable is affected by more than one
predictor variable; for such cases, the Multiple Linear Regression algorithm is used.
Multiple Linear Regression is one of the important regression algorithms which models the linear
relationship between a single dependent continuous variable and more than one independent
variable.
Example:
Prediction of CO2 emission based on engine size and number of cylinders in a car.
o For MLR, the dependent or target variable(Y) must be the continuous/real, but the
predictor or independent variable may be of continuous or categorical form.
o Each feature variable must model the linear relationship with the dependent variable.
o MLR tries to fit a regression line through a multidimensional space of data-points.
MLR equation:
In Multiple Linear Regression, the target variable(Y) is a linear combination of multiple
predictor variables x1, x2, x3, ...,xn. Since it is an enhancement of Simple Linear
Regression, so the same is applied for the multiple linear regression equation, the
equation becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn    ...... (a)
Where,
Y = Output/response variable
b0, b1, b2, b3, ..., bn = Coefficients of the model
x1, x2, x3, ..., xn = Independent/predictor variables
Problem Description:
Since we need to find the Profit, it is the dependent variable, and the other four
variables are independent variables. Below are the main steps of deploying the MLR
model:
The very first step is data pre-processing, which we have already discussed in this
tutorial. This process contains the below steps:
o Importing libraries: Firstly we will import the library which will help in building the
model. Below is the code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Importing dataset: Now we will import the dataset(50_CompList), which contains all
the variables. Below is the code for it:
# importing the dataset
data_set = pd.read_csv('50_CompList.csv')
In above output, we can clearly see that there are five variables, in which four variables
are continuous and one is categorical variable.
Output:
Out[5]:
As we have one categorical variable (State), which cannot be directly applied to the
model, so we will encode it. To encode the categorical variable into numbers, we will
use the LabelEncoder class. But it is not sufficient because it still has some relational
order, which may create a wrong model. So in order to remove this problem, we will
use OneHotEncoder, which will create the dummy variables. Below is code for it:
# categorical data encoding
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x = LabelEncoder()
x[:, 3] = labelencoder_x.fit_transform(x[:, 3])
onehotencoder = OneHotEncoder(categorical_features=[3])
x = onehotencoder.fit_transform(x).toarray()
# note: categorical_features was removed in newer scikit-learn releases;
# recent versions use ColumnTransformer with OneHotEncoder instead
Here we are only encoding one independent variable, which is state as other variables
are continuous.
Output:
As we can see in the above output, the state column has been converted into dummy
variables (0 and 1). Here each dummy variable column is corresponding to the one
State. We can check by comparing it with the original dataset. The first column
corresponds to the California State, the second column corresponds to the Florida
State, and the third column corresponds to the New York State.
Note: We should not use all the dummy variables at the same time; the number used must
be one less than the total number of dummy variables, otherwise it will create the dummy
variable trap.
o Now, we are writing a single line of code just to avoid the dummy variable trap:
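The single line itself is not reproduced in the notes; a sketch of the usual approach, dropping the first dummy column, would be:

# avoiding the dummy variable trap by dropping the first dummy column
x = x[:, 1:]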
If we do not remove the first dummy variable, then it may introduce multicollinearity
in the model.
As we can see in the above output image, the first column has been removed.
o Now we will split the dataset into training and test set. The code for this is given below:
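The split code is not reproduced in the notes; a minimal sketch (the 80/20 split fraction and random_state here are assumptions) would be:

# Splitting the dataset into training set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)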
Output: The above code will split the dataset into a training set and a test set. You can
check the output by clicking on the variable explorer option given in Spyder IDE. The
test set and training set will look like the below images:
Test set:
Training set:
Note: In MLR, we will not do feature scaling as it is taken care by the library, so we
don't need to do it manually.
Step: 2- Fitting our MLR model to the Training set:
Now, we have well prepared our dataset in order to provide training, which means we
will fit our regression model to the training set. It will be similar to as we did in Simple
Linear Regression model. The code for this will be:
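The code is not reproduced in the notes; a minimal sketch of fitting the regressor and then predicting the test set (used in the next step) would be:

# Fitting the Multiple Linear Regression model to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Predicting the test set result
y_pred = regressor.predict(x_test)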
Output:
By executing the above lines of code, a new vector will be generated under the variable
explorer option. We can test our model by comparing the predicted values and test
set values.
Output:
In the above output, we have the predicted result set and the test set. We can check model
performance by comparing these two values index by index. For example, the first index
has a predicted value of $103,015 profit and a test/real value of $103,282 profit. The
difference is only $267, which is a good prediction, so our model is essentially
complete here.
o We can also check the score for training dataset and test dataset. Below is the code for
it:
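A minimal sketch of this check (the code is not shown in the notes):

# R^2 score of the model on the training data and on the test data
print('Train Score:', regressor.score(x_train, y_train))
print('Test Score:', regressor.score(x_test, y_test))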
The above score tells that our model is 95% accurate with the training dataset
and 93% accurate with the test dataset.
Note: In the next topic, we will see how we can improve the performance of the
model using the Backward Elimination process.
Applications of Multiple Linear Regression:
There are mainly two applications of Multiple Linear Regression: measuring how effective
the independent variables are at predicting the dependent variable, and predicting the
impact of changes in the independent variables.
Digression,
We introduce a novel measure, called digression, to assess the
value-disclosure risk in constructing regression trees for data
partitioning; specifically, an algorithm is developed that uses the
measure for pruning the tree to limit disclosure of sensitive data.
Logistic Regression
UNIT V
Clustering: The Idea,
The Model,
Choosing k,
Bottom-Up Hierarchical Clustering.
Recommender Systems: Manual Curation,
Recommending What’s Popular,
User-Based
Collaborative Filtering,
Item-Based Collaborative Filtering,
Matrix Factorization.
Data Ethics:
Building Bad Data Products,
Trading Off Accuracy and Fairness,
Collaboration,
Interpretability,
Recommendations,
Biased Data,
Data Protection.
IPython,
Mathematics,
NumPy,
pandas,
scikit-learn,
Visualization, R