
Data science

UNIT IV
Machine Learning: Modeling, Overfitting and Underfitting, Correctness, The Bias-Variance Tradeoff, Feature Extraction and Selection, k-Nearest Neighbors, Naive Bayes, Simple Linear Regression, Multiple Regression, Digression, Logistic Regression
UNIT V
Clustering: The Idea, The Model, Choosing k, Bottom-Up Hierarchical Clustering. Recommender Systems: Manual Curation, Recommending What's Popular, User-Based Collaborative Filtering, Item-Based Collaborative Filtering, Matrix Factorization. Data Ethics: Building Bad Data Products, Trading Off Accuracy and Fairness, Collaboration, Interpretability, Recommendations, Biased Data, Data Protection. IPython, Mathematics, NumPy, pandas, scikit-learn, Visualization, R

UNIT IV
Machine Learning:
Modeling,
A machine learning model is a program that can find patterns or
make decisions from a previously unseen dataset. For example, in
natural language processing, machine learning models can parse
and correctly recognize the intent behind previously unheard
sentences or combinations of words.

Machine Learning Models


A machine learning model is defined as a mathematical representation of the output of the training process. Machine learning is the study of algorithms that improve automatically through experience and historical data and that build the model. A machine learning model is similar to computer software designed to recognize patterns or behaviors based on previous experience or data. The learning algorithm discovers patterns within the training data and outputs an ML model that captures these patterns and makes predictions on new data.
Let's understand an example of an ML model by creating an app that recognizes a user's emotions based on facial expressions. Such an app can be built with machine learning models: we train a model by feeding it images of faces labeled with various emotions. Whenever the app is used to determine the user's mood, it compares the user's expression against the patterns learned from the labeled data and determines the mood.

Hence, in simple words, we can say that a machine learning model is a simplified
representation of something or a process. In this topic, we will discuss different machine
learning models and their techniques and algorithms.

What is Machine Learning Model?


Machine learning models can be understood as programs that have been trained to find patterns within new data and make predictions. These models are represented as a mathematical function that takes requests in the form of input data, makes predictions on that input data, and then provides an output in response. First, these models are trained over a set of data, applying an algorithm to reason over the data, extract patterns from it, and learn from it. Once these models are trained, they can be used to make predictions on unseen datasets.

There are various types of machine learning models available based on different
business goals and data sets.
Classification of Machine Learning Models:
Based on different business goals and data sets, there are three learning models for algorithms. Each machine learning algorithm falls into one of the three models:

o Supervised Learning
o Unsupervised Learning
o Reinforcement Learning

Supervised Learning is further divided into two categories:

o Classification
o Regression

Unsupervised Learning is also divided into below categories:

o Clustering
o Association Rule
o Dimensionality Reduction

1. Supervised Machine Learning Models


Supervised learning is the simplest machine learning model to understand: the input data is called training data and has a known label or result as an output. So, it works on the principle of input-output pairs. It requires creating a function that is trained using a training data set and is then applied to unknown data, where it is expected to achieve some predictive performance. Supervised learning is task-based and is tested on labeled data sets.
We can apply a supervised learning model to simple real-life problems. For example, if we have a dataset consisting of ages and heights, we can build a supervised learning model to predict a person's height based on their age.

Supervised Learning models are further classified into two categories:

Regression
In regression problems, the output is a continuous variable. Some commonly used
Regression models are as follows:

a) Linear Regression

Linear regression is the simplest machine learning model, in which we try to predict one output variable using one or more input variables. The representation of linear regression is a linear equation, which maps a set of input values (x) to a predicted output (y) for those input values. It is represented in the form of a line:

y = bx + c,

where b is the slope (the coefficient of the input) and c is the intercept.

The main aim of the linear regression model is to find the line that best fits the data points.

Linear regression extends to multiple linear regression (finding a plane of best fit) and polynomial regression (finding the best-fit curve).
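As a minimal sketch (assuming scikit-learn is available; the toy data below is made up for illustration), fitting a line and reading off b and c could look like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data roughly following y = 2x + 1 with some noise (hypothetical values)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # slope b and intercept c
print(model.predict([[6]]))              # prediction for a new input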

b) Decision Tree
Decision trees are the popular machine learning models that can be used for both
regression and classification problems.

A decision tree uses a tree-like structure of decisions along with their possible consequences and outcomes. Each internal node represents a test on an attribute, and each branch represents an outcome of that test. The more nodes a decision tree has, the more closely it can fit the training data, although very deep trees tend to overfit.

The advantage of decision trees is that they are intuitive and easy to implement, but a single tree often lacks accuracy compared with ensemble methods.

Decision trees are widely used in operations research, specifically in decision


analysis, strategic planning, and mainly in machine learning.

c) Random Forest

Random Forest is the ensemble learning method, which consists of a large number of
decision trees. Each decision tree in a random forest predicts an outcome, and the
prediction with the majority of votes is considered as the outcome.

A random forest model can be used for both regression and classification problems.

For the classification task, the outcome of the random forest is taken from the majority
of votes. Whereas in the regression task, the outcome is taken from the mean or
average of the predictions generated by each tree.

d) Neural Networks

Neural networks are the subset of machine learning and are also known as artificial
neural networks. Neural networks are made up of artificial neurons and designed in a
way that resembles the human brain structure and working. Each artificial neuron
connects with many other neurons in a neural network, and such millions of connected
neurons create a sophisticated cognitive structure.
Neural networks consist of a multilayer structure, containing one input layer, one or more hidden layers, and one output layer. As each neuron is connected with other neurons, data is transferred from one layer to the neurons of the next layer. Finally, data reaches the last layer (the output layer) of the neural network and an output is generated.

Neural networks depend on training data to learn and improve their accuracy.
However, a perfectly trained & accurate neural network can cluster data quickly and
become a powerful machine learning and AI tool. One of the best-known neural
networks is Google's search algorithm.

Classification
Classification models are the second type of supervised learning technique; they are used to draw conclusions from observed values in categorical form. For example, a classification model can identify whether an email is spam or not, or whether a buyer will purchase a product or not. Classification algorithms categorize the output into different groups or classes.

In classification, a classifier model is designed that classifies the dataset into different categories, and each category is assigned a label.

There are two types of classifications in machine learning:


o Binary classification: If the problem has only two possible classes, it is called binary classification. For example, cat or dog, Yes or No.
o Multi-class classification: If the problem has more than two possible classes, it is a multi-class classification problem.

Some popular classification algorithms are as below:

a) Logistic Regression

Logistic regression is used to solve classification problems in machine learning. It is similar to linear regression but is used to predict categorical variables. It can predict the output as Yes or No, 0 or 1, True or False, etc. However, rather than giving exact values, it provides probabilistic values between 0 and 1.
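A minimal sketch with scikit-learn (the generated data is hypothetical) showing both the hard class labels and the probabilities between 0 and 1:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# hypothetical toy binary-classification data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:3]))        # hard 0/1 labels
print(clf.predict_proba(X[:3]))  # class probabilities between 0 and 1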

b) Support Vector Machine

A support vector machine, or SVM, is a popular machine learning algorithm that is widely used for classification and regression tasks; however, it is primarily used to solve classification problems. The main aim of SVM is to find the best decision boundary in an N-dimensional space that can segregate the data points into classes; this best decision boundary is known as the hyperplane. SVM selects the extreme points/vectors that help define the hyperplane, and these vectors are known as support vectors.

c) Naïve Bayes
Naïve Bayes is another popular classification algorithm used in machine learning. It is called so because it is based on Bayes' theorem and follows the naïve (independence) assumption between the features, which is given as:

P(y | x1, ..., xn) ∝ P(y) * P(x1 | y) * ... * P(xn | y)

Each naïve Bayes classifier assumes that the value of a specific variable is independent of any other variable/feature. For example, suppose a fruit needs to be classified based on color, shape, and taste. Then yellow, oval, and sweet will be recognized as a mango; here each feature is treated as independent of the other features.
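A minimal sketch using scikit-learn's GaussianNB (the Iris dataset stands in for the fruit example; the split and settings are illustrative):

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, random_state=0)

# each feature is modeled independently given the class (the naive assumption)
nb = GaussianNB().fit(X_tr, y_tr)
print(nb.score(X_ts, y_ts))  # mean accuracy on the held-out split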

2. Unsupervised Machine learning models


Unsupervised Machine learning models implement the learning process opposite to
supervised learning, which means it enables the model to learn from the unlabeled
training dataset. Based on the unlabeled dataset, the model predicts the output. Using
unsupervised learning, the model learns hidden patterns from the dataset by itself
without any supervision.

Unsupervised learning models are mainly used to perform three tasks, which are as
follows:

o Clustering
Clustering is an unsupervised learning technique that involves grouping the data points into different clusters based on similarities and differences. The objects with the most similarities remain in the same group and have few or no similarities with objects in other groups.
Clustering algorithms are widely used in tasks such as image segmentation, statistical data analysis, market segmentation, etc.
Some commonly used clustering algorithms are K-means clustering, hierarchical clustering, DBSCAN, etc.

o Association Rule Learning


Association rule learning is an unsupervised learning technique, which finds interesting
relations among variables within a large dataset. The main aim of this learning
algorithm is to find the dependency of one data item on another data item and map
those variables accordingly so that it can generate maximum profit. This algorithm is
mainly applied in Market Basket analysis, Web usage mining, continuous
production, etc.
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat,
FP-growth algorithm.
o Dimensionality Reduction
The number of features/variables present in a dataset is known as the dimensionality
of the dataset, and the technique used to reduce the dimensionality is known as the
dimensionality reduction technique.
Although more data provides more accurate results, it can also affect the performance
of the model/algorithm, such as overfitting issues. In such cases, dimensionality
reduction techniques are used.
"It is a process of converting the higher dimensions dataset into lesser dimensions
dataset ensuring that it provides similar information."
Different dimensionality reduction methods such as PCA(Principal Component
Analysis), Singular Value Decomposition, etc.
Reinforcement Learning
In reinforcement learning, the algorithm learns actions for a given set of states that
lead to a goal state. It is a feedback-based learning model that takes feedback signals
after each state or action by interacting with the environment. This feedback works as
a reward (positive for each good action and negative for each bad action), and the
agent's goal is to maximize the positive rewards to improve their performance.

The behavior of the model in reinforcement learning is similar to human learning, as


humans learn things by experiences as feedback and interact with the environment.

Below are some popular algorithms that come under reinforcement learning:

o Q-learning: Q-learning is one of the popular model-free algorithms of reinforcement learning, which is based on the Bellman equation.

It aims to learn the policy that can help the AI agent take the best action to maximize the reward under a specific circumstance. It maintains a Q-value for each state-action pair that indicates the expected reward for following a given state path, and it tries to maximize the Q-value (a small sketch of the update rule follows this list).

o State-Action-Reward-State-Action (SARSA): SARSA is an On-policy algorithm


based on the Markov decision process. It uses the action performed by the current
policy to learn the Q-value. The SARSA algorithm stands for State Action Reward
State Action, which symbolizes the tuple (s, a, r, s', a').
o Deep Q Network: DQN, or Deep Q Network, is Q-learning with a neural network. It is mainly employed in environments with a big state space, where defining a Q-table would be a complex task. In such a case, rather than using a Q-table, a neural network approximates the Q-values for each action based on the state.
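A minimal sketch of the Bellman-equation-based Q-learning update (the tiny state/action space and the constants here are made-up placeholders):

import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9          # learning rate and discount factor (illustrative)
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    # move Q(s, a) toward the observed reward plus the discounted
    # best value of the next state
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=2)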

Training Machine Learning Models


Once the Machine learning model is built, it is trained in order to get the appropriate
results. To train a machine learning model, one needs a huge amount of pre-processed
data. Here pre-processed data means data in structured form with reduced null values,
etc. If we do not provide pre-processed data, then there are huge chances that our
model may perform terribly.
How to choose the best model?
In the above section, we discussed different machine learning models and algorithms. But one of the most confusing questions for any beginner is: "which model should I choose?". The answer is that it depends mainly on the business or project requirement. Apart from this, it also depends on the associated attributes, the volume of the available dataset, the number of features, complexity, etc. In practice, it is recommended to always start with the simplest model that can be applied to the particular problem and then gradually increase the complexity, testing the accuracy with the help of parameter tuning and cross-validation.

Difference between Machine learning model and


Algorithms
One of the most confusing questions among beginners is whether machine learning models and algorithms are the same, because in machine learning and data science these two terms are often used interchangeably.

The answer to this question is no: a machine learning model is not the same as an algorithm. Put simply, an ML algorithm is like a procedure or method that runs on data to discover patterns from it and generate the model, while a machine learning model is like a computer program that generates output or makes predictions. More specifically, when we train an algorithm with data, it becomes a model.

Overfitting and Underfitting,


Overfitting and Underfitting in Machine
Learning
Overfitting and Underfitting are the two main problems that occur in machine learning
and degrade the performance of the machine learning models.

The main goal of each machine learning model is to generalize well.


Here generalization refers to the ability of an ML model to provide a suitable output for a given set of unseen inputs. It means that after being trained on the dataset, the model can produce reliable and accurate output. Hence, underfitting and overfitting are the two terms that need to be checked to judge the performance of the model and whether the model is generalizing well or not.

Before understanding overfitting and underfitting, let's understand some basic terms that will help to understand this topic well:

o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the
model.
o Bias: Bias is a prediction error that is introduced in the model due to oversimplifying
the machine learning algorithms. Or it is the difference between the predicted values
and the actual values.
o Variance: If the machine learning model performs well with the training dataset, but
does not perform well with the test dataset, then variance occurs.

Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. The overfitted model has low bias and high variance.

The chances of overfitting increase the more we train our model: the longer we train, the greater the chance of ending up with an overfitted model.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of overfitting can be understood from the graph of linear regression output below: the model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not. Because the goal of the regression model is to find the best-fit line, and here we have not found one, the model will generate prediction errors.

How to avoid Overfitting in a Model

Both overfitting and underfitting degrade the performance of a machine learning model, but overfitting is the more common problem, and there are several ways to reduce its occurrence in our model:

o Cross-Validation (see the sketch after this list)
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
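As a minimal sketch of the first item, 5-fold cross-validation with scikit-learn (the dataset and estimator are chosen only for illustration):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# each fold is held out once for testing, so the averaged score reflects
# performance on data the model was not trained on
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean())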

Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, but then the model may not learn enough from the training data. As a result, it may fail to find the best fit for the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training
data, and hence it reduces the accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.

Example: We can understand underfitting using the output of the linear regression model below: the model is unable to capture the data points present in the plot.

How to avoid underfitting:


o By increasing the training time of the model.
o By increasing the number of features.

Goodness of Fit
The "goodness of fit" term is taken from statistics, and the goal of machine learning models is to achieve a good fit. In statistical modeling, it defines how closely the predicted values match the true values of the dataset.

The model with a good fit lies between the underfitted and overfitted models; ideally, it makes predictions with zero error, but in practice this is difficult to achieve.

As we train our model, the errors on the training data go down, and initially the same happens with the test data. But if we train the model for too long, its performance may decrease due to overfitting, as the model also learns the noise present in the dataset. The errors on the test dataset then start increasing, so the point just before the errors start rising is the sweet spot, and we can stop training there to achieve a good model.

There are two other methods by which we can get a good point for our model, which are
the resampling method to estimate model accuracy and validation dataset.

Correctness,
Accuracy is a metric that measures how often a machine learning model correctly predicts the outcome. You can calculate accuracy by dividing the number of correct predictions by the total number of predictions. In other words, accuracy answers the question: how often is the model right?

How to Check the Accuracy of your Machine Learning Model
Accuracy is a well-known validation metric for evaluating classification models in machine learning. Its relative simplicity is the main reason for its popularity: accuracy is easy to understand and simple to implement, and with it we can easily assess the performance of a model.

In real-life situations, modelling problems are rarely that easy. We often work with datasets that are imbalanced or that involve multiclass or multilabel classification. When solving a problem with machine learning, a high accuracy is not always the main goal. As you solve more problems in ML, computing and using accuracy appropriately becomes more subtle and requires additional thought.

What accuracy actually is, why it is important, and how to calculate it are all covered in this topic.

Accuracy:
Accuracy is mainly used in classification problems to measure what percentage of a model's predictions are correct. When we run a model on a classification problem, accuracy tells us how many of the predictions are correct with respect to the total number of predictions made by the model. We can calculate the accuracy of any model by dividing the number of correctly predicted samples by the total number of predictions made:

Accuracy = Number of correct predictions / Total number of predictions

This formula is very useful for calculating the accuracy of any model, and it provides a simple understanding of a binary classification problem.
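As a minimal sketch (hypothetical labels, using scikit-learn's accuracy_score):

from sklearn.metrics import accuracy_score

# hypothetical true and predicted labels; 5 of the 6 predictions match
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(accuracy_score(y_true, y_pred))  # 0.8333...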

Accuracy Paradox:
When we evaluate a model, accuracy by default gives a single overall metric of performance across the whole dataset.

However, the overall accuracy of a classification model can be misleading when the distribution of samples across the classes is imbalanced. In this situation, the majority class will mostly be predicted correctly and drive a high accuracy score, whereas the minority class will be misclassified. When this happens, the reported accuracy is deceptive, and we cannot judge the performance of the model correctly from it.

For example, in a health prediction model, we cannot afford to miss any harmful disease cases. If the rare harmful cases are misclassified, the model's overall accuracy can still look high while patients receive the wrong diagnosis and, consequently, the wrong treatment.

Now let us take an example that uses a breast cancer dataset, which is used to classify breast tumor cases:
Worst area | Worst smoothness | Worst compactness | Worst concavity | Worst concave points | Worst symmetry | Worst fractal dimension
545.5 | 0.2151 | 0.1894 | 0.1859 | 0.0854 | 0.3215 | 0…
790.6 | 0.1254 | 0.5445 | 0.4584 | 0.1546 | 0.3154 | 0…
1562 | 0.1452 | 0.4152 | 0.6541 | 0.2745 | 0.3341 | 0…
625.2 | 0.1365 | 0.1854 | 0.1457 | 0.0965 | 0.3487 | 0…
850.2 | 0.1854 | 0.5410 | 0.4754 | 0.1716 | 0.3645 | 0…

Before executing the model, we make the data imbalanced by removing most of the harmful cases, so that the harmful (Bad) class makes up only about 5.5% of the dataset:

tumour_data_imbalanced["labels"].value_counts(normalize = True)

Output:

Good 0.854698
Bad 0.0548726
Name: labels, dtype: float64

Let us now see how the accuracy of the model is computed:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=rand_seed)
# get_pred_res is a helper assumed to fit the model and return a DataFrame
# with a boolean column marking whether each prediction was correct
pred_res = get_pred_res(X_tr, Y_tr, Y_ts, model)
pred_res["The prediction success is:"].sum() / pred_res["The prediction success is:"].count()

Output:

0.9854126

Our model achieved an overall accuracy of ~0.985 for the entire dataset. This result looks, by all accounts, strikingly good. However, if we investigate the class-level predictions using a confusion matrix, we get a totally different picture: our model misdiagnosed practically all harmful cases. The outcome is the very opposite of what we expected based on the overall accuracy metric. The situation is a typical illustration of the accuracy paradox: while you achieve a high accuracy value, it gives you false confidence, because your dataset is highly imbalanced and misprediction of the minority class is costly.

In such circumstances, you are trying to predict rare but critical events with systemic consequences. Examples are serious clinical diseases, financial crises, terrorist attacks, meteors, and so forth.

It does not matter if your model achieves 99.99% accuracy if missing a single case is enough to sabotage the whole system. Relying on the accuracy score as calculated above is not sufficient and can be misleading. Other useful metrics include:
o Precision: Precision is defined as the proportion of correct positive predictions among all the positive predictions made for a class.
o Recall: Recall is defined as the proportion of the actual samples of a class that are correctly predicted.
o F-score: F-score is defined as a metric that combines precision and recall.
o Confusion matrix: A confusion matrix is defined as a tabular summary of true/false positive/negative prediction counts.
o ROC curve: The ROC curve is a diagnostic plot for binary classification.

Accuracy in Binary Classification:

In the binary classification case, we can express accuracy in terms of True/False Positive/Negative values. The accuracy formula in machine learning is given as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

In binary classification there are only two classes, positive and negative:

o TP: TP stands for True Positive, i.e., samples that are actually positive and are correctly predicted as positive.
o FP: FP stands for False Positive, i.e., samples that are actually negative but are falsely predicted as positive.
o TN: TN stands for True Negative, i.e., samples that are actually negative and are correctly predicted as negative.
o FN: FN stands for False Negative, i.e., samples that are actually positive but are falsely predicted as negative.

Accuracy in Multiclass Problems:

In a multiclass classification problem, we can use the same definition as in the binary classification problem. But in the multiclass case, we cannot directly express the outcome with true/false definitions. So, in multiclass problems, we use the following formula to calculate the accuracy of the model:

Accuracy = (1 / N) * Σ [[yi == zi]]

The terms used in the above formula are:

o N is the number of samples.
o [[..]] are brackets that return 1 when the given expression is true and 0 otherwise.
o yi is the true label and zi is the predicted label of sample i.
Example: Let us take a confusion matrix with some true values and predict the accuracy for three classes of the dataset. We select the number of correct predictions (the diagonal of the table) and then find the accuracy of the model using the formula given above:

Multiclass Accuracy = (7 + 5 + 3) / 31 ≈ 0.483

This result says that our model achieved 48% accuracy on this multiclass classification problem.
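A minimal sketch of the same computation (the confusion matrix values are hypothetical, chosen only to match the totals above):

import numpy as np

# rows are true classes, columns are predicted classes;
# the diagonal entries are the correct predictions
cm = np.array([[7, 2, 1],
               [3, 5, 2],
               [4, 4, 3]])

accuracy = np.trace(cm) / cm.sum()  # (7 + 5 + 3) / 31
print(round(accuracy, 3))           # 0.484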
Accuracy in Multilabel Problems:
In multilabel classification problems, each sample can carry several labels at once, so the classes are not mutually exclusive. This is different from multiclass classification, where the classes are mutually exclusive and each sample belongs to exactly one class. In machine learning, we can represent multilabel classification as multiple binary classification problems.

The multilabel accuracy is also known as the Hamming score. In multilabel classification, the accuracy is calculated from the correctly predicted labels and the number of active labels:

Hamming score = (1 / N) * Σ (|Yi ∩ Zi| / |Yi ∪ Zi|)

The terms used in the above formula are:

o N is the number of samples.
o Yi is the set of true labels and Zi the set of predicted labels of sample i.

Multilabel accuracy gives a more balanced metric since it does not depend on an 'exact match' rule (like subset accuracy). Nor does it count 'True Negative' values as 'correct' (as a naive accuracy count would).

Example:

def ham_score(y_ts, pred):
    # mean Jaccard similarity between the true and predicted label sets
    return ((y_ts & pred).sum(axis = 1) / (y_ts | pred).sum(axis = 1)).sum() / pred.shape[0]

ham_score(y_ts, pred)

Output:

0.45827400

If the hamming score we get after the calculation is closer to one, then the
performance of the model will be good.

Hamming Loss:
Hamming loss measures the fraction of labels that are predicted incorrectly. It takes values in the range of 0 to 1, where 0 represents no errors in the model:

Hamming loss = (1 / (N * K)) * Σ Σ [[yij != zij]]

The terms used in the above formula are:

o N is the number of samples.
o K is the number of labels.
o yij and zij are the true and predicted values of label j for sample i.

def ham_loss(y_ts, pred):
    # fraction of individual label predictions that are wrong
    return (y_ts != pred).sum().sum() / y_ts.size

ham_loss(y_ts, pred)

Output:

0.03501485

If the hamming loss we get after the calculation is closer to zero, then the performance of the model is good.
Besides these metrics, you can use the multilabel versions of the same classification metrics you have seen in the binary and multiclass cases (e.g., precision, recall, F-score). You can also apply averaging techniques (micro, macro, and sample-based) or ranking-based metrics.

Subset Accuracy or Exact Match Ratio:

Subset accuracy is also known as the exact match ratio or label-set accuracy. It is a strict way of calculating the accuracy of the model: a prediction counts as correct only if all the labels for the given sample are matched exactly:

Subset accuracy = (1 / N) * Σ [[Yi == Zi]]

The terms used in the above formula are:

o N is the number of samples.
o [[..]] are brackets that return 1 when the given expression is true and 0 otherwise.
o Yi is the set of true labels and Zi the set of predicted labels of sample i.

In machine learning, we often work with a large number of labels and large datasets, and predicting all the labels of a sample correctly can take considerable work. The subset accuracy technique is easy to compute compared with the others, but it typically reports lower performance for the model than the other techniques.

This metric gives no credit for partially correct predictions because of the strict rule it relies on. If our model fails to predict just a single label out of 103 but performs well on the rest, subset accuracy still classifies such predictions as failures.

When to use Accuracy Score in Machine Learning:

The accuracy score should be used when you want to know the overall ability of a model to classify data points into the correct classes, regardless of the prediction performance per class or per label. It gives you an intuition for whether the given dataset is suitable for classification or not.

If you need to use the accuracy metric in your project or work, there are easy-to-use packages, such as deepchecks, that generate in-depth reports on the important metrics for evaluating your model. This makes it easier for you to better understand your model's performance.
The Bias-Variance Tradeoff,

In statistics and machine learning, the bias–variance


tradeoff describes the relationship between a model's complexity,
the accuracy of its predictions, and how well it can make
predictions on previously unseen data that were not used to train
the model.

Bias and Variance in Machine Learning


Machine learning is a branch of Artificial Intelligence, which allows machines to perform data analysis and make predictions. However, if the machine learning model is not accurate, it makes prediction errors, and these prediction errors are usually known as bias and variance. In machine learning, these errors will always be present, as there is always a slight difference between the model's predictions and the actual values. The main aim of ML/data science analysts is to reduce these errors in order to get more accurate results. In this topic, we are going to discuss bias and variance, the bias-variance trade-off, and underfitting and overfitting. But before starting, let's first understand what errors in machine learning are.
Errors in Machine Learning
In machine learning, an error is a measure of how accurately an algorithm can make
predictions for the previously unknown dataset. On the basis of these errors, the
machine learning model is selected that can perform best on the particular dataset.
There are mainly two types of errors in machine learning, which are:

o Reducible errors: These errors can be reduced to improve the model accuracy. Such
errors can further be classified into bias and Variance.

o Irreducible errors: These errors will always be present in the model regardless of which algorithm has been used. They are caused by unknown variables whose effect on the output cannot be reduced.

What is Bias?
In general, a machine learning model analyses the data, finds patterns in it, and makes predictions. While training, the model learns these patterns in the dataset and applies them to test data for prediction. While making predictions, a difference occurs between the values predicted by the model and the actual/expected values, and this difference is known as bias error, or error due to bias. It can be defined as the inability of a machine learning algorithm such as linear regression to capture the true relationship between the data points. Every algorithm begins with some amount of bias, because bias arises from the assumptions in the model that make the target function simpler to learn. A model has either:

o Low Bias: A low bias model will make fewer assumptions about the form of the target
function.
o High Bias: A model with a high bias makes more assumptions, and the model becomes
unable to capture the important features of our dataset. A high bias model also
cannot perform well on new data.

Generally, a linear algorithm has high bias, which makes it learn fast. The simpler the algorithm, the more bias is likely to be introduced. A nonlinear algorithm, by contrast, often has low bias.

Some examples of machine learning algorithms with low bias are Decision Trees, k-
Nearest Neighbours and Support Vector Machines. At the same time, an algorithm
with high bias is Linear Regression, Linear Discriminant Analysis and Logistic
Regression.

Ways to reduce High Bias:


High bias mainly occurs when the model is too simple. Below are some ways to reduce high bias:

o Increase the input features as the model is underfitted.


o Decrease the regularization term.
o Use more complex models, such as including some polynomial features.

What is a Variance Error?


Variance specifies how much the prediction of the target function would change if different training data were used. In simple words, variance tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between input and output variables. Variance errors are either low variance or high variance.

Low variance means there is a small variation in the prediction of the target function
with changes in the training data set. At the same time, High variance shows a large
variation in the prediction of the target function with changes in the training dataset.

A model that shows high variance learns a lot and performs well with the training dataset but does not generalize well to unseen data. As a result, such a model gives good results with the training dataset but shows high error rates on the test dataset.

Since, with high variance, the model learns too much from the dataset, it leads to overfitting of the model. A model with high variance has the below problems:
o A high variance model leads to overfitting.
o It increases model complexity.

Usually, nonlinear algorithms, which have a lot of flexibility to fit the model, have high variance.

Some examples of machine learning algorithms with low variance are, Linear
Regression, Logistic Regression, and Linear discriminant analysis. At the same
time, algorithms with high variance are decision tree, Support Vector Machine, and
K-nearest neighbours.

Ways to Reduce High Variance:


o Reduce the input features or the number of parameters, as the model is overfitted.
o Do not use an overly complex model.
o Increase the training data.
o Increase the regularization term.

Different Combinations of Bias-Variance


There are four possible combinations of bias and variances, which are represented by
the below diagram:
1. Low-Bias, Low-Variance:
The combination of low bias and low variance shows an ideal machine learning model.
However, it is not possible practically.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns a large number of parameters and hence leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses a small number of parameters. It leads to underfitting problems in the model.
4. High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and also inaccurate on
average.

How to identify High variance or High Bias?


High variance can be identified if the model has:
o Low training error and high test error.

High Bias can be identified if the model has:

o High training error and the test error is almost similar to training error.

Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias
and variance in order to avoid overfitting and underfitting in the model. If the model
is very simple with fewer parameters, it may have low variance and high bias. Whereas,
if the model has a large number of parameters, it will have high variance and low bias.
So, it is required to make a balance between bias and variance errors, and this balance
between the bias error and variance error is known as the Bias-Variance trade-off.
For an accurate prediction of the model, algorithms need a low variance and low bias.
But this is not possible because bias and variance are related to each other:

o If we decrease the variance, it will increase the bias.


o If we decrease the bias, it will increase the variance.

The bias-variance trade-off is a central issue in supervised learning. Ideally, we want a model that accurately captures the regularities in the training data and simultaneously generalizes well to the unseen dataset. Unfortunately, doing both simultaneously is not possible: a high-variance algorithm may perform well on training data but may overfit to noisy data, whereas a high-bias algorithm generates a much simpler model that may not even capture important regularities in the data. So, we need to find a sweet spot between bias and variance to make an optimal model.

Hence, the Bias-Variance trade-off is about finding the sweet spot to make a
balance between bias and variance errors.
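A minimal sketch of this sweet spot, comparing polynomial models of different complexity with cross-validation (the synthetic data and the chosen degrees are illustrative):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=60)

# degree 1 tends to underfit (high bias), degree 15 tends to overfit
# (high variance); a moderate degree sits near the sweet spot
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    print(degree, cross_val_score(model, X, y, cv=5).mean())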

Feature Extraction and Selection,
Feature selection and feature extraction are both techniques used in machine
learning to improve the performance of models. However, they approach this goal in
fundamentally different ways:

Feature Selection:

 What it does: Selects a subset of the original features that are most relevant to the
prediction task. It's like picking the most informative ingredients from a recipe.
 Benefits: Reduces training time and computational cost by focusing on the most
important features. Improves model interpretability by highlighting the key factors
influencing predictions. Can help to avoid overfitting by reducing the dimensionality
of the data.
 Challenges: Choosing the right selection method depends on the data and the
problem. Might discard potentially useful information if the selection process isn't
careful.

Feature Extraction:

 What it does: Creates entirely new features from the original data. It's like
combining ingredients in a new way to create a more flavorful dish.
 Benefits: Can capture hidden patterns or relationships within the data that might not
be evident in the original features. May lead to more accurate and robust models by
representing the data in a more informative way.
 Challenges: Designing effective feature extraction methods requires domain
knowledge and expertise. Can increase the dimensionality of the data, potentially
leading to overfitting if not addressed.
Here's a table summarizing the key differences:

Table: The differences between Feature Selection and Feature Extraction

Aspect | Feature Selection | Feature Extraction
What it does | Keeps a subset of the original features | Creates new features from the original data
Interpretability | High: the original features are preserved | Lower: new features can be harder to interpret
Typical benefit | Less training time, less overfitting | Captures hidden patterns and relationships

Choosing the right technique:

The choice between feature selection and feature extraction depends on the specific
problem and data:

 If interpretability and reducing complexity are priorities, feature selection might be a


good choice.
 If capturing hidden patterns and potentially improving accuracy is more important,
feature extraction could be beneficial.
 In some cases, a combination of both techniques might be used for optimal results.

It's important to evaluate the impact of both approaches on your machine learning
model's performance to determine the best course of action.
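A minimal sketch contrasting the two approaches with scikit-learn (the dataset and the choice of k/components are illustrative):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# feature selection: keep the 2 original features most related to the target
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# feature extraction: build 2 brand-new features (principal components)
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # (150, 2) (150, 2)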

Feature Selection Techniques in Machine


Learning
Feature selection is a way of selecting the subset of the most relevant features from the original
features set by removing the redundant, irrelevant, or noisy features.
While developing the machine learning model, only a few variables in the dataset are
useful for building the model, and the rest features are either redundant or irrelevant.
If we input the dataset with all these redundant and irrelevant features, it may
negatively impact and reduce the overall performance and accuracy of the model.
Hence it is very important to identify and select the most appropriate features from
the data and remove the irrelevant or less important features, which is done with the
help of feature selection in machine learning.

Feature selection is one of the important concepts of machine learning, which highly
impacts the performance of the model. As machine learning works on the concept of
"Garbage In Garbage Out", so we always need to input the most appropriate and
relevant dataset to the model in order to get a better result.

In this topic, we will discuss different feature selection techniques for machine learning.
But before that, let's first understand some basics of feature selection.

o What is Feature Selection?


o Need for Feature Selection
o Feature Selection Methods/Techniques
o Feature Selection statistics

What is Feature Selection?


A feature is an attribute that has an impact on a problem or is useful for the problem, and choosing the important features for the model is known as feature selection. Each machine learning process depends on feature engineering, which mainly consists of two processes: feature selection and feature extraction. Although feature selection and extraction may have the same objective, they are completely different from each other. The main difference is that feature selection is about selecting a subset of the original feature set, whereas feature extraction creates new features. Feature selection is a way of reducing the input variables for the model by using only relevant data, in order to reduce overfitting in the model.

So, we can define feature Selection as, "It is a process of automatically or manually
selecting the subset of most appropriate and relevant features to be used in model
building." Feature selection is performed by either including the important features
or excluding the irrelevant features in the dataset without changing them.

Need for Feature Selection


Before implementing any technique, it is really important to understand the need for it, and the same goes for feature selection. As we know, in machine learning it is necessary to provide a pre-processed, good input dataset in order to get better outcomes. We collect a huge amount of data to train our model and help it learn better. Generally, the dataset consists of noisy data, irrelevant data, and some useful data. Moreover, the huge amount of data also slows down the training process of the model, and with noise and irrelevant data, the model may not predict and perform well. So, it is very necessary to remove such noise and less-important data from the dataset; to do this, feature selection techniques are used.

Selecting the best features helps the model to perform well. For example, suppose we want to create a model that automatically decides which car should be crushed for spare parts, and for this we have a dataset. This dataset contains the model of the car, the year, the owner's name, and the miles driven. In this dataset, the name of the owner does not contribute to the model's performance, as it does not determine whether the car should be crushed or not, so we can remove this column and select the rest of the features (columns) for model building.

Below are some benefits of using feature selection in machine learning:

o It helps in avoiding the curse of dimensionality.


o It helps in the simplification of the model so that it can be easily interpreted by
the researchers.
o It reduces the training time.
o It reduces overfitting and hence enhances generalization.
Feature Selection Techniques
There are mainly two types of Feature Selection techniques, which are:

o Supervised Feature Selection technique


Supervised Feature selection techniques consider the target variable and can be used
for the labelled dataset.
o Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable and can be used
for the unlabelled dataset.

There are mainly three techniques under supervised feature Selection:

1. Wrapper Methods
In the wrapper methodology, the selection of features is treated as a search problem, in which different combinations are made, evaluated, and compared with other combinations. It trains the algorithm iteratively using subsets of features. On the basis of the model's output, features are added or removed, and the model is trained again with the new feature set.

Some techniques of wrapper methods are:

o Forward selection - Forward selection is an iterative process, which begins with an


empty set of features. After each iteration, it keeps adding on a feature and evaluates
the performance to check whether it is improving the performance or not. The process
continues until the addition of a new variable/feature does not improve the
performance of the model.
o Backward elimination - Backward elimination is also an iterative approach, but it is
the opposite of forward selection. This technique begins the process by considering all
the features and removes the least significant feature. This elimination process
continues until removing the features does not improve the performance of the model.
o Exhaustive Feature Selection - Exhaustive feature selection is one of the best-performing feature selection methods, which evaluates each feature set by brute force. It means this method tries every possible combination of features and returns the best-performing feature set.
o Recursive Feature Elimination - Recursive feature elimination is a recursive greedy optimization approach, where features are selected by recursively taking smaller and smaller subsets of features. An estimator is trained on each set of features, and the importance of each feature is determined using a coef_ attribute or a feature_importances_ attribute.
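A minimal sketch of recursive feature elimination with scikit-learn (the dataset, estimator, and target count of 5 features are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# recursively drop the least important features until 5 remain;
# importance is read from the estimator's coef_ attribute
rfe = RFE(estimator=LogisticRegression(max_iter=10000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the selected features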

2. Filter Methods
In Filter Method, features are selected on the basis of statistics measures. This method
does not depend on the learning algorithm and chooses the features as a pre-
processing step.

The filter method filters out the irrelevant feature and redundant columns from the
model by using different metrics through ranking.

The advantage of using filter methods is that it needs low computational time and
does not overfit the data.

Some common techniques of Filter methods are as follows:

o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
Information Gain: Information gain determines the reduction in entropy while
transforming the dataset. It can be used as a feature selection technique by calculating
the information gain of each variable with respect to the target variable.

Chi-square Test: Chi-square test is a technique to determine the relationship between


the categorical variables. The chi-square value is calculated between each feature and
the target variable, and the desired number of features with the best chi-square value
is selected.

Fisher's Score:

Fisher's score is one of the popular supervised techniques for feature selection. It ranks the variables by Fisher's criterion in descending order; we can then select the variables with a large Fisher's score.

Missing Value Ratio:

The missing value ratio can be used for evaluating features against a threshold value. The missing value ratio is the number of missing values in a column divided by the total number of observations. Variables whose ratio exceeds the threshold can be dropped.
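A minimal sketch of two of these filters with scikit-learn (dataset and k are chosen only for illustration; the chi-square test requires non-negative feature values):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# chi-square filter: score each feature against the target, keep the k best
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

# information gain filter: same idea, with mutual information as the score
X_mi = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)
print(X_chi2.shape, X_mi.shape)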

3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features along with low computational cost. They are fast methods, similar to filter methods, but more accurate.
These methods are also iterative, which evaluates each iteration, and optimally finds
the most important features that contribute the most to training in a particular
iteration. Some techniques of embedded methods are:

o Regularization - Regularization adds a penalty term to the parameters of the machine learning model to avoid overfitting. This penalty is applied to the coefficients; with L1 regularization it shrinks some coefficients to exactly zero, and those features with zero coefficients can be removed from the dataset. Regularization techniques of this kind include L1 regularization (Lasso) and Elastic Net (combined L1 and L2 regularization).
o Random Forest Importance - Tree-based methods of feature selection provide feature importances as a way of selecting features. Here, feature importance specifies which features matter more in model building or have a greater impact on the target variable. Random Forest is such a tree-based method; it is a bagging algorithm that aggregates a number of decision trees. It automatically ranks the nodes by their performance, i.e., the decrease in impurity (Gini impurity) over all the trees. Nodes are arranged by their impurity values, which allows pruning the trees below a specific node. The remaining nodes form a subset of the most important features.
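A minimal sketch of both embedded approaches (the diabetes dataset and the alpha value are illustrative):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)

# L1 regularization (Lasso) shrinks some coefficients exactly to zero;
# features with zero coefficients can be removed from the dataset
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)

# random forest importance: rank features by their impurity decrease
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)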
How to choose a Feature Selection Method?
For machine learning engineers, it is very important to understand which feature selection method will work properly for their model. The better we know the data types of the variables, the easier it is to choose the appropriate statistical measure for feature selection.

To know this, we need to first identify the type of input and output variables. In
machine learning, variables are of mainly two types:

o Numerical Variables: Variable with continuous values such as integer, float


o Categorical Variables: Variables with categorical values, such as Boolean, ordinal, or nominal.

Below are some univariate statistical measures, which can be used for filter-based
feature selection:

1. Numerical Input, Numerical Output:

Numerical Input variables are used for predictive regression modelling. The common
method to be used for such a case is the Correlation coefficient.

o Pearson's correlation coefficient (For linear Correlation).


o Spearman's rank coefficient (for non-linear correlation).

2. Numerical Input, Categorical Output:


Numerical Input with categorical output is the case for classification predictive
modelling problems. In this case, also, correlation-based techniques should be used,
but with categorical output.

o ANOVA correlation coefficient (linear).


o Kendall's rank coefficient (nonlinear).

3. Categorical Input, Numerical Output:

This is the case of regression predictive modelling with categorical input. It is a less common kind of regression problem. We can use the same measures as discussed in the above case, but in reverse order.

4. Categorical Input, Categorical Output:

This is a case of classification predictive modelling with categorical Input variables.

The commonly used technique for such a case is Chi-Squared Test. We can also use
Information gain in this case.

We can summarise the above cases with the appropriate measures in the below table:

Input Variable | Output Variable | Feature Selection technique
Numerical | Numerical | Pearson's correlation coefficient (linear); Spearman's rank coefficient (non-linear)
Numerical | Categorical | ANOVA correlation coefficient (linear); Kendall's rank coefficient (nonlinear)
Categorical | Numerical | Kendall's rank coefficient (linear); ANOVA correlation coefficient (nonlinear)
Categorical | Categorical | Chi-Squared test (contingency tables); Mutual Information
Conclusion
Feature selection is a complicated and vast field of machine learning, and many studies have already been made to discover the best methods. There is no fixed rule for the best feature selection method; choosing one depends on the machine learning engineer, who can combine and innovate approaches to find the best method for a specific problem. One should try a variety of model fits on different subsets of features selected through different statistical measures.

k-Nearest Neighbors,
K-Nearest Neighbor(KNN) Algorithm for
Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.
o Example: Suppose, we have an image of a creature that looks similar to cat and dog,
but we want to know either it is a cat or dog. So for this identification, we can use the
KNN algorithm, as it works on a similarity measure. Our KNN model will find the similar
features of the new data set to the cats and dogs images and based on the most similar
features it will put it in either cat or dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a new data
point x1. In which of these categories will this data point lie? To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular data point. Consider the below diagram:

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new point to the data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each
category.
o Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider
the below image:

o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in
geometry. For two points (x1, y1) and (x2, y2) it can be calculated as:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

o By calculating the Euclidean distance, we get the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
o As we can see, the 3 nearest neighbors are from category A, hence this new data point
must belong to category A. (A from-scratch sketch of these steps follows below.)
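
The steps above can be condensed into a short from-scratch sketch. The sample points and
the query point below are invented for illustration, and distance ties are ignored for
simplicity:

# A from-scratch sketch of the K-NN steps above on made-up 2-D points.
import math
from collections import Counter

def knn_predict(training_points, query, k=5):
    # Steps 2-3: compute Euclidean distances and take the k nearest neighbors
    distances = []
    for (x, y_coord), label in training_points:
        d = math.sqrt((x - query[0]) ** 2 + (y_coord - query[1]) ** 2)
        distances.append((d, label))
    nearest = sorted(distances)[:k]
    # Steps 4-5: count labels among the k neighbors and pick the majority
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Category A and Category B points, plus a new point to classify
points = [((1, 1), 'A'), ((1, 2), 'A'), ((2, 2), 'A'),
          ((6, 6), 'B'), ((7, 7), 'B'), ((6, 7), 'B')]
print(knn_predict(points, query=(2, 3), k=5))  # -> 'A' (3 A votes vs 2 B votes)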

How to select the value of K in the K-NN


Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:


o There is no particular way to determine the best value for "K", so we need to try some
values to find the best among them. The most commonly preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and expose the model to the
effects of outliers.
o Large values for K smooth out noise, but they can blur class boundaries and cause the
model to miss smaller patterns in the data.
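
One practical way to pick K, rather than guessing, is cross-validation. Below is a hedged
sketch; the synthetic dataset stands in for a real training set:

# Picking K by 5-fold cross-validation on synthetic two-class data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

best_k, best_score = 0, 0.0
for k in range(1, 21):
    # Mean cross-validated accuracy for this candidate K
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print("Best K:", best_k, "CV accuracy:", round(best_score, 3))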

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the data
points for all the training samples.

Python implementation of the KNN algorithm


To do the Python implementation of the K-NN algorithm, we will use the same
problem and dataset which we have used in Logistic Regression. But here we will
improve the performance of the model. Below is the problem description:

Problem for K-NN Algorithm: There is a car manufacturer company that has
manufactured a new SUV car. The company wants to show ads to the users who are
interested in buying that SUV. For this problem, we have a dataset that contains
multiple users' information from a social network. The dataset contains a lot of
information, but we will consider Estimated Salary and Age as the independent
variables and the Purchased variable as the dependent variable. Below is the
dataset:
Steps to implement the K-NN algorithm:

o Data Pre-processing step


o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

Data Pre-Processing Step:

The Data Pre-processing step will remain exactly the same as Logistic Regression.
Below is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

By executing the above code, our dataset is imported to our program and well pre-
processed. After feature scaling our test dataset will look like:

From the above output image, we can see that our data is successfully scaled.
o Fitting K-NN classifier to the Training data:
Now we will fit the K-NN classifier to the training data. To do this, we will import
the KNeighborsClassifier class of the sklearn.neighbors library. After importing
the class, we will create a classifier object of the class. The parameters of this
class will be:
o n_neighbors: Defines the required number of neighbors for the algorithm. Usually,
it takes 5.
o metric='minkowski': This is the default parameter, and it decides the
distance measure between the points.
o p=2: With the Minkowski metric, this is equivalent to the standard Euclidean metric.

And then we will fit the classifier to the training data. Below is the code for it:

#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the output as:

Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')

o Predicting the Test Result: To predict the test set result, we will create
a y_pred vector as we did in Logistic Regression. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

Output:

The output for the above code will be:


o Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to see the
accuracy of the classifier. Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

In the above code, we have imported the confusion_matrix function and stored its result in
the variable cm.

Output: By executing the above code, we will get the matrix as below:
In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7
incorrect predictions, whereas, in Logistic Regression, there were 11 incorrect predictions.
So we can say that the performance of the model is improved by using the K-NN
algorithm.
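
As a quick check, the overall accuracy can also be computed directly from the same
predictions with scikit-learn's accuracy_score, assuming the y_test and y_pred vectors
created above:

# Optional check: accuracy computed directly from the predictions.
# 93 correct out of 100 test observations gives 0.93, matching the matrix.
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))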

o Visualizing the Training set result:


Now, we will visualize the training set result for the K-NN model. The code will
remain the same as in Logistic Regression, except for the name of the graph.
Below is the code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the below graph:

The output graph is different from the graph which we obtained in Logistic
Regression. It can be understood from the below points:

o As we can see, the graph shows red points and green points. The
green points are for the Purchased(1) variable and the red points for the
not Purchased(0) variable.
o The graph shows an irregular boundary instead of a straight
line or a curve, because it is a K-NN algorithm, i.e., it finds the
nearest neighbors.
o The graph has classified users into the correct categories, as most of the
users who didn't buy the SUV are in the red region and users who bought
the SUV are in the green region.
o The graph shows a good result, but still there are some green points
in the red region and red points in the green region. This is no big
issue, as it prevents the model from overfitting.
o Hence our model is well trained.
o Visualizing the Test set result:
After the training of the model, we will now test the result on a new
dataset, i.e., the Test dataset. The code remains the same except for some minor
changes: x_train and y_train are replaced by x_test and y_test.
Below is the code for it:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:
The above graph shows the output for the test data set. As we can see in the graph,
the predicted output is quite good, as most of the red points are in the red region and most
of the green points are in the green region.

However, there are a few green points in the red region and a few red points in the green
region. These are the incorrect observations that we counted in the confusion
matrix (7 incorrect outputs).

Naive Bayes,
Naïve Bayes Classifier Algorithm
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective Classification algorithms,
which helps in building fast machine learning models that can make quick
predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of the Naïve Bayes Algorithm are spam filtration, Sentiment
analysis, and classifying articles.

Why is it called Naïve Bayes?


The Naïve Bayes algorithm comprises the two words Naïve and Bayes, which can be
described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying it as an apple, without
depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.

P(B|A) is the Likelihood probability: the probability of the evidence given that the
hypothesis is true.

P(A) is the Prior Probability: the probability of the hypothesis before observing the evidence.

P(B) is the Marginal Probability: the probability of the evidence.

Working of Naïve Bayes' Classifier:


Working of Naïve Bayes' Classifier can be understood with the help of the below
example:

Suppose we have a dataset of weather conditions and a corresponding target variable
"Play". Using this dataset, we need to decide whether we should play or not on
a particular day according to the weather conditions. To solve this problem, we need
to follow the below steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:


Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the Weather Conditions:


Weather Yes No

Overcast 5 0

Rainy 2 2

Sunny 3 2

Total 10 4

Likelihood table for the weather conditions:

Weather No Yes

Overcast 0 5 5/14 = 0.35

Rainy 2 2 4/14 = 0.29

Sunny 2 3 5/14 = 0.35

All 4/14 = 0.29 10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3

P(Sunny) = 5/14 = 0.35

P(Yes) = 10/14 = 0.71

So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No) = 4/14 = 0.29

P(Sunny) = 5/14 = 0.35

So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a Sunny day, the Player can play the game.
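
The hand calculation above can be verified in a few lines of Python. The counts come
straight from the frequency table (10 Yes, 4 No, 14 rows in total):

# Quick check of the Bayes' theorem calculation above.
p_sunny_given_yes = 3 / 10      # P(Sunny|Yes)
p_yes = 10 / 14                 # P(Yes)
p_sunny = 5 / 14                # P(Sunny)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # -> 0.6, matching the result above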

Advantages of Naïve Bayes Classifier:


o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.

Applications of Naïve Bayes Classifier:


o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:


There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes
that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomial distributed. It is primarily used for document classification problems, it
means a particular document belongs to which category such as Sports, Politics,
education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular word
is present or not in a document. This model is also well known for document classification
tasks.
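
All three variants share the same fit/predict interface in scikit-learn. Below is a hedged
sketch; the toy documents and labels are assumptions for illustration only:

# Sketch of the three Naive Bayes variants in scikit-learn.
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

docs = ["win money now", "meeting schedule today", "cheap money offer"]
labels = [1, 0, 1]  # made-up labels: 1 = spam, 0 = not spam

counts = CountVectorizer().fit_transform(docs)

# Multinomial NB uses word frequencies; Bernoulli NB uses word presence
MultinomialNB().fit(counts, labels)
BernoulliNB().fit(counts, labels)

# Gaussian NB expects continuous features, e.g. scaled Age and Salary:
# GaussianNB().fit(x_train, y_train)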

Python Implementation of the Naïve Bayes


algorithm:
Now we will implement a Naive Bayes Algorithm using Python. So for this, we will use
the "user_data" dataset, which we have used in our other classification model.
Therefore we can easily compare the Naive Bayes model with the other models.

Steps to implement:
o Data Pre-processing step
o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

1) Data Pre-processing step:


In this step, we will pre-process/prepare the data so that we can use it efficiently in our
code. It is similar to what we did in data pre-processing. The code for this is given below:

# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In the above code, we have loaded the dataset into our program using "dataset =
pd.read_csv('user_data.csv')". The loaded dataset is divided into a training and a test set,
and then we have scaled the feature variables.

The output for the dataset is given as:


2) Fitting Naive Bayes to the Training Set:
After the pre-processing step, now we will fit the Naive Bayes model to the Training
set. Below is the code for it:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)

In the above code, we have used the GaussianNB classifier to fit it to the training
dataset. We can also use other classifiers as per our requirement.

Output:

Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)


3) Prediction of the test set result:
Now we will predict the test set result. For this, we will create a new predictor
variable y_pred, and will use the predict function to make the predictions.

# Predicting the Test set results
y_pred = classifier.predict(x_test)

Output:
The above output shows the result for the prediction vector y_pred and the real vector y_test.
We can see that some predictions are different from the real values; these are the
incorrect predictions.

4) Creating Confusion Matrix:


Now we will check the accuracy of the Naive Bayes classifier using the Confusion
matrix. Below is the code for it:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:

As we can see in the above confusion matrix output, there are 7+3= 10 incorrect
predictions, and 65+25=90 correct predictions.

5) Visualizing the training set result:


Next we will visualize the training set result using Naïve Bayes Classifier. Below is the
code for it:

# Visualising the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
    alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

In the above output, we can see that the Naïve Bayes classifier has segregated the data
points with a fine boundary. It is a Gaussian curve, as we have
used the GaussianNB classifier in our code.

6) Visualizing the Test set result:


# Visualising the Test set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
    alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above output is the final output for the test set data. As we can see, the classifier has
created a Gaussian curve to divide the "purchased" and "not purchased" variables.
There are some wrong predictions, which we have counted in the Confusion matrix, but
it is still a pretty good classifier.

Simple Linear Regression,


Simple Linear Regression in Machine Learning
Simple Linear Regression is a type of Regression algorithm that models the
relationship between a dependent variable and a single independent variable. The
relationship shown by a Simple Linear Regression model is linear (a sloped straight
line), hence it is called Simple Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on
continuous or categorical values.

The Simple Linear Regression algorithm has mainly two objectives:


o Model the relationship between the two variables, such as the relationship between
Income and expenditure, or experience and Salary.
o Forecast new observations, such as Weather forecasting according to
temperature, or the Revenue of a company according to the investments in a year.

Simple Linear Regression Model:


The Simple Linear Regression model can be represented using the below equation:

y= a0+a1x+ ε

Where,

a0= It is the intercept of the Regression line (can be obtained putting x=0)
a1= It is the slope of the regression line, which tells whether the line is increasing
or decreasing.
ε = The error term. (For a good model it will be negligible)
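
The coefficients a0 and a1 can be estimated from data by ordinary least squares. Below is a
minimal sketch of the closed-form estimates, using made-up experience/salary pairs (not
the Salary_Data.csv dataset used later in this section):

# Least-squares estimates of the intercept a0 and slope a1.
# The experience/salary pairs are invented for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # years of experience
y = np.array([40.0, 48.0, 55.0, 65.0, 73.0])   # salary (in thousands)

a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()
print("y = %.2f + %.2f * x" % (a0, a1))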

Implementation of Simple Linear Regression


Algorithm using Python
Problem Statement example for Simple Linear Regression:

Here we are taking a dataset that has two variables: salary (dependent variable) and
experience (independent variable). The goals of this problem are:

o To find out if there is any correlation between these two variables.
o To find the best fit line for the dataset.
o To see how the dependent variable changes as the independent variable changes.
In this section, we will create a Simple Linear Regression model to find out the best
fitting line for representing the relationship between these two variables.

To implement the Simple Linear regression model in machine learning using Python,
we need to follow the below steps:

Step-1: Data Pre-processing

The first step for creating the Simple Linear Regression model is data pre-processing.
We have already done it earlier in this tutorial. But there will be some changes, which
are given in the below steps:

o First, we will import the three important libraries, which will help us for loading the
dataset, plotting the graphs, and creating the Simple Linear Regression model.

import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Next, we will load the dataset into our code:

data_set= pd.read_csv('Salary_Data.csv')

By executing the above line of code (ctrl+ENTER), we can read the dataset on our
Spyder IDE screen by clicking on the variable explorer option.
The above output shows the dataset, which has two variables: Salary and Experience.

Note: In Spyder IDE, the folder containing the code file must be saved as a working
directory, and the dataset or csv file should be in the same folder.
o After that, we need to extract the dependent and independent variables from the given
dataset. The independent variable is years of experience, and the dependent variable
is salary. Below is code for it:

x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values

In the above lines of code, for the x variable, we have used :-1, since we want to take all
columns except the last one. For the y variable, we have used 1 as the index, since we
want to extract the second column (indexing starts from zero).

By executing the above line of code, we will get the output for X and Y variable as:
In the above output image, we can see that the X (independent) and Y (dependent)
variables have been extracted from the given dataset.

o Next, we will split both variables into the test set and training set. We have 30
observations, so we will take 20 observations for the training set and 10 observations
for the test set. We are splitting our dataset so that we can train our model using a
training dataset and then test the model using a test dataset. The code for this is given
below:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state=0)

By executing the above code, we will get x-test, x-train and y-test, y-train dataset.
Consider the below images:

Test-dataset:
Training Dataset:
o For simple linear Regression, we will not use Feature Scaling. Because Python libraries
take care of it for some cases, so we don't need to perform it here. Now, our dataset is
well prepared to work on it and we are going to start building a Simple Linear
Regression model for the given problem.

Step-2: Fitting the Simple Linear Regression to the Training Set:

Now the second step is to fit our model to the training dataset. To do so, we will import
the LinearRegression class of the linear_model library from the scikit learn. After
importing the class, we are going to create an object of the class named as a regressor.
The code for this is given below:

#Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)
In the above code, we have used a fit() method to fit our Simple Linear Regression
object to the training set. In the fit() function, we have passed the x_train and y_train,
which is our training dataset for the dependent and an independent variable. We have
fitted our regressor object to the training set so that the model can easily learn the
correlations between the predictor and target variables. After executing the above lines
of code, we will get the below output.

Output:

Out[7]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,


normalize=False)

Step: 3. Prediction of test set result:

Our model is now fitted to the training set using the dependent (salary) and
independent (Experience) variables. So, now, our model is ready to predict the
output for new observations. In this step, we will provide the test dataset (new
observations) to the model to check whether it can predict the correct output or not.

We will create the prediction vectors y_pred and x_pred, which will contain the
predictions for the test dataset and the training set, respectively.

#Prediction of Test and Training set result
y_pred= regressor.predict(x_test)
x_pred= regressor.predict(x_train)

On executing the above lines of code, two variables named y_pred and x_pred will be
generated in the variable explorer, containing the salary predictions for the test
set and the training set.

Output:

You can check the variable by clicking on the variable explorer option in the IDE, and
also compare the result by comparing values from y_pred and y_test. By comparing
these values, we can check how good our model is performing.
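
For a quick numeric comparison (rather than scrolling through the variable explorer), an
optional snippet can place the actual and predicted salaries side by side; it assumes the
pandas import and the y_test/y_pred arrays from the steps above:

# Side-by-side view of actual vs. predicted salaries for the test set
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison)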

Step: 4. visualizing the Training set results:

Now in this step, we will visualize the training set result. To do so, we will use the
scatter() function of the pyplot library, which we have already imported in the pre-
processing step. The scatter () function will create a scatter plot of observations.

In the x-axis, we will plot the Years of Experience of employees and on the y-axis, salary
of employees. In the function, we will pass the real values of training set, which means
a year of experience x_train, training set of Salaries y_train, and color of the
observations. Here we are taking a green color for the observation, but it can be any
color as per the choice.

Now, we need to plot the regression line, so for this, we will use the plot() function of
the pyplot library. In this function, we will pass the years of experience for training set,
predicted salary for training set x_pred, and color of the line.

Next, we will give the title for the plot. So here, we will use the title() function of
the pyplot library and pass the name "Salary vs Experience (Training Dataset)".


After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel()
function.

Finally, we will represent all above things in a graph using show(). The code is given
below:

mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()

Output:

By executing the above lines of code, we will get the below graph plot as an output.

In the above plot, we can see the real values observations in green dots and predicted
values are covered by the red regression line. The regression line shows a correlation
between the dependent and independent variable.

The goodness of fit of the line can be assessed by calculating the difference between the
actual and predicted values. As we can see in the above plot, most of the
observations are close to the regression line; hence our model is good for the
training set.


Step: 5. visualizing the Test set results:

In the previous step, we have visualized the performance of our model on the training
set. Now, we will do the same for the Test set. The complete code will remain the same
as the above code, except in this, we will use x_test, and y_test instead of x_train and
y_train.

Here we are also changing the color of observations and regression line to differentiate
between the two plots, but it is optional.

#visualizing the Test set results
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()

Output:

By executing the above line of code, we will get the output as:

In the above plot, the observations are given in blue, and the prediction is
given by the red regression line. As we can see, most of the observations are close to
the regression line; hence we can say that our Simple Linear Regression model is good
and able to make good predictions.

Multiple Regression,
Multiple Linear Regression
In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there
may be various cases in which the response variable is affected by more than one
predictor variable; for such cases, the Multiple Linear Regression algorithm is used.

Moreover, Multiple Linear Regression is an extension of Simple Linear Regression, as it
takes more than one predictor variable to predict the response variable. We can define
it as:

Multiple Linear Regression is one of the important regression algorithms which models the linear
relationship between a single dependent continuous variable and more than one independent
variable.

Example:

Prediction of CO2 emission based on engine size and number of cylinders in a car.

Some key points about MLR:

o For MLR, the dependent or target variable(Y) must be the continuous/real, but the
predictor or independent variable may be of continuous or categorical form.
o Each feature variable must model the linear relationship with the dependent variable.
o MLR tries to fit a regression line through a multidimensional space of data-points.

MLR equation:
In Multiple Linear Regression, the target variable(Y) is a linear combination of multiple
predictor variables x1, x2, x3, ...,xn. Since it is an enhancement of Simple Linear
Regression, so the same is applied for the multiple linear regression equation, the
equation becomes:

Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn ............ (a)

Where,

Y = Output/Response variable

b0, b1, b2, b3, ..., bn = Coefficients of the model

x1, x2, x3, x4, ... = Various Independent/feature variables
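
As an illustration of equation (a), the coefficients b0..bn can be estimated with NumPy's
least-squares solver. The tiny design matrix below is made up so that the fit is exact
(y = 2*x1 + 3*x2 with a zero intercept); this is a sketch, not the implementation used
later in this section:

# Estimating b0..bn by least squares on a made-up exact-fit example
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])  # x1, x2
y = np.array([8.0, 7.0, 18.0, 17.0])                            # y = 2*x1 + 3*x2

# Prepend a column of ones so the first coefficient plays the role of b0
X_design = np.hstack([np.ones((X.shape[0], 1)), X])
coeffs, residuals, rank, sv = np.linalg.lstsq(X_design, y, rcond=None)
print(coeffs)  # approximately [0.0, 2.0, 3.0] -> b0, b1, b2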

Assumptions for Multiple Linear Regression:


o A linear relationship should exist between the Target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent
variables) in the data.

Implementation of Multiple Linear Regression model using


Python:
To implement MLR using Python, we have below problem:

Problem Description:

We have a dataset of 50 start-up companies. This dataset contains five main pieces of
information: R&D Spend, Administration Spend, Marketing Spend, State, and
Profit for a financial year. Our goal is to create a model that can easily determine
which company has the maximum profit, and which factor most affects the profit
of a company.

Since we need to find the Profit, so it is the dependent variable, and the other four
variables are independent variables. Below are the main steps of deploying the MLR
model:

1. Data Pre-processing Steps


2. Fitting the MLR model to the training set
3. Predicting the result of the test set

Step-1: Data Pre-processing Step:

The very first step is data pre-processing, which we have already discussed in this
tutorial. This process contains the below steps:

o Importing libraries: Firstly we will import the library which will help in building the
model. Below is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Importing dataset: Now we will import the dataset(50_CompList), which contains all
the variables. Below is the code for it:

#importing datasets
data_set= pd.read_csv('50_CompList.csv')

Output: We will get the dataset as:

In the above output, we can clearly see that there are five variables, of which four
are continuous and one is a categorical variable.

o Extracting dependent and independent Variables:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 4].values

Output:

Out[5]:

array([[165349.2, 136897.8, 471784.1, 'New York'],


[162597.7, 151377.59, 443898.53, 'California'],
[153441.51, 101145.55, 407934.54, 'Florida'],
[144372.41, 118671.85, 383199.62, 'New York'],
[142107.34, 91391.77, 366168.42, 'Florida'],
[131876.9, 99814.71, 362861.36, 'New York'],
[134615.46, 147198.87, 127716.82, 'California'],
[130298.13, 145530.06, 323876.68, 'Florida'],
[120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'],
[101913.08, 110594.11, 229160.95, 'Florida'],
[100671.96, 91790.61, 249744.55, 'California'],
[93863.75, 127320.38, 249839.44, 'Florida'],
[91992.39, 135495.07, 252664.93, 'California'],
[119943.24, 156547.42, 256512.92, 'Florida'],
[114523.61, 122616.84, 261776.23, 'New York'],
[78013.11, 121597.55, 264346.06, 'California'],
[94657.16, 145077.58, 282574.31, 'New York'],
[91749.16, 114175.79, 294919.57, 'Florida'],
[86419.7, 153514.11, 0.0, 'New York'],
[76253.86, 113867.3, 298664.47, 'California'],
[78389.47, 153773.43, 299737.29, 'New York'],
[73994.56, 122782.75, 303319.26, 'Florida'],
[67532.53, 105751.03, 304768.73, 'Florida'],
[77044.01, 99281.34, 140574.81, 'New York'],
[64664.71, 139553.16, 137962.62, 'California'],
[75328.87, 144135.98, 134050.07, 'Florida'],
[72107.6, 127864.55, 353183.81, 'New York'],
[66051.52, 182645.56, 118148.2, 'Florida'],
[65605.48, 153032.06, 107138.38, 'New York'],
[61994.48, 115641.28, 91131.24, 'Florida'],
[61136.38, 152701.92, 88218.23, 'New York'],
[63408.86, 129219.61, 46085.25, 'California'],
[55493.95, 103057.49, 214634.81, 'Florida'],
[46426.07, 157693.92, 210797.67, 'California'],
[46014.02, 85047.44, 205517.64, 'New York'],
[28663.76, 127056.21, 201126.82, 'Florida'],
[44069.95, 51283.14, 197029.42, 'California'],
[20229.59, 65947.93, 185265.1, 'New York'],
[38558.51, 82982.09, 174999.3, 'California'],
[28754.33, 118546.05, 172795.67, 'California'],
[27892.92, 84710.77, 164470.71, 'Florida'],
[23640.93, 96189.63, 148001.11, 'California'],
[15505.73, 127382.3, 35534.17, 'New York'],
[22177.74, 154806.14, 28334.72, 'California'],
[1000.23, 124153.04, 1903.93, 'New York'],
[1315.46, 115816.21, 297114.46, 'Florida'],
[0.0, 135426.92, 0.0, 'California'],
[542.05, 51743.15, 0.0, 'New York'],
[0.0, 116983.8, 45173.06, 'California']], dtype=object)
As we can see in the above output, the last column contains categorical variables which
are not suitable to apply directly for fitting the model. So we need to encode this
variable.

Encoding Dummy Variables:

As we have one categorical variable (State) which cannot be directly applied to the
model, we will encode it. To encode the categorical variable into numbers, we will
use the LabelEncoder class. But this is not sufficient, because the encoding still implies
a relational order, which may create a wrong model. So, to remove this problem, we will
use OneHotEncoder, which will create the dummy variables. Below is the code for it:

#Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x= LabelEncoder()
x[:, 3]= labelencoder_x.fit_transform(x[:, 3])
onehotencoder= OneHotEncoder(categorical_features= [3])
x= onehotencoder.fit_transform(x).toarray()

Here we are only encoding one independent variable, which is State, as the other variables
are continuous.

Output:
As we can see in the above output, the State column has been converted into dummy
variables (0 and 1). Here each dummy variable column corresponds to one
State. We can check this by comparing it with the original dataset: the first column
corresponds to California, the second column to Florida,
and the third column to New York.

Note: We should not use all the dummy variables at the same time; the number used must
be one less than the total number of dummy variables, else it will create a dummy variable
trap.
o Now, we are writing a single line of code just to avoid the dummy variable trap:

#avoiding the dummy variable trap:
x = x[:, 1:]

If we do not remove the first dummy variable, then it may introduce multicollinearity
in the model.
As we can see in the above output image, the first column has been removed.
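
Note that OneHotEncoder(categorical_features=[3]) reflects an older scikit-learn API; the
categorical_features argument was removed in later releases. A hedged sketch of the
equivalent encoding in recent scikit-learn versions, which also avoids the dummy variable
trap directly via drop='first', looks like this:

# Equivalent encoding in newer scikit-learn (the categorical_features
# argument of OneHotEncoder no longer exists there).
# drop='first' removes one dummy column, avoiding the dummy variable trap.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[('state', OneHotEncoder(drop='first'), [3])],  # column 3 = State
    remainder='passthrough')
x = ct.fit_transform(x)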

o Now we will split the dataset into training and test set. The code for this is given below:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Output: The above code will split the dataset into a training set and a test set. You can
check the output by clicking on the variable explorer option given in the Spyder IDE. The
test set and training set will look like the below images:

Test set:
Training set:

Note: In MLR, we will not do feature scaling as it is taken care by the library, so we
don't need to do it manually.
Step: 2- Fitting our MLR model to the Training set:
Now we have prepared our dataset well for training, which means we will fit our
regression model to the training set. It will be similar to what we did in the Simple
Linear Regression model. The code for this will be:

#Fitting the MLR model to the training set:
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)

Output:

Out[9]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,


normalize=False)
Now, we have successfully trained our model using the training dataset. In the next
step, we will test the performance of the model using the test dataset.

Step: 3- Prediction of Test set results:


The last step for our model is checking the performance of the model. We will do it by
predicting the test set result. For prediction, we will create a y_pred vector. Below is
the code for it:

#Predicting the Test set result
y_pred= regressor.predict(x_test)

By executing the above lines of code, a new vector will be generated under the variable
explorer option. We can test our model by comparing the predicted values and test
set values.

Output:
In the above output, we have the predicted result set and the test set. We can check model
performance by comparing these two values index by index. For example, the first index
has a predicted value of $103,015 profit and a test/real value of $103,282 profit. The
difference is only $267, which is a good prediction, so our model is
complete here.

o We can also check the score for training dataset and test dataset. Below is the code for
it:

print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))

Output: The score is:

Train Score: 0.9501847627493607


Test Score: 0.9347068473282446

The above score (the R-squared statistic returned by regressor.score) tells us that our model
explains about 95% of the variance in the training dataset and about 93% in the test dataset.
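
For reference, the same R-squared value can be reproduced explicitly from the test
predictions, assuming the y_test and y_pred vectors above:

# R-squared computed explicitly from the test predictions
from sklearn.metrics import r2_score
print('Test R^2:', r2_score(y_test, y_pred))  # ~0.93, matching the score above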


Note: In the next topic, we will see how we can improve the performance of the
model using the Backward Elimination process.
Applications of Multiple Linear Regression:
There are mainly two applications of Multiple Linear Regression:

o Measuring the effectiveness of each independent variable on the prediction.
o Predicting the impact of changes in the independent variables on the output.

Digression,
We introduce a novel measure, called digression, to assess the
value-disclosure risk in constructing regression trees for data
partitioning; specifically, an algorithm is developed that uses the
measure for pruning the tree to limit disclosure of sensitive data.
Logistic Regression

UNIT V
Clustering: The Idea,
The Model,
Choosing k,
Bottom-Up Hierarchical Clustering.
Recommender Systems: Manual Curation,
Recommending What’s Popular,
User-Based
Collaborative Filtering,
Item-Based Collaborative Filtering,
Matrix Factorization Data Ethics,
Building Bad Data Products,
Trading Off Accuracy and Fairness,
Collaboration,
Interpretability,
Recommendations,
Biased Data,
Data Protection IPython,
Mathematics,
NumPy,
pandas,
scikit-learn,
Visualization,R
