Lab Manual 04
Machine Learning
Logistic Regression
Dated:
29th Jan, 2024 to 2nd Feb, 2024
Semester:
2024
Objectives:
1. Train/Test
2. Save and Load trained Model
3. Pickle and sklearn joblib
4. Logistic Regression
What is Train/Test
It is called Train/Test because you split the data set into two sets: a training set and a testing set.
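Our data set illustrates 100 customers in a shop and their shopping habits. The x axis represents
the number of minutes each customer stays in the shop, and the y axis the amount of money spent.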
import numpy
import matplotlib.pyplot as plt
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
plt.scatter(x, y)
plt.show()
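The training set should be a random selection of 80% of the original data, and the testing set the
remaining 20%. Because the values were generated in random order, taking the first 80 for training
and the last 20 for testing already gives a random selection.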
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
plt.scatter(train_x, train_y)
plt.show()
To make sure the testing set is not completely different, we will take a look at the testing set as
well.
plt.scatter(test_x, test_y)
plt.show()
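The slices above take the first 80 values, which works here only because the data was generated in random order. For a data set that is sorted or grouped, the rows would need to be shuffled before splitting. A minimal sketch of that idea, assuming a NumPy permutation is acceptable; the names `order`, `train_idx` and `test_idx` and the seed value 0 are illustrative choices, not part of the lab:

```python
import numpy

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

# Shuffle the row indices so the split does not depend on the original order.
rng = numpy.random.default_rng(0)  # seed 0 is an arbitrary choice
order = rng.permutation(len(x))

split = int(0.8 * len(x))  # 80% of the rows go to the training set
train_idx, test_idx = order[:split], order[split:]

# Index x and y with the same shuffled indices so the pairs stay together.
train_x, train_y = x[train_idx], y[train_idx]
test_x, test_y = x[test_idx], y[test_idx]

print(len(train_x), len(test_x))
```

Using the same index arrays for x and y keeps each customer's minutes and spending paired after the shuffle.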
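Fit the Data Set
The scatter plots suggest a curved relationship, so we fit a polynomial regression to the training
data and draw the fitted line over the scatter plot.
Example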
import numpy
import matplotlib.pyplot as plt
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
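# Fit a polynomial to the training data (degree 4 is an assumption; the manual does not state it)
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
# Evenly spaced x values from 0 to 6 over which to draw the fitted curve
myline = numpy.linspace(0, 6, 100)
plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))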
plt.show()
The result supports the suggestion that a polynomial regression fits the data set, even though the
model gives strange results if we try to predict values outside the range of the data. Example: the
line indicates that a customer spending 6 minutes in the shop would make a purchase worth 200.
That is probably a sign of overfitting.
But what about the R-squared score? The R-squared score is a good indicator of how well the
model fits the data set.
R2
It measures the relationship between the x axis and the y axis. For a reasonable model the value
ranges from 0 to 1, where 0 means no relationship and 1 means totally related. The score can be
negative when the model fits worse than always predicting the mean of y.
The sklearn.metrics module has a function called r2_score() that will help us find this relationship.
In this case we would like to measure the relationship between the minutes a customer stays in
the shop and how much money they spend.
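The score that r2_score() reports is 1 minus the ratio of the residual sum of squares to the total sum of squares. A short sketch of that calculation with NumPy alone; the helper name `r2_manual` and the toy numbers are made up for illustration:

```python
import numpy

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = numpy.asarray(y_true, dtype=float)
    y_pred = numpy.asarray(y_pred, dtype=float)
    ss_res = numpy.sum((y_true - y_pred) ** 2)         # variation the model leaves unexplained
    ss_tot = numpy.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_manual(y_true, y_true)                  # predictions match exactly
mean_only = r2_manual(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean of y
worse = r2_manual(y_true, [9.0, 7.0, 5.0, 3.0])      # predictions in reverse order

print(perfect, mean_only, worse)
```

A perfect fit scores 1, always predicting the mean scores 0, and a model that does worse than the mean scores below 0.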
Example
import numpy
from sklearn.metrics import r2_score
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
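# Fit a polynomial to the training data (degree 4 is an assumption; the manual does not state it)
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
r2 = r2_score(train_y, mymodel(train_x))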
print(r2)
Now we have made a model that is OK, at least when it comes to the training data.
Next we want to test the model with the testing data as well, to see if it gives us a similar result.
Example
import numpy
from sklearn.metrics import r2_score
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
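# Fit on the training data only (degree 4 is an assumption), then score on the unseen testing data
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
r2 = r2_score(test_y, mymodel(test_x))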
print(r2)
Predict Values
Now that we have established that our model is OK, we can start predicting new values.
Example
How much money will a buying customer spend, if she or he stays in the shop for 5 minutes?
print(mymodel(5))
Solving a problem in ML typically consists of two steps. The first step is to train a model
using your training dataset. The second step is to ask your questions to the trained model,
which works somewhat like a human brain and gives you the answers.
The training dataset is often very large, because a model usually becomes more accurate as the
amount of training data grows. It is like football training: the more you train, the better you
become at the game. When the training dataset is huge, often gigabytes in size, the training
step becomes time consuming. If you save the trained model to a file, you can later use that
same model to make the actual predictions, so you do not need to train it every time you want
to ask these questions.
Open the Linear Regression Python file that predicts home prices and make the following changes.
In that file, add a new code cell in Google Colab and write the code below.
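import pickle
# Save the trained model object (named model in the home-prices file) to a file.
# The file name 'model_pickle' is an example; any name works.
with open('model_pickle', 'wb') as f:
    pickle.dump(model, f)
# Load the saved model back into a new object
with open('model_pickle', 'rb') as f:
    mp = pickle.load(f)
mp.coef_
mp.intercept_
mp.predict([[5000]])
# The same save and load steps using joblib
import joblib
joblib.dump(model, 'model_joblib')
mj = joblib.load('model_joblib')
mj.coef_
mj.intercept_
mj.predict([[5000]])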
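The cell above depends on the model object from the home-prices file. As a self-contained sketch of the same save-and-load round trip, the following trains a small linear model on made-up area and price pairs (the numbers and the temporary file location are assumptions for illustration only) and checks that the reloaded model gives the same prediction:

```python
import os
import pickle
import tempfile

import numpy
from sklearn.linear_model import LinearRegression

# Made-up training data: price is exactly 100 * area + 50000.
area = numpy.array([[1000], [2000], [3000], [4000]])
price = numpy.array([150000, 250000, 350000, 450000])

model = LinearRegression()
model.fit(area, price)

# Save the trained model to a file in a temporary folder, then load it back.
path = os.path.join(tempfile.mkdtemp(), 'model_pickle')
with open(path, 'wb') as f:
    pickle.dump(model, f)
with open(path, 'rb') as f:
    mp = pickle.load(f)

original = model.predict([[5000]])
restored = mp.predict([[5000]])
print(original, restored)
```

Only load pickle files that you created yourself or that come from a trusted source, because unpickling can execute arbitrary code.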
Logistic Regression
Logistic regression aims to solve classification problems. It does this by predicting categorical
outcomes, unlike linear regression that predicts a continuous outcome.
In the simplest case there are two outcomes, which is called binomial. An example is predicting
whether a tumor is malignant or benign. When there are more than two outcomes to classify, it is
called multinomial. A common example of multinomial logistic regression is predicting the class
of an iris flower among 3 different species.
Here we will be using basic logistic regression to predict a binomial variable. This means it has
only two possible outcomes.
In Python we have modules that will do the work for us. Start by importing the NumPy module.
import numpy
We will use a method from the sklearn module, so we will have to import that module as well:
from sklearn import linear_model
From the sklearn module we will use LogisticRegression() to create a logistic regression object.
This object has a method called fit() that takes the independent and dependent values as
parameters and fills the regression object with data that describes the relationship:
Lab Instructor: Sheharyar Khan
UNIVERSITY OF ENGINEERING AND TECHNOLOGY, TAXILA
FACULTY OF TELECOMMUNICATION AND INFORMATION ENGINEERING
COMPUTER ENGINEERING DEPARTMENT
logr = linear_model.LogisticRegression()
logr.fit(X,y)
Now we have a logistic regression object that is ready to predict whether a tumor is cancerous
based on the tumor size:
Example
import numpy
from sklearn import linear_model
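# Tumor sizes, reshaped into a single column because fit() expects a 2-D array
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
# Whether each tumor is cancerous (0 for no, 1 for yes)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
logr = linear_model.LogisticRegression()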
logr.fit(X,y)
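Once fitted, the object's predict() method returns the predicted class for new observations. A small sketch using the tumor data from this lab; the two new sizes (1.0 and 5.5) are hypothetical values chosen to sit well away from the decision boundary:

```python
import numpy
from sklearn import linear_model

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65,
                 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

# predict() expects a 2-D array: one row per observation, one column per feature.
new_sizes = numpy.array([1.0, 5.5]).reshape(-1, 1)  # hypothetical tumor sizes
predicted = logr.predict(new_sizes)
print(predicted)
```

The small tumor is assigned class 0 and the large one class 1.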
Coefficient
In logistic regression the coefficient is the expected change in log-odds of having the outcome
per unit change in X.
This does not have the most intuitive understanding so let's use it to create something that makes
more sense, odds.
Example
import numpy
from sklearn import linear_model
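X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])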
logr = linear_model.LogisticRegression()
logr.fit(X,y)
log_odds = logr.coef_
odds = numpy.exp(log_odds)
print(odds)
This tells us that as the size of a tumor increases by 1mm, the odds of it being a cancerous tumor
increase by about 4 times.
Probability
The coefficient and intercept values can be used to find the probability that each tumor is
cancerous.
Create a function that uses the model's coefficient and intercept values to return a new value.
This new value represents the probability that the tumor in the given observation is cancerous:
def logit2prob(logr, x):
    log_odds = logr.coef_ * x + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability
Function Explained
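To find the log-odds for each observation, we must first create a formula that looks similar to the
one from linear regression, extracting the coefficient and the intercept.
log_odds = logr.coef_ * x + logr.intercept_
To then convert the log-odds to odds, we must exponentiate the log-odds.
odds = numpy.exp(log_odds)
Now that we have the odds, we can convert them to a probability by dividing the odds by 1 plus the
odds.
probability = odds / (1 + odds)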
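Dividing the odds by 1 plus the odds is the same as applying the logistic (sigmoid) function to the log-odds, since e^z / (1 + e^z) = 1 / (1 + e^(-z)). A quick numerical check of that identity; the sample log-odds values are arbitrary:

```python
import numpy

log_odds = numpy.array([-4.0, -1.0, 0.0, 1.0, 4.0])  # arbitrary sample values

odds = numpy.exp(log_odds)
via_odds = odds / (1 + odds)                  # the route used in logit2prob
via_sigmoid = 1 / (1 + numpy.exp(-log_odds))  # the logistic function

print(via_odds)
print(numpy.allclose(via_odds, via_sigmoid))
```

Log-odds of 0 correspond to odds of 1 and a probability of 0.5; negative log-odds give probabilities below 0.5 and positive log-odds give probabilities above it.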
Let us now use the function with what we have learned to find out the probability that each
tumor is cancerous.
Example
import numpy
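from sklearn import linear_model
# Convert the model's log-odds for each observation into a probability
def logit2prob(logr, x):
    log_odds = logr.coef_ * x + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)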
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
logr = linear_model.LogisticRegression()
logr.fit(X,y)
print(logit2prob(logr, X))
Results Explained
3.78 0.61 The probability that a tumor with the size 3.78cm is cancerous is 61%.
2.44 0.19 The probability that a tumor with the size 2.44cm is cancerous is 19%.
2.09 0.13 The probability that a tumor with the size 2.09cm is cancerous is 13%.
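These hand-computed probabilities can be cross-checked against scikit-learn's own predict_proba() method, whose second column holds the probability of class 1. A sketch of that comparison, reusing the data and the logit2prob function from this section; the names `manual` and `builtin` are illustrative:

```python
import numpy
from sklearn import linear_model

def logit2prob(logr, x):
    log_odds = logr.coef_ * x + logr.intercept_
    odds = numpy.exp(log_odds)
    return odds / (1 + odds)

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65,
                 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

manual = logit2prob(logr, X).ravel()   # probabilities from coefficient and intercept
builtin = logr.predict_proba(X)[:, 1]  # column 1 is the probability of class 1

print(numpy.round(manual[:3], 2))
print(numpy.allclose(manual, builtin))
```

If the two arrays agree, the manual formula and the library are computing the same quantity.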
Task 1
Find the 7_Logistic Regression Python file and upload it to Google Colab together with the data
file Insurance.csv.
Run the code in every cell and write down your understanding of each output.
Task 2
1. Now do some exploratory data analysis to figure out which variables have a direct and clear
impact on employee retention (i.e. whether they leave the company or continue to work)
2. Plot bar charts showing the impact of employee salaries on retention
3. Plot bar charts showing the correlation between department and employee retention
4. Now build logistic regression model using variables that were narrowed down in step 1
5. Measure the accuracy of the model