ML Lab Experiments
LABORATORY MANUAL
Hardware: Intel-based desktop PC with a 2.6 GHz or faster processor, at least 1 GB of RAM, 40 GB of free disk space, and a LAN connection.
Software: Python 3 with the NumPy, Pandas, Matplotlib and scikit-learn libraries.
The purpose of this laboratory is to acquaint students with an overview of the various machine
learning techniques and to demonstrate them using Python. The course gives hands-on practice
in preparing data, training and testing models, and applying standard algorithms such as
regression and classification to practical problems.
Course Objectives:
The objective of this lab is to give an overview of the various machine learning
techniques and to be able to demonstrate them using Python.
To introduce students to the basic concepts of Data Science and techniques of
Machine Learning.
To develop skills of using recent machine learning software for solving practical
problems.
To gain experience of doing independent study and research.
Course Outcomes:
After the completion of the course, the student will be able to:
LAB CODE
Students should report to the concerned lab as per the time table.
Students who turn up late to the labs will in no case be permitted to do the
program scheduled for the day.
After completion of the program, certification of the concerned staff in-charge in the
observation book is necessary.
Student should bring a notebook of 100 pages and should enter the readings
/observations into the notebook while performing the experiment.
The record of observations, along with the detailed experimental procedure of the
experiment performed in the immediately preceding session, should be submitted to and
certified by the staff member in-charge.
Not more than 3-students in a group are permitted to perform the experiment on
the set.
The group-wise division made in the beginning should be adhered to and no mix up
of students among different groups will be permitted.
When the experiment is completed, students should disconnect the setup made by them
and return all the components/instruments taken for the purpose.
Sr. No.    Experiment No.    Name of the Experiment    Page No.
01    Exp-01    Write a program to demonstrate the following:
a) Operation of data types in Python.
b) Different Arithmetic Operations on numbers in Python.
c) Create, concatenate and print a string and access a substring from a given string.
d) Append to and remove from lists in Python.
e) Demonstrate working with tuples in Python.
f) Demonstrate working with dictionaries in Python.
02    Exp-02    Using Python, write a NumPy program to compute the:
a) Expected Value
b) Mean
c) Standard Deviation
d) Variance
e) Covariance
f) Covariance Matrix of two given arrays.
03 Exp-03
04 Exp-04
05 Exp-05
06 Exp-06
07 Exp-07
08 Exp-08
1) Arithmetic Operators
2) Relational Operators OR Comparison Operators
3) Logical operators
4) Bitwise operators
5) Assignment operators
6) Special operators
Arithmetic operators are used with numeric values to perform common mathematical
operations:
There are 7 arithmetic operators in Python :
Source Code:
a=10
b=2
print('a+b=',a+b)
print('a-b=',a-b)
print('a*b=',a*b)
print('a/b=',a/b)
print('a//b=',a//b)
print('a%b=',a%b)
print('a**b=',a**b)
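For a = 10 and b = 2, the program prints:
OUTPUT
a+b= 12
a-b= 8
a*b= 20
a/b= 5.0
a//b= 5
a%b= 0
a**b= 100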
֍ The / operator always performs floating point division, so it always returns a float value.
֍ Floor division (//) can perform both floating point and integer arithmetic: if both arguments
are of int type the result is int; if at least one argument is a float, the result is a float.
c. Create, concatenate and print a string and access substring from a given string.
String:
A Python string is a collection of characters surrounded by single quotes, double quotes,
or triple quotes. The computer does not understand the characters; internally, it stores
each character as a combination of 0s and 1s.
Each character is encoded in ASCII or Unicode. So we can say that Python strings are
also collections of Unicode characters.
Syntax: a string literal can be written as str1 = 'Hello', str2 = "Hello", or str3 = '''Hello'''.
String concatenation is the process of merging one string with another string. It can be
done in the following ways.
1. Using the + operator
# Defining strings
str1 = "Online Smart "
str2 = "Trainer"
# Concatenating the strings with +
str3 = str1 + str2
print(str3)   # Online Smart Trainer
2. Using the join() method
The join() method concatenates the elements of a sequence, placing the separator string between them.
str1 = "Online"
str2 = "SmartTrainer"
# join() method is used to combine the string with a separator Space(" ")
str3 = " ".join([str1, str2])
print(str3)
Accessing a substring (slicing):
s = "OnlineSmartTrainer"   # example string; any string works here
start = s[:3]
end = s[3:]
#result
print("Resultant substring from start:", start)
print("Resultant substring from end:", end)
List
Example
Create a List:
thislist = ["apple", "banana", "cherry"]
print(thislist)
List Items
List items are ordered, changeable, and allow duplicate values.
List items are indexed: the first item has index [0], the second item has index [1], and so on.
List Length
To determine how many items a list has, use the len() function:
Example
Print the number of items in the list:
thislist = ["apple", "banana", "cherry"]
print(len(thislist))
OUTPUT
3
type()
From Python's perspective, lists are defined as objects with the data type 'list':
Example
What is the data type of a list?
mylist = ["apple", "banana", "cherry"]
print(type(mylist))
OUTPUT
<class 'list'>
List Methods
Python has a set of built-in methods that you can use on lists.
Method Description
extend() Add the elements of a list (or any iterable), to the end of the current list
index() Returns the index of the first element with the specified value
#Create a list
thislist = ["apple", "banana", "cherry"]
print(thislist)
OUTPUT
['apple', 'banana', 'cherry']
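The append/remove practice snippets for experiment item (d) are not reproduced in this copy; a minimal sketch using the list created above:
# Append an item to the list
thislist.append("orange")
print(thislist)        # ['apple', 'banana', 'cherry', 'orange']
# Remove an item from the list
thislist.remove("banana")
print(thislist)        # ['apple', 'cherry', 'orange']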
Example
Create a Tuple:
thistuple = ("apple", "banana", "cherry")
print(thistuple)
OUTPUT
('apple', 'banana', 'cherry')
Tuple Items
Tuple items are ordered, unchangeable, and allow duplicate values.
Tuple items are indexed, the first item has index [0], the second item has index [1] etc.
Tuple Length
To determine how many items a tuple has, use the len() function:
Example
thistuple = ("apple", "banana", "cherry")
print(len(thistuple))
OUTPUT
3
#NOT a tuple (one item without a trailing comma is just a string)
thistuple = ("apple")
print(type(thistuple))
OUTPUT
<class 'str'>
type()
From Python's perspective, tuples are defined as objects with the data type 'tuple':
Example
What is the data type of a tuple?
mytuple = ("apple", "banana", "cherry")
print(type(mytuple))
OUTPUT
<class 'tuple'>
Practice:
#Create a tuple
thistuple = ("apple", "banana", "cherry")
print(thistuple)
#Delete a tuple
thistuple = ("apple", "banana", "cherry")
del thistuple
print(thistuple) #this will raise an error because the tuple no longer exists
Dictionary
A dictionary stores data as key:value pairs.
Example
Create and print a dictionary:
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}
print(thisdict)
OUTPUT
{'brand': 'Ford', 'model': 'Mustang', 'year': 1964}
Dictionary Items
Dictionary items are changeable and do not allow duplicate keys. (Since Python 3.7, dictionaries also preserve insertion order.)
Dictionary items are presented in key:value pairs, and can be referred to by using the key name.
Example
Print the "brand" value of the dictionary:
thisdict ={
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
print(thisdict["brand"])
OUTPUT
Ford
Dictionary Length
To determine how many items a dictionary has, use the len() function:
Example
Print the number of items in the dictionary:
print(len(thisdict))
OUTPUT
3
type()
From Python's perspective, dictionaries are defined as objects with the data type 'dict':
Example
Print the data type of a dictionary:
thisdict ={
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
print(type(thisdict))
OUTPUT
<class 'dict'>
# using dict()
my_dict = dict({1: 'apple', 2: 'ball'})
print(my_dict)
OUTPUT
{1: 'apple', 2: 'ball'}
# Accessing elements of a dictionary
my_dict = {'name': 'Jack', 'age': 26}
# Output: Jack
print(my_dict['name'])
# Output: 26
print(my_dict.get('age'))
# my_dict['address'] would raise a KeyError because the key does not exist
# update value
my_dict['age'] = 27
# Output: {'name': 'Jack', 'age': 27}
print(my_dict)
Dictionary Methods
Python has a set of built-in methods that you can use on dictionaries.
Method Description
items() Returns a list containing a tuple for each key value pair
a) Expected Value
In probability, the average value of some random variable X is called the expected value or
the expectation.
For example, the following probability distribution tells us the probability that a certain
soccer team scores a certain number of goals in a given game:
To find the expected value, E(X), or mean μ of a discrete random variable X, simply multiply
each value of the random variable by its probability and add the products. The formula is
given as E(X)=μ=∑xP(x).
Here x represents values of the random variable X, P(x) represents the corresponding
probability, and symbol ∑ represents the sum of all products xP(x). Here we use symbol μ for
the mean because it is a parameter. It represents the mean of a population.
For example, the expected number of goals for the soccer team would be calculated as:
import numpy as np
#define probabilities
probs = [.18, .34, .35, .11, .02]
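The snippet above stops after defining the probabilities. A minimal completion, assuming the five probabilities correspond to 0 through 4 goals as in the distribution described above:
# assumed goal values corresponding to the probabilities above
vals = [0, 1, 2, 3, 4]
# expected value: E(X) = sum of x * P(x)
expected_value = np.sum(np.multiply(vals, probs))
# E(X) = 0*0.18 + 1*0.34 + 2*0.35 + 3*0.11 + 4*0.02 = 1.45 goals
print(expected_value)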
b) Mean:
The arithmetic mean is the sum of the elements along an axis divided by the number of
elements. In other words, it is the sum divided by the count.
The numpy.mean() function is used to compute the arithmetic mean along the specified
axis.
This function returns the average of the array elements. By default, the average is taken on
the flattened array.
# 1D array
import numpy as np
a = [20, 2, 7, 1, 34]
print("array : ", a)
print("Mean of array : ", np.mean(a))
# 2D Array
import numpy as np
x = np.array([[10, 30], [20, 60]])
print("Original array:")
print(x)
print("Mean of each column:")
print(x.mean(axis=0))
print("Mean of each row:")
print(x.mean(axis=1))
In probability theory and statistics, variance is the expectation of the squared deviation of
a random variable from its population mean or sample mean.
In other words, it is the average of the squared differences from the mean.
import numpy as np
# 1D array
a = [20, 2, 7, 1, 34]
print("Array : ", a)
print("Variance of array : ", np.var(a))
# 2D array
arr = [[2, 2, 2, 2, 2],
       [15, 6, 27, 8, 2],
       [23, 2, 54, 1, 2]]
print("Variance of array : ", np.var(arr))
d) Standard deviation
Its symbol is σ (the greek letter sigma)
It is the square root of the Variance.
The numpy module of Python provides a function called numpy.std(), used to compute the
standard deviation along the specified axis. This function returns the standard deviation of
the array elements.
The square root of the average square deviation (computed from the mean), is known as the
standard deviation.
By default, the standard deviation is calculated for the flattened array.
Source Program-1
import numpy as np
# Original array
array = np.arange(10)
print(array)
r1 = np.mean(array)
print("\nMean: ", r1)
r2 = np.std(array)
print("\nstd: ", r2)
r3 = np.var(array)
print("\nvariance: ", r3)
Source Program-2
import numpy as np
# Original array
array = np.arange(10)
print(array)
r1 = np.average(array)
print("\nMean: ", r1)
Source Program-3
import numpy as np
# 2D array
a = np.array([[1, 4, 7, 10], [2, 5, 8, 11]])
print("Standard deviation of the flattened array : ", np.std(a))
print("Standard deviation along each column : ", np.std(a, axis=0))
For example, the covariance between two random variables X and Y can be calculated using
the following formula (for population):
Cov(X, Y) = Σ (Xi − μX)(Yi − μY) / N
where μX and μY are the means of X and Y, and N is the number of observations.
Source Program-2
import numpy as np
array1 = np.array([0, 1, 1])
array2 = np.array([2, 2, 1])
# Original array1
print(array1)
# Original array2
print(array2)
# Covariance matrix
print("\nCovariance matrix of the said arrays:\n", np.cov(array1, array2))
import numpy as np
array1 = np.array([1, 2])
array2 = np.array([1, 2])
# Original array1
print(array1)
# Original array2
print(array2)
# Covariance matrix
print("\nCovariance matrix of the said arrays:\n",np.cov(array1, array2))
Data pre-processing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.
When creating a machine learning project, we do not always come across clean and
formatted data, and while doing any operation with data it is mandatory to clean it and put
it in a formatted way. For this, we use the data pre-processing task.
Real-world data generally contains noise and missing values, and may be in an unusable
format which cannot be used directly for machine learning models.
Data pre-processing is the required task for cleaning the data and making it suitable for a
machine learning model, which also increases the accuracy and efficiency of the machine
learning model.
It involves the following steps:
1) Getting the dataset
2) Importing libraries
3) Importing the dataset
4) Handling missing data
5) Encoding categorical data
6) Splitting the dataset into training and test sets
7) Feature scaling
CSV stands for "Comma-Separated Values" files; it is a file format which allows us to save
the tabular data, such as spreadsheets.
It is useful for huge datasets, and these datasets can be used in programs.
Here we will use a demo dataset for data pre-processing; for practice, it can be
downloaded from "https://www.superdatascience.com/pages/machine-learning".
For real-world problems, we can download datasets online from various sources such as
1) https://www.kaggle.com/uciml/datasets
2) https://archive.ics.uci.edu/ml/index.php
We can also create our dataset by gathering data using various API with Python and put that
data into a .csv file.
2) Importing Libraries
In order to perform data pre-processing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data pre-processing, which are:
1. Numpy: The Numpy Python library is used for including any type of mathematical operation in
the code. It is the fundamental package for scientific calculation in Python. It also adds support
for large, multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used in the whole
program.
2. Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and with
this library, we need to import the sub-library pyplot. This library is used to plot any type of
chart in Python. It is imported as below:
import matplotlib.pyplot as mtp
3. Pandas: The last library is the Pandas library, which is one of the most famous Python
libraries and used for importing and managing the datasets. It is an open-source data
manipulation and analysis library. It will be imported as below:
import pandas as pd
Here, we have used pd as a short name for this library. Consider the below code:
Note: We can set any directory as a working directory, but it must contain the required dataset.
Here, in the below image, we can see the Python file along with required dataset. Now, the current
folder is set as a working directory.
Now to import the dataset, we will use read_csv() function of pandas library, which is used to read
a csv file and performs various operations on it. Using this function, we can read a csv file locally as
well as through a URL.
data_set= pd.read_csv('Dataset.csv')
Here, data_set is a name of the variable to store our dataset, and inside the function, we have
passed the name of our dataset. Once we execute the above line of code, it will successfully import
the dataset in our code. We can also check the imported dataset by clicking on the section variable
explorer, and then double click on data_set. Consider the below image:
As in the above image, indexing is started from 0, which is the default indexing in Python. We can
also change the format of our dataset by clicking on the format option.
To extract an independent variable, we will use iloc[ ] method of Pandas library. It is used to extract
the required rows and columns from the dataset.
x= data_set.iloc[:,:-1].values
In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is for all the
columns. Here we have used :-1, because we don't want to take the last column as it contains the
dependent variable. So by doing this, we will get the matrix of features.
As we can see in the above output, there are only three variables.
y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only. It will give the array of dependent
variables.
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values. In this
way, we just delete the specific row or column which consists of null values. But this way is not very
efficient, and removing data may lead to loss of information, which will not give an accurate output.
By calculating the mean: In this way, we calculate the mean of the column or row which
contains the missing values and put it in place of the missing value. This strategy is useful for
features which have numeric data such as age, salary, year, etc. Here, we will use this approach.
To handle missing values, we will use the Scikit-learn library in our code, which contains various
classes for building machine learning models. Here we will use the SimpleImputer class
of the sklearn.impute library. Below is the code for it:
#Handling missing data (Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=nm.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
print(x)
Output:
Since machine learning models work entirely on mathematics and numbers, a categorical variable
in the dataset may create trouble while building the model. So it is necessary to encode these
categorical variables into numbers.
Firstly, we will convert the country variables into categorical data. So to do this, we will
use LabelEncoder() class from preprocessing library.
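A sketch of the encoding step, assuming the country column is at index 0; the LabelEncoder follows the text above, and the ColumnTransformer/OneHotEncoder pair produces the three dummy columns described in the explanation below:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = nm.array(ct.fit_transform(x))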
Output:
Explanation:
As we can see in the above output, all the variables are encoded into numbers 0 and 1 and divided
into three columns.
It can be seen more clearly in the variables explorer section, by clicking on x option as:
Suppose we have trained our machine learning model on one dataset and then test it on a
completely different dataset. It will then be difficult for our model to understand the
correlations between the features and the output.
If we train our model very well and its training accuracy is also very high, but we provide a new
dataset to it, then it will decrease the performance. So we always try to make a machine learning
model which performs well with the training set and also with the test dataset. Here, we can define
these datasets as:
Training Set: A subset of dataset to train the machine learning model, and we already know the
output.
Test set: A subset of dataset to test the machine learning model, and by using the test set, model
predicts the output.
For splitting the dataset, we will use the below lines of code:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=1)
print(x_train)
print(x_test)
print(y_train)
print(y_test)
print('\n')
Explanation:
o In the above code, the first line is used for splitting arrays of the dataset into random train and test subsets.
o In the second line, we have used four variables for our output: x_train, x_test, y_train and y_test.
o In the train_test_split() function, we have passed four parameters, of which the first two are the arrays of data,
and test_size specifies the size of the test set. The test_size may be 0.5, 0.3, or 0.2, which gives the dividing
ratio between the training and test sets.
o The last parameter random_state is used to set a seed for a random generator so that you always get the same
result, and the most used value for this is 42.
Output:
By executing the above code, we will get 4 different variables, which can be seen under the variable
explorer section.
As we can see in the above image, the x and y variables are divided into 4 different variables with
corresponding values.
7) Feature Scaling
Feature scaling is the final step of data pre-processing in machine learning. It is a technique to
standardize the independent variables of the dataset in a specific range. In feature scaling, we put
our variables in the same range and on the same scale so that no variable dominates the others.
If we compute any two values from age and salary, then salary values will dominate the age values,
and it will produce an incorrect result. So to remove this issue, we need to perform feature scaling
for machine learning.
There are two ways to perform feature scaling in machine learning:
Standardization
Normalization
For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing library as:
from sklearn.preprocessing import StandardScaler
Now, we will create the object of StandardScaler class for independent variables or features. And
then we will fit and transform the training dataset.
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
For test dataset, we will directly apply transform() function instead of fit_transform() because it is
already done in training set.
x_test= st_x.transform(x_test)
Output:
By executing the above lines of code, we will get the scaled values for x_train and x_test as:
x_train:
Note: Here, we have not scaled the dependent variable because it has only two values, 0 and 1.
But if a variable has a wider range of values, then we will also need to scale it.
Now, in the end, we can combine all the steps together to make our complete code more
understandable.
import numpy as np
import matplotlib.pyplot as mtp
import pandas as pd
data_set= pd.read_csv('Data.csv')
print(data_set)
print("\n")
x= data_set.iloc[:,:-1].values
print(x)
print('\n')
y=data_set.iloc[:,3].values
print(y)
print('\n')
#Handling missing data (Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
print(x)
print('\n')
# Encoding the independent categorical variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
print(x)
print('\n')
# Encoding the Dependent Variable
le = LabelEncoder()
y = le.fit_transform(y)
print(y)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
print(x_train)
print(x_test)
print(y_train)
print(y_test)
print('\n')
# Feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
print(x_train)
print(x_test)
In the above code, we have included all the data pre-processing steps together. But there are some
steps or lines of code which are not necessary for all machine learning models. So we can exclude
them from our code to make it reusable for all models.
o We want to find out if there is any correlation between years of experience and salary.
In this section, we will create a Simple Linear Regression model to find the best-fitting line representing
the relationship between these two variables.
To implement the Simple Linear regression model in machine learning using Python, we need to follow the
below steps:
The first step for creating the Simple Linear Regression model is data pre-processing. But there will be some
changes, which are given in the below steps:
o First, we will import the three important libraries, which will help us for loading the dataset, plotting the graphs,
and creating the Simple Linear Regression model.
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Next, we will load the dataset into our code: Download dataset
data_set= pd.read_csv('Salary_Data.csv')
By executing the above line of code (ctrl+ENTER), we can read the dataset on our Spyder IDE screen by clicking on
the variable explorer option.
o After that, we need to extract the dependent and independent variables from the given dataset. The independent
variable is years of experience, and the dependent variable is salary. Below is code for it:
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values
In the above lines of code, for the x variable we have taken :-1 since we want to remove the last column from the
dataset. For the y variable we have taken 1 as the parameter, since we want to extract the second column and indexing
starts from zero.
By executing the above line of code, we will get the output for X and Y variable as:
o Next, we will split both variables into the test set and training set. We have 30 observations, so we will take 20
observations for the training set and 10 observations for the test set. We are splitting our dataset so that we can
train our model using a training dataset and then test the model using a test dataset. The code for this is given
below:
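The split code is not reproduced in this copy; a minimal sketch, assuming test_size = 1/3 (so that 10 of the 30 observations go to the test set) and random_state = 0:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)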
By executing the above code, we will get x-test, x-train and y-test, y-train dataset. Consider the
below images:
Test-dataset:
o For simple linear Regression, we will not use Feature Scaling. Because Python libraries take care of it for some
cases, so we don't need to perform it here. Now, our dataset is well prepared to work on it and we are going to
start building a Simple Linear Regression model for the given problem.
Now the second step is to fit our model to the training dataset. To do so, we will import the LinearRegression class
of the linear_model library from the scikit learn. After importing the class, we are going to create an object of the class
named as a regressor. The code for this is given below:
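A sketch of the fitting step described here:
# Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)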
In the above code, we have used the fit() method to fit our Simple Linear Regression object to the
training set. In the fit() function, we have passed x_train and y_train, which are our training data
for the independent and dependent variables. We have fitted our regressor object to the training
set so that the model can easily learn the correlations between the predictor and target variables.
After executing the above lines of code, we will get the below output.
Output:
The regressor has now learned the relationship between the dependent variable (Salary) and the independent variable
(Experience). So our model is ready to predict the output for new observations. In this step, we will provide the test
dataset (new observations) to the model to check whether it can predict the correct output or not.
We will create a prediction vector y_pred, and x_pred, which will contain predictions of test dataset, and prediction of
training set respectively.
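A sketch of the prediction step:
# Prediction of test and training set results
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)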
On executing the above lines of code, two variables named y_pred and x_pred will be generated in the variable explorer
options, containing the salary predictions for the test set and the training set respectively.
Now in this step, we will visualize the training set result. To do so, we will use the scatter() function of the pyplot library,
which we have already imported in the pre-processing step. The scatter () function will create a scatter plot of
observations.
In the x-axis, we will plot the Years of Experience of employees and on the y-axis, salary of employees. In the function,
we will pass the real values of training set, which means a year of experience x_train, training set of Salaries y_train, and
color of the observations. Here we are taking a green color for the observation, but it can be any color as per the choice.
Now, we need to plot the regression line, so for this, we will use the plot() function of the pyplot library. In this
function, we will pass the years of experience for training set, predicted salary for training set x_pred, and color of the
line.
Next, we will give the title for the plot. So here, we will use the title() function of the pyplot library and pass the name
("Salary vs Experience (Training Dataset)").
After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel() function.
Finally, we will represent all above things in a graph using show(). The code is given below:
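A sketch of the plotting code described above (the axis label strings are assumptions):
mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()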
Output:
By executing the above lines of code, we will get the below graph plot as an output.
In the above plot, we can see the real values observations in green dots and predicted values are
covered by the red regression line. The regression line shows a correlation between the dependent
and independent variable.
The good fit of the line can be observed by calculating the difference between actual values and
predicted values. But as we can see in the above plot, most of the observations are close to the
regression line, hence our model is good for the training set.
In the previous step, we have visualized the performance of our model on the training set. Now, we will do the same for
the Test set. The complete code will remain the same as the above code, except in this, we will use x_test, and y_test
instead of x_train and y_train.
Here we are also changing the color of observations and regression line to differentiate between the two plots, but it is
optional.
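A sketch of the test-set plot, mirroring the training-set code with the colours described below:
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()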
By executing the above line of code, we will get the output as:
In the above plot, there are observations given by the blue color, and prediction is given by the red regression line. As
we can see, most of the observations are close to the regression line, hence we can say our Simple Linear Regression is
a good model and able to make good predictions.
#importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
data_set= pd.read_csv('Salary_Data.csv')
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values
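The listing above stops after the extraction step; a hedged sketch of the remaining steps, mirroring the snippets shown earlier in this experiment:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)
# Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
# Prediction of test and training set results
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)
# Visualizing the training set results
mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()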
Resources:
salary_data.csv
simple_linear_regression.py
The dataset contains information about UserID, Gender, Age, EstimatedSalary, and Purchased. Use this dataset to predict
whether a user will purchase the company's newly launched product or not, using a Logistic Regression model.
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use the
same steps as we have done in previous topics of Regression. Below are the steps:
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use
it in our code efficiently. It will be the same as we have done in Data pre-processing topic. The code
for this is given below:
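A sketch of the pre-processing imports and dataset loading (the file name user_data.csv is taken from the K-NN section, which states that its pre-processing is identical to this one):
# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# Importing the dataset
data_set = pd.read_csv('user_data.csv')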
By executing the above lines of code, we will get the dataset as the output. Consider the given
image:
Now, we will extract the dependent and independent variables from the given dataset. Below is the
code for it:
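A sketch of the extraction, following the indices described in the next sentence:
# Extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values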
In the above code, we have taken [2, 3] for x because our independent variables are age and salary,
which are at index 2, 3. And we have taken 4 for y variable because our dependent variable is at
index 4. The output will be:
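The train/test split is not shown in this copy; a sketch, assuming the 75/25 split and random_state = 0 used in the later K-NN section:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)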
In logistic regression, we will do feature scaling because we want accurate prediction results.
Here we will only scale the independent variables, because the dependent variable has only 0 and 1
values. Below is the code for it:
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
We have well prepared our dataset, and now we will train the dataset using the training set. For
providing training or fitting the model to the training set, we will import
the LogisticRegression class of the sklearn library.
After importing the class, we will create a classifier object and use it to fit the model to the logistic
regression. Below is the code for it:
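A sketch of the fitting step:
# Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)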
Output: By executing the above code, we will get the below output:
Our model is well trained on the training set, so we will now predict the result by using test set data.
Below is the code for it:
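A sketch of the prediction step described in the next sentence:
# Predicting the test set result
y_pred = classifier.predict(x_test)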
In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under the variable
explorer option. It can be seen as:
The above output image shows the corresponding predicted users who want to purchase or not
purchase the car.
Now we will create the confusion matrix here to check the accuracy of the classification. To create it,
we need to import the confusion_matrix function of the sklearn library. After importing the function,
we will call it using a new variable cm. The function takes two parameters, mainly y_true (the actual
values) and y_pred (the target values returned by the classifier). Below is the code for it:
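A sketch of the confusion-matrix step:
# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)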
Output:
By executing the above code, a new confusion matrix will be created. Consider the below image:
We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above
output, we can interpret that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect.
Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:
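A sketch of the visualization code summarized in the next paragraph, continuing from the classifier fitted above (the purple/green colours follow the description of the output):
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()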
In the above code, we have imported the ListedColormap class of Matplotlib library to create the
colormap for visualizing the result. We have created two new variables x_set and y_set to
replace x_train and y_train. After that, we have used the nm.meshgrid command to create a
rectangular grid, which has a range of -1(minimum) to 1 (maximum). The pixel points we have taken
are of 0.01 resolution.
To create a filled contour, we have used mtp.contourf command, it will create regions of provided
colors (purple and green). In this function, we have passed the classifier.predict to show the
predicted data points predicted by the classifier.
Output: By executing the above code, we will get the below output:
o In the above graph, we can see that there are some Green points within the green region
and Purple points within the purple region.
o All these data points are the observation points from the training set, which shows the result
for purchased variables.
We have successfully visualized the training set result for the logistic regression, and our goal for
this classification is to divide the users who purchased the SUV car and who did not purchase the
car. So from the output graph, we can clearly see the two regions (Purple and Green) with the
observation points. The Purple region is for those users who didn't buy the car, and Green Region is
for those users who purchased the car.
Linear Classifier:
As we can see from the graph, the classifier is a Straight line or linear in nature as we have used the
Linear model for Logistic Regression. In further topics, we will learn for non-linear Classifiers.
Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (Test set). The code for the test set will remain same as above except that here we will
use x_test and y_test instead of x_train and y_train. Below is the code for it:
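A sketch of the test-set visualization (identical to the training-set code, with the test data substituted):
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()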
Output:
The above graph shows the test set result. As we can see, the graph is divided into two regions
(Purple and Green). And Green observations are in the green region, and Purple observations are in
the purple region. So we can say it is a good prediction and model. Some of the green and purple
data points are in different regions, which can be ignored as we have already calculated this error
using the confusion matrix (11 Incorrect output).
Hence our model is pretty good and ready to make new predictions for this classification problem.
Implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms which helps in building
the fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis, and classifying
articles.
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the
occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red,
spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying
it as an apple, without depending on the other features.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, and is used to determine the probability of a
hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we need to follow the below steps:
Problem: If the weather is sunny, then the Player should play or not?
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes that
these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomial
distributed. It is primarily used for document classification problems, it means a particular
document belongs to which category such as Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the predictor
variables are the independent Booleans variables. Such as if a particular word is present or
not in a document. This model is also famous for document classification tasks.
Steps to implement:
o Data Pre-processing step
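The loading and extraction lines are not present in this copy; a minimal sketch is given below (the column indices are an assumption based on the similar user_data.csv examples elsewhere in this manual):
# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset (file name as quoted in the explanation below)
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values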
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
print ('\n the total number of Training Data :',y_train.shape)
print ('\n the total number of Test Data :',y_test.shape)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
In the above code, we have loaded the dataset into our program using
"dataset = pd.read_csv('user_data.csv')". The loaded dataset is divided into a training set and a test set, and then the feature variables are scaled.
In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We can
also use other classifiers as per our requirement.
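The prediction and accuracy steps of this experiment are not shown in this copy; a minimal sketch, continuing from the classifier fitted above:
# Predicting the test set results and computing the accuracy
y_pred = classifier.predict(x_test)
from sklearn.metrics import confusion_matrix, accuracy_score
print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
print('Accuracy of the classifier:', accuracy_score(y_test, y_pred))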
Construct a Bayesian network considering medical data, and use it to demonstrate the diagnosis of heart patients using the standard Heart Disease Data Set. You can use a Python ML library such as pgmpy.
Theory
A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional
dependency, and each node corresponds to a unique random variable.
Bayesian network consists of two major parts: a directed acyclic graph and a set of conditional
probability distributions
For illustration, consider the following example. Suppose we attempt to turn on our computer, but
the computer does not start (observation/evidence). We would like to know which of the possible
causes of computer failure is more likely. In this simplified illustration, we assume only two possible
causes of this misfortune: electricity failure and computer malfunction.
Fig: Directed acyclic graph representing two independent possible causes of a computer failure.
Data Set:
Title: Heart Disease Databases
The Cleveland database contains 76 attributes, but all published experiments refer to using a subset
of 14 of them. In particular, the Cleveland database is the only one that has been used by ML
researchers to this date. The "Heartdisease" field refers to the presence of heart disease in the
patient. It is integer valued from 0 (no presence) to 4.
Database: 0 1 2 3 4 Total
Cleveland: 164 55 36 35 13 303
Attribute Information:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal Heartdisease
63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
62 0 4 140 268 0 2 160 0 3.6 3 2 3 3
60 1 4 130 206 0 2 132 1 2.4 2 2 7 4
Program:
import numpy as np
import pandas as pd
import csv
import pgmpy
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.models import BayesianModel
from pgmpy.inference import VariableElimination
heartDisease = pd.read_csv('heart.csv')
heartDisease = heartDisease.replace('?',np.nan)
model=
BayesianModel([('age','heartdisease'),('sex','heartdisease'),('exang','heartdisease'),('cp','heartdisease'),
('heartdisease','restecg'),('heartdisease','chol')])
print('\nLearning CPD using Maximum likelihood estimators')
model.fit(heartDisease,estimator=MaximumLikelihoodEstimator)
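The listing stops after fitting the model; a sketch of the inference step using the VariableElimination class imported above (the evidence values are illustrative, and the column names are assumed to match those used in the model definition):
print('\nInferencing with Bayesian Network:')
HeartDisease_infer = VariableElimination(model)
print('\n1. Probability of HeartDisease given evidence restecg=1')
q1 = HeartDisease_infer.query(variables=['heartdisease'], evidence={'restecg': 1})
print(q1)
print('\n2. Probability of HeartDisease given evidence cp=2')
q2 = HeartDisease_infer.query(variables=['heartdisease'], evidence={'cp': 2})
print(q2)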
Demonstrate the working of the decision tree algorithm. Use an appropriate data set for building the decision tree and apply this knowledge to classify a new
sample.
Decision Tree Classification Algorithm
o Decision Tree is a Supervised learning technique that can be used for both classification and Regression
problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision nodes are used
to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do
not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further
branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree
algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the
child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real dataset)
attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide the S into subsets that contains possible values for the best attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final nodes are the leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary
attribute by ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node further gets split into
one decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offers and Declined offer). Consider the below diagram:
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes.
To solve such problems there is a technique known as the Attribute Selection Measure, or ASM. By this measurement, we can easily select the best
attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset based on an
attribute.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute having
the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Average) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in
data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and
Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the important
features of the dataset. A technique that decreases the size of the learning tree without
reducing accuracy is therefore known as Pruning. There are mainly two types of tree pruning technology
used:
o Cost Complexity Pruning
o Reduced Error Pruning
Disadvantages of the Decision Tree:
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
Steps will also remain the same, which are given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have pre-processed the data: the dataset is loaded, split into training and test sets, and the
features are scaled.
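A sketch of the fitting step described below; the parameters follow the Out[8] record shown further down:
# Fitting a Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)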
In the above code, we have created a classifier object, in which we have passed two main
parameters:
o "criterion='entropy'": Criterion is used to measure the quality of the split, which is calculated by the information gain
given by entropy.
o "random_state=0": For generating the random states.
Out[8]:
DecisionTreeClassifier(class_weight=None,criterion='entropy',
max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=0, splitter='best')
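A sketch of the prediction step whose output is described below:
# Predicting the test set result
y_pred = classifier.predict(x_test)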
Output:
In the below output image, the predicted output and real test output are given. We can clearly see
that there are some values in the prediction vector, which are different from the real vector values.
These are prediction errors.
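A sketch of the confusion-matrix step referred to later in this experiment:
# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)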
Output:
Output:
The above output is completely different from the rest classification models. It has both vertical and
horizontal lines that are splitting the dataset according to the age and estimated salary variable.
As we can see, the tree is trying to capture each dataset, which is the case of overfitting.
Output:
As we can see in the above image that there are some green data points within the purple region
and vice versa. So, these are the incorrect predictions which we have discussed in the confusion
matrix.
OR
Algorithm Concepts
Entropy (Attribute): the entropy of an attribute is calculated in the same way as we calculated it for the system (the whole
data set).
Information Gain: Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) × Entropy(Sv)
1. If all examples are positive, return the single-node tree with label = +.
2. If all examples are negative, return the single-node tree with label = −.
3. If the attribute list is empty, return the single-node tree labelled with the most common value of the target attribute.
Otherwise:
1. The attribute that has the most information gain becomes the root node; create a branch for each of its values and
process each branch in the same way as we have done for the parent (root) node.
2. Again, the feature which has the maximum information gain will become a node, and this process
will continue until we get the leaf nodes.
Output
def entropy(probs):
    import math
    # entropy of a list of class probabilities
    return sum([-prob * math.log(prob, 2) for prob in probs])

def entropy_of_list(a_list):
    from collections import Counter
    # count how many times each class label occurs
    cnt = Counter(x for x in a_list)
    num_instances = len(a_list)
    probs = [x / num_instances for x in cnt.values()]
    return entropy(probs)

total_entropy = entropy_of_list(df_tennis['PT'])
Output
collections.Counter()
A counter is a container that stores elements as dictionary keys, and their counts are stored as
dictionary values.
print("target_attribute_name",target_attribute_name)
df_split = df.groupby(split_attribute_name)
print("Name: ",name)
print("Group: ",group)
print("NOBS",nobs)
print("df_agg_ent",df_agg_ent)
old_entropy = entropy_of_list(df[target_attribute_name])
In the same way, we calculate the information gain of the remaining attributes, and the
attribute with the highest information gain is chosen as the best attribute.
o K-NN algorithm assumes the similarity between the new case/data and available cases and
put the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on the
similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used
for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and, at the time of classification, it performs an
action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data, then
it classifies that data into a category that is much similar to the new data.
o Example: Suppose, we have an image of a creature that looks similar to cat and dog, but we
want to know either it is a cat or dog. So for this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the similar features of
the new data set to the cats and dogs images and based on the most similar features it will
put it in either cat or dog category.
o Step-1: Select the number K of the neighbors.
o Step-2: Calculate the Euclidean distance between the new data point and the training points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. It
can be calculated as:
o As we can see, 3 of the 5 nearest neighbors are from category A and 2 are from category B;
hence this new data point must belong to category A.
o The computation cost is high because of calculating the distance between the data points for all the training
samples.
Problem for K-NN Algorithm: There is a Car manufacturer company that has manufactured a new
SUV car. The company wants to give the ads to the users who are interested in buying that SUV. So
for this problem, we have a dataset that contains multiple user's information through the social
network. The dataset contains lots of information but the Estimated Salary and Age we will
consider for the independent variable and the Purchased variable is for the dependent variable.
Below is the dataset:
The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the code
for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
By executing the above code, our dataset is imported to our program and well pre-processed. After
feature scaling our test dataset will look like:
From the above output image, we can see that our data is successfully scaled.
And then we will fit the classifier to the training data. Below is the code for it:
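A sketch of the fitting step; the parameter values follow the Out[10] record shown below:
# Fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)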
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we did in Logistic
Regression. Below is the code for it:
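A sketch of the prediction step:
# Predicting the test set result
y_pred = classifier.predict(x_test)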
Output:
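Creating the Confusion Matrix: a sketch of the step explained in the next sentence:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)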
In the above code, we have imported the confusion_matrix function and called it using the variable cm.
Output: By executing the above code, we will get the matrix as below:
In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7 incorrect
predictions, whereas, in Logistic Regression, there were 11 incorrect predictions. So we can say that
the performance of the model is improved by using the K-NN algorithm.
Now, we will visualize the training set result for K-NN model. The code will remain same as
we did in Logistic Regression, except the name of the graph. Below is the code for it:
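The opening lines of this snippet did not survive in this copy; a minimal reconstruction is sketched below, continuing from the classifier fitted above (the red/green colours follow the description of the output graph):
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))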
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
The output graph is different from the graph we obtained in Logistic Regression. It can
be understood from the below points:
o The graph is showing an irregular boundary instead of showing any straight line or any curve
because it is a K-NN algorithm, i.e., finding the nearest neighbor.
o The graph has classified users in the correct categories as most of the users who didn't buy
the SUV are in the red region and users who bought the SUV are in the green region.
o The graph is showing a good result, but still there are some green points in the red region and
red points in the green region. This is not a big issue, as by doing this the model is prevented
from overfitting.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

iris=datasets.load_iris()
x = iris.data
y = iris.target
print(x)
print('class: 0-Iris-Setosa, 1- Iris-Versicolour, 2- Iris-Virginica')
print(y)
# Splits the dataset into 70% train data and 30% test data. This means that out of total 150 records,
# the training set will contain 105 records and the test set contains 45 of those records.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
# n_neighbors=5 is an assumed value; any small odd k works here
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
print('Confusion Matrix')
print(confusion_matrix(y_test,y_pred))
print('Accuracy Metrics')
print(classification_report(y_test,y_pred))