ML Lab Experiments
LABORATORY MANUAL
Hardware: Intel-based desktop PC with a 2.6 GHz or faster processor, at least 1 GB of RAM, 40 GB of free disk space, and a LAN connection.
Software: Python 3 with the NumPy, Pandas, Matplotlib and scikit-learn libraries.
The purpose of this laboratory is to acquaint students with an overview of the various machine
learning techniques and to demonstrate them using Python. The course gives hands-on practice
in preparing data, training and testing models, and applying standard algorithms such as
regression and classification to practical problems.
Course Objectives:
The objective of this lab is to give an overview of the various machine learning
techniques and to be able to demonstrate them using Python.
To introduce students to the basic concepts of Data Science and techniques of
Machine Learning.
To develop skills of using recent machine learning software for solving practical
problems.
To gain experience of doing independent study and research.
Course Outcomes:
After the completion of the course, the student will be able to:
LAB CODE
Students should report to the concerned lab as per the time table.
Students who turn up late to the labs will in no case be permitted to do the
program scheduled for the day.
After completion of the program, certification of the concerned staff in-charge in the
observation book is necessary.
Student should bring a notebook of 100 pages and should enter the readings
/observations into the notebook while performing the experiment.
The record of observations, along with the detailed experimental procedure of the
experiment performed in the immediately preceding session, should be submitted to and
certified by the staff member in-charge.
Not more than 3-students in a group are permitted to perform the experiment on
the set.
The group-wise division made in the beginning should be adhered to and no mix up
of students among different groups will be permitted.
When the experiment is completed, students should disconnect the setup made by them
and return all the components/instruments taken for the purpose.
Sr. No.    Experiment No.    Name of the Experiment    Page No.
01    Exp-01    Write a program to demonstrate the following:
a) Operation of data types in Python.
b) Different Arithmetic Operations on numbers in Python.
c) Create, concatenate and print a string and access a substring from a given string.
d) Append to and remove from lists in Python.
e) Demonstrate working with tuples in Python.
f) Demonstrate working with dictionaries in Python.
02    Exp-02    Using Python, write a NumPy program to compute the:
a) Expected Value
b) Mean
c) Standard Deviation
d) Variance
e) Covariance
f) Covariance Matrix of two given arrays.
03 Exp-03
04 Exp-04
05 Exp-05
06 Exp-06
07 Exp-07
08 Exp-08
1) Arithmetic Operators
2) Relational Operators OR Comparison Operators
3) Logical operators
4) Bitwise operators
5) Assignment operators
6) Special operators
Arithmetic operators are used with numeric values to perform common mathematical
operations:
There are 7 arithmetic operators in Python :
Source Code:
a=10
b=2
print('a+b=',a+b)
print('a-b=',a-b)
print('a*b=',a*b)
print('a/b=',a/b)
print('a//b=',a//b)
print('a%b=',a%b)
print('a**b=',a**b)
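For a = 10 and b = 2, the program prints:
OUTPUT
a+b= 12
a-b= 8
a*b= 20
a/b= 5.0
a//b= 5
a%b= 0
a**b= 100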
֍ The / operator always performs floating point division, so it always returns a float value.
֍ Floor division (//) can perform both floating point and integer arithmetic: if both arguments
are of int type the result is int; if at least one argument is a float, the result is a float.
c. Create, concatenate and print a string and access substring from a given string.
String:
A Python string is a collection of characters surrounded by single quotes, double quotes,
or triple quotes. The computer does not understand the characters; internally, it stores
each character as a combination of 0s and 1s.
Each character is encoded in ASCII or Unicode. So we can say that Python strings are
also collections of Unicode characters.
Syntax: a string literal can be written as str1 = 'Hello', str2 = "Hello", or str3 = '''Hello'''.
String concatenation is the process of merging one string with another string. It can be
done in the following ways.
1. Using the + operator
# Defining strings
str1 = "Online Smart "
str2 = "Trainer"
# Concatenating the strings with +
str3 = str1 + str2
print(str3)   # Online Smart Trainer
2. Using the join() method
The join() method concatenates the elements of a sequence, placing the separator string between them.
str1 = "Online"
str2 = "SmartTrainer"
# join() method is used to combine the string with a separator Space(" ")
str3 = " ".join([str1, str2])
print(str3)
Accessing a substring (slicing):
s = "OnlineSmartTrainer"   # example string; any string works here
start = s[:3]
end = s[3:]
#result
print("Resultant substring from start:", start)
print("Resultant substring from end:", end)
List
Example
Create a List:
thislist = ["apple", "banana", "cherry"]
print(thislist)
List Items
List items are ordered, changeable, and allow duplicate values.
List items are indexed: the first item has index [0], the second item has index [1], and so on.
List Length
To determine how many items a list has, use the len() function:
Example
Print the number of items in the list:
thislist = ["apple", "banana", "cherry"]
print(len(thislist))
OUTPUT
3
type()
From Python's perspective, lists are defined as objects with the data type 'list':
Example
What is the data type of a list?
mylist = ["apple", "banana", "cherry"]
print(type(mylist))
OUTPUT
<class 'list'>
List Methods
Python has a set of built-in methods that you can use on lists.
Method Description
extend() Add the elements of a list (or any iterable), to the end of the current list
index() Returns the index of the first element with the specified value
#Create a list
thislist = ["apple", "banana", "cherry"]
print(thislist)
OUTPUT
['apple', 'banana', 'cherry']
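The append/remove practice snippets for experiment item (d) are not reproduced in this copy; a minimal sketch using the list created above:
# Append an item to the list
thislist.append("orange")
print(thislist)        # ['apple', 'banana', 'cherry', 'orange']
# Remove an item from the list
thislist.remove("banana")
print(thislist)        # ['apple', 'cherry', 'orange']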
Example
Create a Tuple:
thistuple = ("apple", "banana", "cherry")
print(thistuple)
OUTPUT
('apple', 'banana', 'cherry')
Tuple Items
Tuple items are ordered, unchangeable, and allow duplicate values.
Tuple items are indexed, the first item has index [0], the second item has index [1] etc.
Tuple Length
To determine how many items a tuple has, use the len() function:
Example
thistuple = ("apple", "banana", "cherry")
print(len(thistuple))
OUTPUT
3
#NOT a tuple (one item without a trailing comma is just a string)
thistuple = ("apple")
print(type(thistuple))
OUTPUT
<class 'str'>
type()
From Python's perspective, tuples are defined as objects with the data type 'tuple':
Example
What is the data type of a tuple?
mytuple = ("apple", "banana", "cherry")
print(type(mytuple))
OUTPUT
<class 'tuple'>
Practice:
#Create a tuple
thistuple = ("apple", "banana", "cherry")
print(thistuple)
#Delete a tuple
thistuple = ("apple", "banana", "cherry")
del thistuple
print(thistuple) #this will raise an error because the tuple no longer exists
Dictionary
A dictionary stores data as key:value pairs.
Example
Create and print a dictionary:
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}
print(thisdict)
OUTPUT
{'brand': 'Ford', 'model': 'Mustang', 'year': 1964}
Dictionary Items
Dictionary items are changeable and do not allow duplicate keys. (Since Python 3.7, dictionaries also preserve insertion order.)
Dictionary items are presented in key:value pairs, and can be referred to by using the key name.
Example
Print the "brand" value of the dictionary:
thisdict ={
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
print(thisdict["brand"])
OUTPUT
Ford
Dictionary Length
To determine how many items a dictionary has, use the len() function:
Example
Print the number of items in the dictionary:
print(len(thisdict))
OUTPUT
3
type()
From Python's perspective, dictionaries are defined as objects with the data type 'dict':
Example
Print the data type of a dictionary:
thisdict ={
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
print(type(thisdict))
OUTPUT
<class 'dict'>
# using dict()
my_dict = dict({1: 'apple', 2: 'ball'})
print(my_dict)
OUTPUT
{1: 'apple', 2: 'ball'}
# Accessing elements of a dictionary
my_dict = {'name': 'Jack', 'age': 26}
# Output: Jack
print(my_dict['name'])
# Output: 26
print(my_dict.get('age'))
# my_dict['address'] would raise a KeyError because the key does not exist
# update value
my_dict['age'] = 27
# Output: {'name': 'Jack', 'age': 27}
print(my_dict)
Dictionary Methods
Python has a set of built-in methods that you can use on dictionaries.
Method Description
items() Returns a list containing a tuple for each key value pair
a) Expected Value
In probability, the average value of some random variable X is called the expected value or
the expectation.
For example, the following probability distribution tells us the probability that a certain
soccer team scores a certain number of goals in a given game:
To find the expected value, E(X), or mean μ of a discrete random variable X, simply multiply
each value of the random variable by its probability and add the products. The formula is
given as E(X)=μ=∑xP(x).
Here x represents values of the random variable X, P(x) represents the corresponding
probability, and symbol ∑ represents the sum of all products xP(x). Here we use symbol μ for
the mean because it is a parameter. It represents the mean of a population.
For example, the expected number of goals for the soccer team would be calculated as:
import numpy as np
#define probabilities
probs = [.18, .34, .35, .11, .02]
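The snippet above stops after defining the probabilities. A minimal completion, assuming the five probabilities correspond to 0 through 4 goals as in the distribution described above:
# assumed goal values corresponding to the probabilities above
vals = [0, 1, 2, 3, 4]
# expected value: E(X) = sum of x * P(x)
expected_value = np.sum(np.multiply(vals, probs))
# E(X) = 0*0.18 + 1*0.34 + 2*0.35 + 3*0.11 + 4*0.02 = 1.45 goals
print(expected_value)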
b) Mean:
The arithmetic mean is the sum of the elements along an axis divided by the number of
elements. In other words, it is the sum divided by the count.
The numpy.mean() function is used to compute the arithmetic mean along the specified
axis.
This function returns the average of the array elements. By default, the average is taken on
the flattened array.
# 1D array
import numpy as np
a = [20, 2, 7, 1, 34]
print("array : ", a)
print("Mean of array : ", np.mean(a))
# 2D Array
import numpy as np
x = np.array([[10, 30], [20, 60]])
print("Original array:")
print(x)
print("Mean of each column:")
print(x.mean(axis=0))
print("Mean of each row:")
print(x.mean(axis=1))
In probability theory and statistics, variance is the expectation of the squared deviation of
a random variable from its population mean or sample mean.
In other words, it is the average of the squared differences from the mean.
import numpy as np
# 1D array
a = [20, 2, 7, 1, 34]
print("Array : ", a)
print("Variance of array : ", np.var(a))
# 2D array
arr = [[2, 2, 2, 2, 2],
       [15, 6, 27, 8, 2],
       [23, 2, 54, 1, 2]]
print("Variance of array : ", np.var(arr))
d) Standard deviation
Its symbol is σ (the greek letter sigma)
It is the square root of the Variance.
The numpy module of Python provides a function called numpy.std(), used to compute the
standard deviation along the specified axis. This function returns the standard deviation of
the array elements.
The square root of the average square deviation (computed from the mean), is known as the
standard deviation.
By default, the standard deviation is calculated for the flattened array.
Source Program-1
import numpy as np
# Original array
array = np.arange(10)
print(array)
r1 = np.mean(array)
print("\nMean: ", r1)
r2 = np.std(array)
print("\nstd: ", r2)
r3 = np.var(array)
print("\nvariance: ", r3)
Source Program-2
import numpy as np
# Original array
array = np.arange(10)
print(array)
r1 = np.average(array)
print("\nMean: ", r1)
Source Program-3
import numpy as np
# 2D array
a = np.array([[1, 4, 7, 10], [2, 5, 8, 11]])
print("Standard deviation of the flattened array : ", np.std(a))
print("Standard deviation along each column : ", np.std(a, axis=0))
For example, the covariance between two random variables X and Y can be calculated using
the following formula (for population):
Cov(X, Y) = Σ (Xi − μX)(Yi − μY) / N
where μX and μY are the means of X and Y, and N is the number of observations.
Source Program-2
import numpy as np
array1 = np.array([0, 1, 1])
array2 = np.array([2, 2, 1])
# Original array1
print(array1)
# Original array2
print(array2)
# Covariance matrix
print("\nCovariance matrix of the said arrays:\n", np.cov(array1, array2))
import numpy as np
array1 = np.array([1, 2])
array2 = np.array([1, 2])
# Original array1
print(array1)
# Original array2
print(array2)
# Covariance matrix
print("\nCovariance matrix of the said arrays:\n",np.cov(array1, array2))
Data pre-processing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.
When creating a machine learning project, we do not always come across clean and
formatted data, and while doing any operation with data it is mandatory to clean it and put
it in a formatted way. For this, we use the data pre-processing task.
Real-world data generally contains noise and missing values, and may be in an unusable
format which cannot be used directly for machine learning models.
Data pre-processing is the required task for cleaning the data and making it suitable for a
machine learning model, which also increases the accuracy and efficiency of the machine
learning model.
It involves the following steps:
1) Getting the dataset
2) Importing libraries
3) Importing the dataset
4) Handling missing data
5) Encoding categorical data
6) Splitting the dataset into training and test sets
7) Feature scaling
CSV stands for "Comma-Separated Values" files; it is a file format which allows us to save
the tabular data, such as spreadsheets.
It is useful for huge datasets, and these datasets can be used in programs.
Here we will use a demo dataset for data pre-processing; for practice, it can be
downloaded from "https://www.superdatascience.com/pages/machine-learning".
For real-world problems, we can download datasets online from various sources such as
1) https://www.kaggle.com/uciml/datasets
2) https://archive.ics.uci.edu/ml/index.php
We can also create our dataset by gathering data using various API with Python and put that
data into a .csv file.
2) Importing Libraries
In order to perform data pre-processing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data pre-processing, which are:
1. Numpy: The Numpy Python library is used for including any type of mathematical operation in
the code. It is the fundamental package for scientific calculation in Python. It also adds support
for large, multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used in the whole
program.
2. Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and with
this library, we need to import the sub-library pyplot. This library is used to plot any type of
chart in Python. It is imported as below:
import matplotlib.pyplot as mtp
3. Pandas: The last library is the Pandas library, which is one of the most famous Python
libraries and used for importing and managing the datasets. It is an open-source data
manipulation and analysis library. It will be imported as below:
import pandas as pd
Here, we have used pd as a short name for this library. Consider the below code:
Note: We can set any directory as a working directory, but it must contain the required dataset.
Here, in the below image, we can see the Python file along with required dataset. Now, the current
folder is set as a working directory.
Now to import the dataset, we will use read_csv() function of pandas library, which is used to read
a csv file and performs various operations on it. Using this function, we can read a csv file locally as
well as through a URL.
data_set= pd.read_csv('Dataset.csv')
Here, data_set is a name of the variable to store our dataset, and inside the function, we have
passed the name of our dataset. Once we execute the above line of code, it will successfully import
the dataset in our code. We can also check the imported dataset by clicking on the section variable
explorer, and then double click on data_set. Consider the below image:
As in the above image, indexing is started from 0, which is the default indexing in Python. We can
also change the format of our dataset by clicking on the format option.
To extract an independent variable, we will use iloc[ ] method of Pandas library. It is used to extract
the required rows and columns from the dataset.
x= data_set.iloc[:,:-1].values
In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is for all the
columns. Here we have used :-1, because we don't want to take the last column as it contains the
dependent variable. So by doing this, we will get the matrix of features.
As we can see in the above output, there are only three variables.
y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only. It will give the array of dependent
variables.
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values. In this
way, we just delete the specific row or column which consists of null values. But this way is not very
efficient, and removing data may lead to loss of information, which will not give an accurate output.
By calculating the mean: In this way, we calculate the mean of the column or row which
contains the missing values and put it in place of the missing value. This strategy is useful for
features which have numeric data such as age, salary, year, etc. Here, we will use this approach.
To handle missing values, we will use the Scikit-learn library in our code, which contains various
classes for building machine learning models. Here we will use the SimpleImputer class
of the sklearn.impute library. Below is the code for it:
#Handling missing data (Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=nm.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
print(x)
Output:
Since machine learning models work entirely on mathematics and numbers, a categorical variable
in the dataset may create trouble while building the model. So it is necessary to encode these
categorical variables into numbers.
Firstly, we will convert the country variables into categorical data. So to do this, we will
use LabelEncoder() class from preprocessing library.
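A sketch of the encoding step, assuming the country column is at index 0; the LabelEncoder follows the text above, and the ColumnTransformer/OneHotEncoder pair produces the three dummy columns described in the explanation below:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = nm.array(ct.fit_transform(x))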
Output:
Explanation:
As we can see in the above output, all the variables are encoded into numbers 0 and 1 and divided
into three columns.
It can be seen more clearly in the variables explorer section, by clicking on x option as:
Suppose we have trained our machine learning model on one dataset and then test it on a
completely different dataset. It will then be difficult for our model to understand the
correlations between the features and the output.
If we train our model very well and its training accuracy is also very high, but we provide a new
dataset to it, then it will decrease the performance. So we always try to make a machine learning
model which performs well with the training set and also with the test dataset. Here, we can define
these datasets as:
Training Set: A subset of dataset to train the machine learning model, and we already know the
output.
Test set: A subset of dataset to test the machine learning model, and by using the test set, model
predicts the output.
For splitting the dataset, we will use the below lines of code:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=1)
print(x_train)
print(x_test)
print(y_train)
print(y_test)
print('\n')
Explanation:
o In the above code, the first line is used for splitting arrays of the dataset into random train and test subsets.
o In the second line, we have used four variables for our output: x_train, x_test, y_train and y_test.
o In the train_test_split() function, we have passed four parameters, of which the first two are the arrays of data,
and test_size specifies the size of the test set. The test_size may be 0.5, 0.3, or 0.2, which gives the dividing
ratio between the training and test sets.
o The last parameter random_state is used to set a seed for a random generator so that you always get the same
result, and the most used value for this is 42.
Output:
By executing the above code, we will get 4 different variables, which can be seen under the variable
explorer section.
As we can see in the above image, the x and y variables are divided into 4 different variables with
corresponding values.
7) Feature Scaling
Feature scaling is the final step of data pre-processing in machine learning. It is a technique to
standardize the independent variables of the dataset in a specific range. In feature scaling, we put
our variables in the same range and on the same scale so that no variable dominates the others.
If we compute any two values from age and salary, then salary values will dominate the age values,
and it will produce an incorrect result. So to remove this issue, we need to perform feature scaling
for machine learning.
There are two ways to perform feature scaling in machine learning:
Standardization
Normalization
For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing library as:
from sklearn.preprocessing import StandardScaler
Now, we will create the object of StandardScaler class for independent variables or features. And
then we will fit and transform the training dataset.
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
For test dataset, we will directly apply transform() function instead of fit_transform() because it is
already done in training set.
x_test= st_x.transform(x_test)
Output:
By executing the above lines of code, we will get the scaled values for x_train and x_test as:
x_train:
Note: Here, we have not scaled the dependent variable because it has only two values, 0 and 1.
But if a variable has a wider range of values, then we will also need to scale it.
Now, in the end, we can combine all the steps together to make our complete code more
understandable.
import numpy as np
import matplotlib.pyplot as mtp
import pandas as pd
data_set= pd.read_csv('Data.csv')
print(data_set)
print("\n")
x= data_set.iloc[:,:-1].values
print(x)
print('\n')
y=data_set.iloc[:,3].values
print(y)
print('\n')
#Handling missing data (Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
print(x)
print('\n')
# Encoding the independent categorical variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
print(x)
print('\n')
# Encoding the Dependent Variable
le = LabelEncoder()
y = le.fit_transform(y)
print(y)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
print(x_train)
print(x_test)
print(y_train)
print(y_test)
print('\n')
# Feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
print(x_train)
print(x_test)
In the above code, we have included all the data pre-processing steps together. But there are some
steps or lines of code which are not necessary for all machine learning models. So we can exclude
them from our code to make it reusable for all models.
o We want to find out if there is any correlation between years of experience and salary.
In this section, we will create a Simple Linear Regression model to find the best-fitting line representing
the relationship between these two variables.
To implement the Simple Linear regression model in machine learning using Python, we need to follow the
below steps:
The first step for creating the Simple Linear Regression model is data pre-processing. But there will be some
changes, which are given in the below steps:
o First, we will import the three important libraries, which will help us for loading the dataset, plotting the graphs,
and creating the Simple Linear Regression model.
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Next, we will load the dataset into our code: Download dataset
data_set= pd.read_csv('Salary_Data.csv')
By executing the above line of code (ctrl+ENTER), we can read the dataset on our Spyder IDE screen by clicking on
the variable explorer option.
o After that, we need to extract the dependent and independent variables from the given dataset. The independent
variable is years of experience, and the dependent variable is salary. Below is code for it:
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values
In the above lines of code, for the x variable we have taken :-1 since we want to remove the last column from the
dataset. For the y variable we have taken 1 as the parameter, since we want to extract the second column and indexing
starts from zero.
By executing the above line of code, we will get the output for X and Y variable as:
o Next, we will split both variables into the test set and training set. We have 30 observations, so we will take 20
observations for the training set and 10 observations for the test set. We are splitting our dataset so that we can
train our model using a training dataset and then test the model using a test dataset. The code for this is given
below:
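The split code is not reproduced in this copy; a minimal sketch, assuming test_size = 1/3 (so that 10 of the 30 observations go to the test set) and random_state = 0:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)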
By executing the above code, we will get x-test, x-train and y-test, y-train dataset. Consider the
below images:
Test-dataset:
o For simple linear Regression, we will not use Feature Scaling. Because Python libraries take care of it for some
cases, so we don't need to perform it here. Now, our dataset is well prepared to work on it and we are going to
start building a Simple Linear Regression model for the given problem.
Now the second step is to fit our model to the training dataset. To do so, we will import the LinearRegression class
of the linear_model library from the scikit learn. After importing the class, we are going to create an object of the class
named as a regressor. The code for this is given below:
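A sketch of the fitting step described here:
# Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)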
In the above code, we have used the fit() method to fit our Simple Linear Regression object to the
training set. In the fit() function, we have passed x_train and y_train, which are our training data
for the independent and dependent variables. We have fitted our regressor object to the training
set so that the model can easily learn the correlations between the predictor and target variables.
After executing the above lines of code, we will get the below output.
Output:
The regressor has now learned the relationship between the dependent variable (Salary) and the independent variable
(Experience). So our model is ready to predict the output for new observations. In this step, we will provide the test
dataset (new observations) to the model to check whether it can predict the correct output or not.
We will create a prediction vector y_pred, and x_pred, which will contain predictions of test dataset, and prediction of
training set respectively.
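A sketch of the prediction step:
# Prediction of test and training set results
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)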
On executing the above lines of code, two variables named y_pred and x_pred will be generated in the variable explorer
options, containing the salary predictions for the test set and the training set respectively.
Now in this step, we will visualize the training set result. To do so, we will use the scatter() function of the pyplot library,
which we have already imported in the pre-processing step. The scatter () function will create a scatter plot of
observations.
In the x-axis, we will plot the Years of Experience of employees and on the y-axis, salary of employees. In the function,
we will pass the real values of training set, which means a year of experience x_train, training set of Salaries y_train, and
color of the observations. Here we are taking a green color for the observation, but it can be any color as per the choice.
Now, we need to plot the regression line, so for this, we will use the plot() function of the pyplot library. In this
function, we will pass the years of experience for training set, predicted salary for training set x_pred, and color of the
line.
Next, we will give the title for the plot. So here, we will use the title() function of the pyplot library and pass the name
("Salary vs Experience (Training Dataset)").
After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel() function.
Finally, we will represent all above things in a graph using show(). The code is given below:
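A sketch of the plotting code described above (the axis label strings are assumptions):
mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()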
Output:
By executing the above lines of code, we will get the below graph plot as an output.
In the above plot, we can see the real values observations in green dots and predicted values are
covered by the red regression line. The regression line shows a correlation between the dependent
and independent variable.
The good fit of the line can be observed by calculating the difference between actual values and
predicted values. But as we can see in the above plot, most of the observations are close to the
regression line, hence our model is good for the training set.
In the previous step, we have visualized the performance of our model on the training set. Now, we will do the same for
the Test set. The complete code will remain the same as the above code, except in this, we will use x_test, and y_test
instead of x_train and y_train.
Here we are also changing the color of observations and regression line to differentiate between the two plots, but it is
optional.
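A sketch of the test-set plot, mirroring the training-set code with the colours described below:
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()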
By executing the above line of code, we will get the output as:
In the above plot, there are observations given by the blue color, and prediction is given by the red regression line. As
we can see, most of the observations are close to the regression line, hence we can say our Simple Linear Regression is
a good model and able to make good predictions.
#importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
data_set= pd.read_csv('Salary_Data.csv')
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values
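The listing above stops after the extraction step; a hedged sketch of the remaining steps, mirroring the snippets shown earlier in this experiment:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)
# Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
# Prediction of test and training set results
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)
# Visualizing the training set results
mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()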
Resources:
salary_data.csv
simple_linear_regression.py
The dataset contains information about UserID, Gender, Age, EstimatedSalary, and Purchased. Use this dataset to predict
whether a user will purchase the company's newly launched product or not, using a Logistic Regression model.
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use the
same steps as we have done in previous topics of Regression. Below are the steps:
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use
it in our code efficiently. It will be the same as we have done in Data pre-processing topic. The code
for this is given below:
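A sketch of the pre-processing imports and dataset loading (the file name user_data.csv is taken from the K-NN section, which states that its pre-processing is identical to this one):
# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# Importing the dataset
data_set = pd.read_csv('user_data.csv')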
By executing the above lines of code, we will get the dataset as the output. Consider the given
image:
Now, we will extract the dependent and independent variables from the given dataset. Below is the
code for it:
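A sketch of the extraction, following the indices described in the next sentence:
# Extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values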
In the above code, we have taken [2, 3] for x because our independent variables are age and salary,
which are at index 2, 3. And we have taken 4 for y variable because our dependent variable is at
index 4. The output will be:
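The train/test split is not shown in this copy; a sketch, assuming the 75/25 split and random_state = 0 used in the later K-NN section:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)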
In logistic regression, we will do feature scaling because we want accurate prediction results.
Here we will only scale the independent variables, because the dependent variable has only 0 and 1
values. Below is the code for it:
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
We have well prepared our dataset, and now we will train the dataset using the training set. For
providing training or fitting the model to the training set, we will import
the LogisticRegression class of the sklearn library.
After importing the class, we will create a classifier object and use it to fit the model to the logistic
regression. Below is the code for it:
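A sketch of the fitting step:
# Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)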
Output: By executing the above code, we will get the below output:
Our model is well trained on the training set, so we will now predict the result by using test set data.
Below is the code for it:
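A sketch of the prediction step described in the next sentence:
# Predicting the test set result
y_pred = classifier.predict(x_test)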
In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under the variable
explorer option. It can be seen as:
The above output image shows the corresponding predicted users who want to purchase or not
purchase the car.
Now we will create the confusion matrix here to check the accuracy of the classification. To create it,
we need to import the confusion_matrix function of the sklearn library. After importing the function,
we will call it using a new variable cm. The function takes two parameters, mainly y_true (the actual
values) and y_pred (the target values returned by the classifier). Below is the code for it:
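A sketch of the confusion-matrix step:
# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)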
Output:
By executing the above code, a new confusion matrix will be created. Consider the below image:
We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above
output, we can interpret that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect.
Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:
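A sketch of the visualization code summarized in the next paragraph, continuing from the classifier fitted above (the purple/green colours follow the description of the output):
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()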
In the above code, we have imported the ListedColormap class of Matplotlib library to create the
colormap for visualizing the result. We have created two new variables x_set and y_set to
replace x_train and y_train. After that, we have used the nm.meshgrid command to create a
rectangular grid, which has a range of -1(minimum) to 1 (maximum). The pixel points we have taken
are of 0.01 resolution.
To create a filled contour, we have used mtp.contourf command, it will create regions of provided
colors (purple and green). In this function, we have passed the classifier.predict to show the
predicted data points predicted by the classifier.
Output: By executing the above code, we will get the below output:
o In the above graph, we can see that there are some Green points within the green region
and Purple points within the purple region.
o All these data points are the observation points from the training set, which shows the result
for purchased variables.
We have successfully visualized the training set result for the logistic regression, and our goal for
this classification is to divide the users who purchased the SUV car and who did not purchase the
car. So from the output graph, we can clearly see the two regions (Purple and Green) with the
observation points. The Purple region is for those users who didn't buy the car, and Green Region is
for those users who purchased the car.
Linear Classifier:
As we can see from the graph, the classifier is a Straight line or linear in nature as we have used the
Linear model for Logistic Regression. In further topics, we will learn for non-linear Classifiers.
Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (Test set). The code for the test set will remain same as above except that here we will
use x_test and y_test instead of x_train and y_train. Below is the code for it:
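A sketch of the test-set visualization (identical to the training-set code, with the test data substituted):
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()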
Output:
The above graph shows the test set result. As we can see, the graph is divided into two regions
(Purple and Green). And Green observations are in the green region, and Purple observations are in
the purple region. So we can say it is a good prediction and model. Some of the green and purple
data points are in different regions, which can be ignored as we have already calculated this error
using the confusion matrix (11 Incorrect output).
Hence our model is pretty good and ready to make new predictions for this classification problem.
Implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms which helps in building
the fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis, and classifying
articles.
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the
occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red,
spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying
it as an apple, without depending on the other features.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, and is used to determine the probability of a
hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we need to follow the below steps:
Problem: If the weather is sunny, then the Player should play or not?
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes that
these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomial
distributed. It is primarily used for document classification problems, it means a particular
document belongs to which category such as Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the predictor
variables are the independent Booleans variables. Such as if a particular word is present or
not in a document. This model is also famous for document classification tasks.
Steps to implement:
o Data Pre-processing step
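The loading and extraction lines are not present in this copy; a minimal sketch is given below (the column indices are an assumption based on the similar user_data.csv examples elsewhere in this manual):
# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset (file name as quoted in the explanation below)
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values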
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
print ('\n the total number of Training Data :',y_train.shape)
print ('\n the total number of Test Data :',y_test.shape)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
In the above code, we have loaded the dataset into our program using
"dataset = pd.read_csv('user_data.csv')". The loaded dataset is divided into a training set and a test set, and then the feature variables are scaled.
In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We can
also use other classifiers as per our requirement.
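The prediction and accuracy steps of this experiment are not shown in this copy; a minimal sketch, continuing from the classifier fitted above:
# Predicting the test set results and computing the accuracy
y_pred = classifier.predict(x_test)
from sklearn.metrics import confusion_matrix, accuracy_score
print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
print('Accuracy of the classifier:', accuracy_score(y_test, y_pred))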
Construct a Bayesian network considering medical data, and use it to demonstrate the diagnosis of heart patients using the standard Heart Disease Data Set. You can use a Python ML library such as pgmpy.
Theory
A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional
dependency, and each node corresponds to a unique random variable.
Bayesian network consists of two major parts: a directed acyclic graph and a set of conditional
probability distributions
For illustration, consider the following example. Suppose we attempt to turn on our computer, but
the computer does not start (observation/evidence). We would like to know which of the possible
causes of computer failure is more likely. In this simplified illustration, we assume only two possible
causes of this misfortune: electricity failure and computer malfunction.
Fig: Directed acyclic graph representing two independent possible causes of a computer failure.
Data Set:
Title: Heart Disease Databases
The Cleveland database contains 76 attributes, but all published experiments refer to using a subset
of 14 of them. In particular, the Cleveland database is the only one that has been used by ML
researchers to this date. The "Heartdisease" field refers to the presence of heart disease in the
patient. It is integer valued from 0 (no presence) to 4.
Database: 0 1 2 3 4 Total
Cleveland: 164 55 36 35 13 303
Attribute Information:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal Heartdisease
63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
62 0 4 140 268 0 2 160 0 3.6 3 2 3 3
60 1 4 130 206 0 2 132 1 2.4 2 2 7 4
Program:
import numpy as np
import pandas as pd
import csv
import pgmpy
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.models import BayesianModel
from pgmpy.inference import VariableElimination
heartDisease = pd.read_csv('heart.csv')
heartDisease = heartDisease.replace('?',np.nan)
model=
BayesianModel([('age','heartdisease'),('sex','heartdisease'),('exang','heartdisease'),('cp','heartdisease'),
('heartdisease','restecg'),('heartdisease','chol')])
print('\nLearning CPD using Maximum likelihood estimators')
model.fit(heartDisease,estimator=MaximumLikelihoodEstimator)
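The listing stops after fitting the model; a sketch of the inference step using the VariableElimination class imported above (the evidence values are illustrative, and the column names are assumed to match those used in the model definition):
print('\nInferencing with Bayesian Network:')
HeartDisease_infer = VariableElimination(model)
print('\n1. Probability of HeartDisease given evidence restecg=1')
q1 = HeartDisease_infer.query(variables=['heartdisease'], evidence={'restecg': 1})
print(q1)
print('\n2. Probability of HeartDisease given evidence cp=2')
q2 = HeartDisease_infer.query(variables=['heartdisease'], evidence={'cp': 2})
print(q2)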
Demonstrate the working of the decision tree algorithm. Use an appropriate data set for building the decision tree and apply this knowledge to classify a new
sample.
Decision Tree Classification Algorithm
o Decision Tree is a Supervised learning technique that can be used for both classification and Regression
problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision nodes are used
to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do
not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further
branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree
algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the
child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real dataset)
attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide the S into subsets that contains possible values for the best attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final nodes are the leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary
attribute by ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node further gets split into
one decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offers and Declined offer). Consider the below diagram:
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes.
To solve such problems there is a technique known as the Attribute Selection Measure, or ASM. By this measurement, we can easily select the best
attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset based on an
attribute.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute having
the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Average) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in
data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and
Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the important
features of the dataset. A technique that decreases the size of the learning tree without
reducing accuracy is therefore known as Pruning. There are mainly two types of tree pruning technology
used:
o Cost Complexity Pruning
o Reduced Error Pruning
Disadvantages of the Decision Tree:
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
Steps will also remain the same, which are given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have pre-processed the data: the dataset is loaded, split into training and test sets, and the
features are scaled.
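A sketch of the fitting step described below; the parameters follow the Out[8] record shown further down:
# Fitting a Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)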
In the above code, we have created a classifier object, in which we have passed two main
parameters:
o "criterion='entropy'": Criterion is used to measure the quality of the split, which is calculated by the information gain
given by entropy.
o "random_state=0": For generating the random states.
Out[8]:
DecisionTreeClassifier(class_weight=None,criterion='entropy',
max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=0, splitter='best')
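A sketch of the prediction step whose output is described below:
# Predicting the test set result
y_pred = classifier.predict(x_test)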
Output:
In the below output image, the predicted output and real test output are given. We can clearly see
that there are some values in the prediction vector, which are different from the real vector values.
These are prediction errors.
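A sketch of the confusion-matrix step referred to later in this experiment:
# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)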
Output:
Output:
The above output is completely different from the rest classification models. It has both vertical and
horizontal lines that are splitting the dataset according to the age and estimated salary variable.
As we can see, the tree is trying to capture each dataset, which is the case of overfitting.
Output:
As we can see in the above image that there are some green data points within the purple region
and vice versa. So, these are the incorrect predictions which we have discussed in the confusion
matrix.
OR
Algorithm Concepts
Entropy (Attribute): the entropy of an attribute is calculated in the same way as we calculated it for the system (the whole
data set).
Information Gain: Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) × Entropy(Sv)
1. If all examples are positive, return the single-node tree with label = +.
2. If all examples are negative, return the single-node tree with label = −.
3. If the attribute list is empty, return the single-node tree labelled with the most common value of the target attribute.
Otherwise:
1. The attribute that has the most information gain becomes the root node; create a branch for each of its values and
process each branch in the same way as we have done for the parent (root) node.
2. Again, the feature which has the maximum information gain will become a node, and this process
will continue until we get the leaf nodes.
Output
def entropy(probs):
    import math
    # entropy of a list of class probabilities
    return sum([-prob * math.log(prob, 2) for prob in probs])

def entropy_of_list(a_list):
    from collections import Counter
    # count how many times each class label occurs
    cnt = Counter(x for x in a_list)
    num_instances = len(a_list)
    probs = [x / num_instances for x in cnt.values()]
    return entropy(probs)

total_entropy = entropy_of_list(df_tennis['PT'])
Output
collections.Counter()
A counter is a container that stores elements as dictionary keys, and their counts are stored as
dictionary values.
print("target_attribute_name",target_attribute_name)
df_split = df.groupby(split_attribute_name)
print("Name: ",name)
print("Group: ",group)
print("NOBS",nobs)
print("df_agg_ent",df_agg_ent)
old_entropy = entropy_of_list(df[target_attribute_name])
In the same way, we calculate the information gain of the remaining attributes, and the
attribute with the highest information gain is chosen as the best attribute.
o K-NN algorithm assumes the similarity between the new case/data and available cases and
put the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on the
similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used
for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and, at the time of classification, it performs an
action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data, then
it classifies that data into a category that is much similar to the new data.
o Example: Suppose, we have an image of a creature that looks similar to cat and dog, but we
want to know either it is a cat or dog. So for this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the similar features of
the new data set to the cats and dogs images and based on the most similar features it will
put it in either cat or dog category.
o Step-1: Select the number K of the neighbors.
o Step-2: Calculate the Euclidean distance between the new data point and the training points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. It
can be calculated as:
o As we can see, 3 of the 5 nearest neighbors are from category A and 2 are from category B;
hence this new data point must belong to category A.
o The computation cost is high because of calculating the distance between the data points for all the training
samples.
Problem for K-NN Algorithm: There is a Car manufacturer company that has manufactured a new
SUV car. The company wants to give the ads to the users who are interested in buying that SUV. So
for this problem, we have a dataset that contains multiple user's information through the social
network. The dataset contains lots of information but the Estimated Salary and Age we will
consider for the independent variable and the Purchased variable is for the dependent variable.
Below is the dataset:
The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the code
for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
By executing the above code, our dataset is imported to our program and well pre-processed. After
feature scaling our test dataset will look like:
From the above output image, we can see that our data is successfully scaled.
And then we will fit the classifier to the training data. Below is the code for it:
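A sketch of the fitting step; the parameter values follow the Out[10] record shown below:
# Fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)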
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we did in Logistic
Regression. Below is the code for it:
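A sketch of the prediction step:
# Predicting the test set result
y_pred = classifier.predict(x_test)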
Output:
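Creating the Confusion Matrix: a sketch of the step explained in the next sentence:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)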
In the above code, we have imported the confusion_matrix function and called it using the variable cm.
Output: By executing the above code, we will get the matrix as below:
In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7 incorrect
predictions, whereas, in Logistic Regression, there were 11 incorrect predictions. So we can say that
the performance of the model is improved by using the K-NN algorithm.
Now, we will visualize the training set result for K-NN model. The code will remain same as
we did in Logistic Regression, except the name of the graph. Below is the code for it:
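The opening lines of this snippet did not survive in this copy; a minimal reconstruction is sketched below, continuing from the classifier fitted above (the red/green colours follow the description of the output graph):
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))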
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
The output graph is different from the graph we obtained in Logistic Regression. It can
be understood from the below points:
o The graph is showing an irregular boundary instead of showing any straight line or any curve
because it is a K-NN algorithm, i.e., finding the nearest neighbor.
o The graph has classified users in the correct categories as most of the users who didn't buy
the SUV are in the red region and users who bought the SUV are in the green region.
o The graph is showing a good result, but still there are some green points in the red region and
red points in the green region. This is not a big issue, as by doing this the model is prevented
from overfitting.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

iris=datasets.load_iris()
x = iris.data
y = iris.target
print(x)
print('class: 0-Iris-Setosa, 1- Iris-Versicolour, 2- Iris-Virginica')
print(y)
# Splits the dataset into 70% train data and 30% test data. This means that out of total 150 records,
# the training set will contain 105 records and the test set contains 45 of those records.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
# n_neighbors=5 is an assumed value; any small odd k works here
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
print('Confusion Matrix')
print(confusion_matrix(y_test,y_pred))
print('Accuracy Metrics')
print(classification_report(y_test,y_pred))