7IT07
Practical Record
Semester VII
(Subject code: 7IT07)
Subject: Machine Learning Lab
(Academic Year: 2022-2023)
CERTIFICATE
6. List of Experiments
a. Title/Aim
b. Apparatus/Components used
c. Theory
d. Procedure/Steps
e. Observations
f. Output/Result
g. Conclusion
h. Viva Questions
Machine Learning Lab [7IT07]
VISION
To become a pace-setting
Centre of excellence believing in three
Universal values namely
Synergy, Trust and Passion,
with zeal to serve the Nation
in the global scenario
MISSION
M1: To achieve the highest standard in technical education through state-of-the-art
pedagogy and enhanced industry-institute linkages.
M2: To inculcate the culture of research in core and emerging areas.
M3: To strive for overall development of students so as to nurture ingenious technocrats as
well as responsible citizens.
VISION
Attaining growing needs of industry and society through
Information Technology with ethical values.
MISSION
QUALITY POLICY
“Striving for Excellence in the Quality Engineering Education”
UNIT 6: Multilayer Artificial Neural Network and Deep Learning: Modeling complex
functions with artificial neural networks, Classifying handwritten digits, Training an artificial
neural network, About the convergence in neural networks, A few last words about the neural
network implementation, Parallelizing Neural Network Training with TensorFlow, TensorFlow
and training performance.
Text Books
1. Sebastian Raschka and Vahid Mirjalili, "Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow".
Reference books:
1. Andriy Burkov, "The Hundred-Page Machine Learning Book".
2. Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems".
3. Andreas C. Müller and Sarah Guido, "Introduction to Machine Learning with Python: A Guide for Data Scientists".
4. Chris Albon, "Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning".
Roll no. | A (5 marks) | C (5 marks) | I (5 marks) | P (5 marks) | V (5 marks) | Total (25 marks)
6. List of Experiments
Practical No. 1
Aim: To install Python with the Anaconda distribution and a compatible IDE.
Theory: In order to use Python for machine learning, we need to install it on our computer
system along with a compatible Integrated Development Environment (IDE).
The steps below show how to download and install Anaconda and the IDE:
To download Anaconda, first open your favorite browser, search for "Download Anaconda
Python", and click on the first link, as shown in the image below. Alternatively, you can
download it directly from https://www.anaconda.com/distribution/#download-section.
After clicking on the first link, you will reach the Anaconda download page, as shown in
the image below:
Since Anaconda is available for Windows, Linux, and macOS, you can download it as per
your OS type by clicking on the available options shown in the image below. It offers both
Python 2.7 and Python 3.7 versions; since 3.7 is the newer of the two, we will download the
Python 3.7 version. After clicking on the download option, the download will start on your
computer.
Once the download completes, go to Downloads and double-click on the Anaconda ".exe"
file (Anaconda3-2019.03-Windows-x86_64.exe). It will open a setup window for the
Anaconda installation, as shown in the image below; then click Next.
A License Agreement window will open; click "I Agree" and move further.
In the next window, you will get two installation options, as shown in the image below.
Select the first option (Just Me) and click Next.
Now you will get a window for the installation location; you can leave it as the default or
change it by browsing to a location, and then click Next. Consider the image below:
Once the installation is complete, tick the checkbox if you want to learn more about
Anaconda and Anaconda Cloud, and click Finish to end the process.
Note: Here, we will use the Spyder IDE to run Python programs.
o After opening Anaconda Navigator, launch the Spyder IDE by clicking on the Launch
button given below Spyder. (On first use, Navigator may need to install Spyder before it
can be launched.)
Write your first program and save it with the .py extension.
You can check the program's output in the console pane at the bottom right.
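To verify the setup, a minimal first script might look like the sketch below (the file name hello.py is just an example):

```python
# hello.py - a minimal first program to check that Python runs correctly
message = "Hello, Machine Learning Lab!"
print(message)
print(2 + 3)  # the interpreter evaluates expressions: prints 5
```

Running this from Spyder should print the greeting and the number 5 in the console pane.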
Conclusion:
Viva Question:
Signature
Practical No. 2
Aim: To study datasets and the sources of freely available datasets for machine learning.
Theory:
A dataset is a collection of data in which the data is arranged in some order. A dataset can
contain anything from an array to a database table. The table below shows an example of a
dataset (the blank cell is a missing value):

Country | Age | Salary | Purchased
India   | 38  | 48000  | No
Germany | 30  | 54000  | No
France  | 48  | 65000  | No
Germany | 40  |        | Yes
A tabular dataset can be understood as a database table or matrix, where each column
corresponds to a particular variable and each row corresponds to a record of the
dataset. The most widely supported file type for a tabular dataset is the "Comma-Separated
Values" file, or CSV. But to store tree-like data, a JSON file is more efficient.
Note: A real-world dataset is of huge size, which is difficult to manage and process at the
initial level. Therefore, to practice machine learning algorithms, we can use any dummy
dataset.
Need of Dataset
To work with machine learning projects, we need a huge amount of data, because, without
the data, one cannot train ML/AI models. Collecting and preparing the dataset is one of the
most crucial parts while creating an ML/AI project.
The technology applied behind any ML project cannot work properly if the dataset is not
well prepared and pre-processed.
During the development of the ML project, the developers completely rely on the datasets. In
building ML applications, datasets are divided into two parts:
o Training dataset: the part of the data used to fit (train) the model.
o Test dataset: the part of the data held back to evaluate how well the trained model
generalizes to unseen examples.
Below is a list of datasets which are freely available for the public to work on:
1. Kaggle Datasets
Kaggle is one of the best sources for providing datasets for Data Scientists and Machine
Learners. It allows users to find, download, and publish datasets in an easy way. It also
provides the opportunity to work with other machine learning engineers and solve difficult
Data Science related tasks.
Kaggle provides high-quality datasets in different formats that we can easily find and
download.
2. UCI Machine Learning Repository
The UCI Machine Learning Repository is one of the great sources of machine learning datasets.
This repository contains databases, domain theories, and data generators that are widely used
by the machine learning community for the analysis of ML algorithms.
Since 1987, it has been widely used by students, professors, and researchers as a primary
source of machine learning datasets.
It classifies the datasets as per the problems and tasks of machine learning such
as Regression, Classification, Clustering, etc. It also contains some of the popular datasets
such as the Iris dataset, Car Evaluation dataset, Poker Hand dataset, etc.
3. Datasets via AWS
We can search, download, access, and share the datasets that are publicly available via AWS
resources. These datasets can be accessed through AWS resources but are provided and
maintained by different government organizations, researchers, businesses, or individuals.
Anyone can analyze and build various services using shared data via AWS resources. The
shared datasets on the cloud help users to spend more time on data analysis rather than on
data acquisition.
This source provides the various types of datasets with examples and ways to use the dataset.
It also provides the search box using which we can search for the required dataset. Anyone
can add any dataset or example to the Registry of Open Data on AWS.
5. Microsoft Datasets
Microsoft has launched the "Microsoft Research Open Data" repository with a
collection of free datasets in various areas such as natural language processing, computer
vision, and domain-specific sciences.
Using this resource, we can download the datasets to use on the current device, or we can
also directly use it on the cloud infrastructure.
6. Awesome Public Dataset Collection
The Awesome public dataset collection provides high-quality datasets that are arranged in a
well-organized list according to topics such as Agriculture, Biology, Climate, Complex
networks, etc. Most of the datasets are available for free, but some may not be, so it is
better to check the license before downloading a dataset.
7. Government Datasets
There are different sources to get government-related data. Various countries publish
government data, collected from their different departments, for public use.
The goal of providing these datasets is to increase transparency of government work among
the people and to use the data in an innovative approach. Below are some links of
government datasets:
8. Visual Data
Visual Data provides a number of great datasets that are specific to computer vision tasks
such as Image Classification, Video Classification, Image Segmentation, etc.
Therefore, if you want to build a project on deep learning or image processing, then you can
refer to this source.
9. Scikit-learn dataset
Scikit-learn is a great source for machine learning enthusiasts. This source provides both toy
and real-world datasets. These datasets can be obtained from the sklearn.datasets package
and accessed using the general dataset API.
The toy datasets available in scikit-learn can be loaded using predefined functions such
as load_boston([return_X_y]), load_iris([return_X_y]), etc., rather than importing files
from external sources. But these datasets are not suitable for real-world projects.
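A minimal sketch of loading one of these toy datasets (assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris

# load_iris with return_X_y=True returns the feature matrix and target vector
X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4): 150 flowers, 4 measurements each
print(y.shape)  # (150,): one class label (0, 1, or 2) per flower
```

The same pattern works for the other loaders in sklearn.datasets.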
Conclusion:
Viva Question:
Signature
Practical No. 3
Aim: To study Data Preprocessing in Machine Learning
Theory:
Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.
When creating a machine learning project, we do not always come across clean and
formatted data. And while doing any operation with data, it is mandatory to clean it and put
it in a formatted way. For this, we use the data preprocessing task.
Real-world data generally contains noise and missing values, and may be in an unusable
format which cannot be directly used for machine learning models. Data preprocessing is
the required task for cleaning the data and making it suitable for a machine learning model,
which also increases the accuracy and efficiency of the model.
To create a machine learning model, the first thing we require is a dataset, as a machine
learning model completely works on data. The collected data for a particular problem in a
proper format is known as the dataset.
Datasets may be of different formats for different purposes; for example, the dataset for a
business-purpose model will be different from the dataset required for a liver-patient model.
So each dataset is different from another dataset. To use the dataset in our code, we usually
put it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.
CSV stands for "Comma-Separated Values"; it is a file format which allows us to save
tabular data, such as spreadsheets. It is useful for huge datasets, and such datasets can be
used directly in programs.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data preprocessing, which are:
Numpy: The Numpy Python library is used for including any type of mathematical
operation in the code. It is the fundamental package for scientific computation in
Python. It also adds support for large, multidimensional arrays and matrices. In
Python, we can import it as:
import numpy as np
Here we have used np, which is a short name for Numpy, and it will be used in the whole
program.
Pandas: The Pandas library is one of the most famous Python libraries, used for
importing and managing datasets. It is an open-source data manipulation and
analysis library. It is imported as below:
import pandas as pd
Here, we have used pd as a short name for this library. Consider the below image:
Now we need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as a
working directory. To set a working directory in Spyder IDE, we need to follow the
below steps:
Note: We can set any directory as a working directory, but it must contain the required
dataset.
Here, in the below image, we can see the Python file along with the required dataset.
Now, the current folder is set as a working directory.
read_csv() function:
Now to import the dataset, we will use the read_csv() function of the pandas library, which
is used to read a CSV file and perform various operations on it. Using this function, we can
read a CSV file locally as well as through a URL.
data_set= pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable used to store our dataset, and inside the function, we have
passed the name of our dataset. Once we execute the above line of code, it will successfully
import the dataset in our code. We can also check the imported dataset by clicking on the
section variable explorer, and then double click on data_set. Consider the below image:
As in the above image, indexing is started from 0, which is the default indexing in
Python. We can also change the format of our dataset by clicking on the format
option.
x= data_set.iloc[:,:-1].values
In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is
for all the columns. Here we have used :-1, because we don't want to take the last column
as it contains the dependent variable. So by doing this, we will get the matrix of features.
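As a sketch, the same extraction can be tried on a small hypothetical stand-in for Dataset.csv (the column names mirror the sample table from Practical No. 2); for completeness it also pulls out the last column as the dependent variable:

```python
import pandas as pd

# Hypothetical stand-in for Dataset.csv, mirroring the sample table
data_set = pd.DataFrame({
    "Country":   ["India", "Germany", "France", "Germany"],
    "Age":       [38, 30, 48, 40],
    "Salary":    [48000, 54000, 65000, None],   # one missing value
    "Purchased": ["No", "No", "No", "Yes"],
})

# All rows, every column except the last -> matrix of independent features
x = data_set.iloc[:, :-1].values
# All rows, only the last column -> dependent variable
y = data_set.iloc[:, -1].values
print(x.shape)   # (4, 3)
print(list(y))   # ['No', 'No', 'No', 'Yes']
```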
The next step of data preprocessing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning
model. Hence it is necessary to handle missing values present in the dataset.
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values. In
this way, we just delete the specific row or column which consists of null values. But this
way is not so efficient, and removing data may lead to loss of information, which will not
give an accurate output.
By calculating the mean: In this way, we will calculate the mean of that column or row
which contains any missing value and will put it on the place of missing value. This strategy
is useful for the features which have numeric data such as age, salary, year, etc. Here, we will
use this approach.
To handle missing values, we will use Scikit-learn library in our code, which contains
various libraries for building machine learning models. Here we will use Imputer class
of sklearn.preprocessing library. Below is the code for it:
#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
#Fitting imputer object to the independent variables x
imputer = imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
Output:
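In recent scikit-learn versions the Imputer class shown above was removed in favor of SimpleImputer from sklearn.impute; a minimal equivalent sketch, using hypothetical Age/Salary values matching the sample table:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age/Salary columns with one missing salary, as in the sample dataset
x = np.array([[38.0, 48000.0],
              [30.0, 54000.0],
              [48.0, 65000.0],
              [40.0, np.nan]])

# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
x = imputer.fit_transform(x)
print(x[3, 1])  # the missing salary becomes (48000 + 54000 + 65000) / 3
```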
Categorical data is data which has some categories; in our dataset, there are two
categorical variables, Country and Purchased.
Since a machine learning model completely works on mathematics and numbers, a
categorical variable in the dataset may create trouble while building the model. So it is
necessary to encode these categorical variables into numbers.
Firstly, we will convert the Country variable into numeric data. To do this, we will
use the LabelEncoder() class from the sklearn.preprocessing library.
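A minimal sketch of this encoding step, using the country values from the sample dataset:

```python
from sklearn.preprocessing import LabelEncoder

countries = ["India", "Germany", "France", "Germany"]

# LabelEncoder assigns an integer to each category (sorted alphabetically:
# France -> 0, Germany -> 1, India -> 2)
encoder = LabelEncoder()
encoded = encoder.fit_transform(countries)
print(list(encoded))  # [2, 1, 0, 1]
```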
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and test set.
This is one of the crucial steps of data preprocessing as by doing this, we can enhance the
performance of our machine learning model.
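The split is usually done with scikit-learn's train_test_split function; a minimal sketch with hypothetical toy data:

```python
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels (hypothetical values, for illustration only)
X = [[i] for i in range(10)]
y = [0, 1] * 5

# Hold back 20% of the rows as the test set; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))  # 8 2
```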
Conclusion: -
Viva Question:
Signature
Practical No. 4
Objective:
• To understand the programming constructs in python
• To understand basic statistics concepts and implement its formulae using
python
Theory:
Numeric Types:
Integers:
In Python 3, there is effectively no limit to how long an integer value
can be. Of course, it is constrained by the amount of memory your
system has.
>>> print(10)
10
>>> type(10)
<class 'int'>
Floating-Point Numbers:
>>> 4.2
4.2
>>> .4e7
4000000.0
Complex Numbers
Complex numbers are specified as <real part>+<imaginary part>j.
>>> 2+3j
(2+3j)
>>> type(2+3j)
<class 'complex'>
Strings:
Strings are sequences of character data. The string type in Python is called str. String literals
may be delimited using either single or double quotes. All the characters between the opening
delimiter and matching closing delimiter are part of the string.
A string in Python can contain as many characters as you wish. The only limit is your
machine's memory resources.
>>> print("I am a string.")
I am a string.
>>> type("I am a string.")
<class 'str'>
A raw string literal is preceded by r or R, which specifies that escape sequences in the
associated string are not translated; the backslash character is left in the string.
>>> print('foo\nbar')
foo
bar
>>> print(r'foo\nbar')
foo\nbar
>>> print('foo\\bar')
foo\bar
Boolean Type:
Python 3 provides a Boolean data type. Objects of Boolean type may have one of two values,
True or False.
>>> type(True)
<class 'bool'>
>>> type(False)
<class 'bool'>
Python List:
A list is an ordered sequence of items. It is one of the most used datatypes in Python and is
very flexible. All the items in a list do not need to be of the same type. Declaring a list is
pretty straightforward: items separated by commas are enclosed within brackets [ ]. We can
use the slicing operator [ ] to extract an item or a range of items from a list. Indexing starts
from 0 in Python.
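The points above can be tried in a short sketch:

```python
# A list may mix item types; indexing starts from 0
a = [1, 2.2, "python"]
print(a[0])    # 1
print(a[-1])   # python (negative indices count from the end)

b = [5, 10, 15, 20, 25, 30]
print(b[1:4])  # [10, 15, 20] - from index 1 up to, not including, index 4
b[1] = 99      # lists are mutable, so items can be reassigned
print(b[:2])   # [5, 99]
```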
Python Tuple:
A tuple is an ordered sequence of items, same as a list. The only difference is that tuples are
immutable: tuples once created cannot be modified. Tuples are used to write-protect data and
are usually faster than lists as they cannot change dynamically. A tuple is defined within
parentheses () where items are separated by commas. We can use the slicing operator [] to
extract items, but we cannot change their values.
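A short sketch of tuple behaviour:

```python
t = (5, "program", 1 + 3j)

# Slicing works exactly as with lists
print(t[1])     # program
print(t[0:2])   # (5, 'program')

# But item assignment raises TypeError, because tuples are immutable
try:
    t[0] = 10
except TypeError:
    print("cannot modify a tuple")
```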
Python Set:
A set is an unordered collection of unique items. A set is defined by values separated by
commas inside braces { }. Items in a set are not ordered. We can perform set operations like
union and intersection on two sets. Sets have unique values; they eliminate duplicates. Since
a set is an unordered collection, indexing has no meaning; hence the slicing operator []
does not work.
>>> a = {1,2,2,3,3,3}
>>> a
{1, 2, 3}
Python Dictionary:
A dictionary is an unordered collection of key-value pairs, defined within braces { } with
each item being a pair of the form key: value. Values are retrieved by key rather than by
position.
>>> d = {'a': 1, 'b': 2}
>>> d['a']
1
Operators in Python
Operators are used to perform operations on variables and values. Python divides the
operators in the following groups:
1. Arithmetic operators
Arithmetic operators are used to perform mathematical operations like addition, subtraction,
multiplication and division.
2. Comparison/Relational operators
Relational operators compare values. They return either True or False according to the
condition.
> Greater than: True if the left operand is greater than the right (x > y)
< Less than: True if the left operand is less than the right (x < y)
>= Greater than or equal to: True if the left operand is greater than or equal to the right (x >= y)
<= Less than or equal to: True if the left operand is less than or equal to the right (x <= y)
== Equal to: True if both operands are equal (x == y)
!= Not equal to: True if the operands are not equal (x != y)
3. Logical operators
Logical operators perform Logical AND, Logical OR and Logical NOT operations.
and Logical AND: True if both the operands are true x and y
or Logical OR: True if either of the operands is true x or y
not Logical NOT: True if operand is false not x
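The three operator groups can be demonstrated in a short sketch:

```python
x, y = 10, 3

# Arithmetic operators
print(x + y, x - y, x * y)    # 13 7 30
print(x / y)                  # 3.333... (division always yields a float)
print(x // y, x % y, x ** y)  # 3 1 1000 (floor division, modulus, power)

# Relational operators return True or False
print(x > y, x <= y)          # True False

# Logical operators combine boolean expressions
print(x > 5 and y > 5)        # False (both must be true)
print(x > 5 or y > 5)         # True  (either may be true)
print(not x > 5)              # False
```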
Control Statements in Python
Control statements determine the order in which the statements of a program are executed.
Python provides conditional statements (if, elif, else) and loop statements (for, while),
along with break and continue to alter loop execution.
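A minimal sketch of the main control statements (conditionals and loops):

```python
# if/elif/else: choose a branch based on a condition
n = 7
if n % 2 == 0:
    parity = "even"
else:
    parity = "odd"
print(n, "is", parity)   # 7 is odd

# for loop: iterate over a sequence of values
total = 0
for i in range(1, 6):    # 1, 2, 3, 4, 5
    total += i
print(total)             # 15

# while loop: repeat as long as the condition stays true
count = 3
while count > 0:
    count -= 1
print(count)             # 0
```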
Anaconda is a free and open-source distribution of the Python and R programming languages
for scientific computing, machine learning and data science that aims to simplify package
management and deployment.
Jupyter Notebook
Notebook documents are a representation of all content visible in the web application,
including inputs and outputs of the computations, explanatory text, mathematics, images,
and rich media representations of objects.
PyCharm IDE
PyCharm is an integrated development environment (IDE) used in computer programming,
specifically for the Python language. It is developed by JetBrains. It provides code
analysis, a graphical debugger, an integrated unit tester, integration with version control
systems, and supports web development with Django as well as data science with Anaconda.
Google COLAB
Colaboratory is a research tool for machine learning education and research. It‘s a Jupyter
notebook environment that requires no setup to use. Google Colab is a free cloud service and
now it supports free GPU! You can: improve your Python programming language coding
skills. To start working with Colab you first need to log in to your google account, then go to
this link https://colab.research.google.com.
There are three main measures of central tendency: the mode, the median and the mean. Each
of these measures describes a different indication of the typical or central value in the
distribution.
Mean
The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average. Looking at the retirement age
distribution again:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The mean is (54 + 54 + 54 + 55 + 56 + 57 + 57 + 58 + 58 + 60 + 60) / 11 = 623 / 11 ≈ 56.6 years.
Median
The median is the middle value in distribution when the values are arranged in ascending or
descending order.
In a distribution with an odd number of observations, the median value is the middle value.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
Looking at the retirement age distribution (which has 11 observations), the median is the
middle value, which is 57 years.
When the distribution has an even number of observations, the median value is the mean of
the two middle values. In the following distribution,
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The middle two values are 56 and 57; therefore the median equals 56.5 years.
Mode
The mode is the most commonly occurring value in a distribution. Consider this dataset
showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
Measure of Dispersion
Measures of spread describe how similar or varied the set of observed values are for a
particular variable (data item). Measures of spread include the range, quartiles and the
interquartile range, variance and standard deviation. The spread of the values can be
measured for quantitative data, as the variables are numeric and can be arranged into a logical
order with a low end value and a high end value.
The variance and the standard deviation are measures of the spread of the data around the
mean. They summarise how close each observed data value is to the mean value.
In datasets with a small spread all values are very close to the mean, resulting in a small
variance and standard deviation. Where a dataset is more dispersed, values are spread further
away from the mean, leading to a larger variance and standard deviation.
The smaller the variance and standard deviation, the more the mean value is indicative of the
whole dataset. Therefore, if all values of a dataset are the same, the standard deviation and
variance are zero.
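The measures above can be computed with Python's built-in statistics module; a sketch using the retirement-age data:

```python
import statistics

# Retirement ages from the example distribution above
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

mean = statistics.mean(ages)           # arithmetic average
median = statistics.median(ages)       # middle value of the sorted data
mode = statistics.mode(ages)           # most frequently occurring value
variance = statistics.pvariance(ages)  # population variance
stdev = statistics.pstdev(ages)        # population standard deviation

print(round(mean, 1), median, mode)    # 56.6 57 54
```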
Programming Steps:
Program:
Conclusion: -
Viva Question:
Signature
Department of Information Technology
Practical No. 5
Aim: To study Python libraries for ML applications such as Pandas and Matplotlib.
Objective:
To understand data preprocessing and analysis using Pandas library
To understand data visualization in the form of 2D graphs and plots using
Matplotlib library
Python libraries commonly used for machine learning include:
Numpy
Scipy
Scikit-learn
Theano
TensorFlow
Keras
PyTorch
Pandas
Matplotlib
Pandas makes importing, analyzing, and visualizing data much easier. It builds on packages
like NumPy and Matplotlib to give you a single, convenient place to do most of your data
analysis and visualization work.
There are many benefits of Python Pandas library, listing them all would probably take more
time than what it takes to learn the library. Therefore, these are the core advantages of using
the Pandas library:
1) Data representation
Pandas provides extremely streamlined forms of data representation. This helps to analyze
and understand data better. Simpler data representation facilitates better results for data
science projects.
It is one of the best advantages of Pandas. What would have taken multiple lines in Python
without any support libraries can simply be achieved through 1-2 lines with the use of
Pandas.
Thus, using Pandas helps to shorten the procedure of handling data. With the time saved, we
can focus more on data analysis algorithms.
Pandas is really powerful. It provides you with a huge set of important commands and
features which are used to easily analyze your data. We can use Pandas to perform various
tasks like filtering the data according to certain conditions, or segmenting and segregating
the data according to preference, etc.
Wes McKinney, the creator of Pandas, made the Python library mainly to handle large
datasets efficiently. Pandas helps to save a lot of time by importing large amounts of data
very fast.
Pandas provides a huge feature set to apply to the data you have so that you can customize,
edit and pivot it according to your own will and desire. This helps to bring the most out of
your data.
Python programming has become one of the most sought after programming languages in the
world, with its extensive amount of features and the sheer amount of productivity it provides.
Therefore, being able to code with Pandas in Python enables you to tap into the power of the
various other features and libraries which you can use with Python. Some of these libraries
are NumPy, SciPy, Matplotlib, etc.
Pandas Library
The primary two components of pandas are the Series and DataFrame.
DataFrames and Series are quite similar in that many operations that you can do with one you
can do with the other, such as filling in null values and calculating the mean.
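A minimal sketch of the two components, with hypothetical values:

```python
import pandas as pd

# A Series is a one-dimensional labelled array
s = pd.Series([10, 20, 30], name="sales")

# A DataFrame is a table whose columns are themselves Series
df = pd.DataFrame({
    "apples":  [3, 2, 0],
    "oranges": [0, 3, 7],
})

# Many operations work identically on both
print(s.mean())                    # 20.0
print(df["apples"].mean())         # 1.666...
print(s.fillna(0).tolist())        # filling null values works on either
```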
With CSV files, all you need is a single line to load in the data:
df = pd.read_csv('purchases.csv')
df
For the movies dataset used below, the movie titles are additionally designated as the index
(via the index_col argument of read_csv).
The first thing to do when opening a new dataset is print out a few rows to keep as a visual
reference. We accomplish this with .head():
movies_df.head()
Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):
movies_df.shape
Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So we
have 1000 rows and 11 columns in our movies DataFrame.
You'll be going to .shape a lot when cleaning and transforming data. For example, you might
filter some rows based on some criteria and then want to know quickly how many rows were
removed.
Handling duplicates
This dataset does not have duplicate rows, but it is always important to verify you aren't
aggregating duplicate rows.
To demonstrate, let's simply just double up our movies DataFrame by appending it to itself:
temp_df = movies_df.append(movies_df)
temp_df.shape
Out: (2000, 11)
Using append() will return a copy without affecting the original DataFrame. We are capturing
this copy in temp_df so we aren't working with the real data.
Notice that calling .shape quickly proves our DataFrame rows have doubled. Now we can try
dropping duplicates:
dropping duplicates:
temp_df = temp_df.drop_duplicates()
temp_df.shape
Out:
(1000, 11)
Just like append(), the drop_duplicates() method will also return a copy of your DataFrame,
but this time with duplicates removed. Calling .shape confirms we're back to the 1000 rows
of our original dataset.
It's a little verbose to keep assigning DataFrames to the same variable like in this example.
For this reason, pandas has the inplace keyword argument on many of its methods. Using
inplace=True will modify the DataFrame object in place:
temp_df.drop_duplicates(inplace=True)
Another important argument for drop_duplicates() is keep, which has three possible options:
'first', 'last', and False.
Since we didn't define the keep argument in the previous example, it defaulted to 'first'.
This means that if two rows are the same, pandas will drop the second row and keep the first
row. Using 'last' has the opposite effect: the first row is dropped.
keep=False, on the other hand, will drop all duplicates. If two rows are the same, then both
will be dropped. Watch what happens to temp_df:
temp_df.shape
Out:
(0, 11)
Since all rows were duplicates, keep=False dropped them all resulting in zero rows being left
over. If you're wondering why you would want to do this, one reason is that it allows you to
locate all duplicates in your dataset. When conditional selections are shown below you'll see
how to do that.
Column cleanup
Many times datasets will have verbose column names with symbols, upper and lowercase
words, spaces, and typos. To make selecting data by column name easier we can spend a little
time cleaning up their names.
movies_df.columns
Out:
Index([...], dtype='object')
Not only does .columns come in handy if you want to rename columns by allowing for simple
copy and paste, it's also useful if you need to understand why you are receiving a KeyError
when selecting data by column.
We can use the .rename() method to rename certain or all columns via a dict. We don't want
parentheses in the column names, so let's rename those:
movies_df.rename(columns={
}, inplace=True)
movies_df.columns
Out:
Excellent. But what if we want to lowercase all names? Instead of using .rename() we could
also set a list of names to the columns like so:
movies_df.columns
Out:
But that's too much work. Instead of just renaming each column manually we can do a list
comprehension:
Out:
list (and dict) comprehensions come in handy a lot when working with pandas and data in
general.
It's a good idea to lowercase, remove special characters, and replace spaces with underscores
if you'll be working with a dataset for some time.
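A sketch of such a list comprehension, on a hypothetical DataFrame with messy column names:

```python
import pandas as pd

# A hypothetical DataFrame with messy column names
df = pd.DataFrame({"Rank": [1], "Runtime (Minutes)": [120]})

# One list comprehension lowercases every name and replaces spaces with underscores
df.columns = [col.lower().replace(" ", "_") for col in df.columns]
print(list(df.columns))  # ['rank', 'runtime_(minutes)']
```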
To make necessary statistical inferences, it becomes necessary to visualize your data, and
Matplotlib is one such solution for Python users. It is a very powerful plotting library
useful for those working with Python and NumPy. The most used module of Matplotlib is
Pyplot, which provides an interface like MATLAB but uses Python and is open source.
General Concepts:
Figure: It is a whole figure which may contain one or more than one axes (plots). You can
think of a Figure as a canvas which contains plots.
Axes: It is what we generally think of as a plot. A Figure can contain many Axes. It contains
two or three (in the case of 3D) Axis objects. Each Axes has a title, an x-label and a y-label.
Axis: They are the number line like objects and take care of generating the graph limits.
Artist: Everything which one can see on the figure is an artist like Text objects, Line2D
objects, collection objects. Most Artists are tied to Axes.
Matplotlib Library
Pyplot is a module of Matplotlib which provides simple functions to add plot elements like
lines, images, text, etc. to the current axes in the current figure.
plot(x-axis values, y-axis values) — plots a simple line graph with x-axis values against y-
axis values
set_title(―string‖) — an axes level method used to set the title of subplots in a figure
xticks(index, categorical variables) — Get or set the current tick locations and labels of the x-
axis
xlim(start value, end value) — used to set the limit of values of the x-axis
ylim(start value, end value) — used to set the limit of values of the y-axis
scatter(x-axis values, y-axis values) — plots a scatter plot with x-axis values against y-axis
values
set_xlabel("string") — axes level method used to set the x-label of the plot specified as a string
set_ylabel("string") — axes level method used to set the y-label of the plot specified as a string
scatter3D(x-axis values, y-axis values) — plots a three-dimensional scatter plot with x-axis
values against y-axis values
plot3D(x-axis values, y-axis values) — plots a three-dimensional line graph with x- axis
values against y-axis values
Here we import Matplotlib's Pyplot module and the NumPy library, as most of the data that we
will be working with will be in the form of arrays.
We pass two arrays as our input arguments to Pyplot's plot() method and use the show() method
to invoke the required plot. Note that the first array appears on the x-axis and the second
array appears on the y-axis of the plot. Now that our first plot is ready, let us add the title and
name the x-axis and y-axis using the methods title(), xlabel() and ylabel() respectively.
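A minimal sketch of these steps (the array values here are just illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a lab session show() opens a window
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([1, 4, 9, 16])

plt.plot(x, y)            # first array goes on the x-axis, second on the y-axis
plt.title("Squares")      # title of the plot
plt.xlabel("x")           # x-axis name
plt.ylabel("x squared")   # y-axis name
# plt.show() would display the figure in an interactive session
```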
We can also specify the size of the figure using the method figure() and passing the values as a
tuple of the length of rows and columns to the argument figsize.
With every X and Y argument, you can also pass an optional third argument in the form of a
string which indicates the colour and line type of the plot. The default format is 'b-', which
means a solid blue line. In the figure below we use 'go', which means green circles. Likewise,
we can make many such combinations to format our plot.
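A small sketch of the format-string idea (the data is made up):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for a headless run
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 5)
# Third argument 'go' = green circles; the default 'b-' is a solid blue line
line, = plt.plot(x, x ** 2, "go")
print(line.get_color())   # 'g'
print(line.get_marker())  # 'o'
```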
We can also plot multiple sets of data by passing in multiple sets of arguments of X and Y
axis in the plot()method as shown.
We can use the subplot() method to add more than one plot in one figure. In the image below,
we used this method to separate the two graphs which we plotted on the same axes in the
previous example. The subplot() method takes three arguments: nrows, ncols and
index. They indicate the number of rows, the number of columns and the index number of the
sub-plot. For instance, in our example, we want to create two sub-plots in one figure such that
they come in one row and two columns, and hence we pass arguments (1,2,1) and (1,2,2) in the
subplot() method. Note that we have separately used the title() method for both the subplots. We
use the suptitle() method to make a centralized title for the figure.
If we want our sub-plots in two rows and single column, we can pass arguments (2,1,1)
and (2,1,2)
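The one-row, two-column layout described above can be sketched like this (sample data is made up):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for a headless run
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 5)

plt.subplot(1, 2, 1)   # nrows=1, ncols=2, first sub-plot
plt.plot(x, x)
plt.title("Linear")

plt.subplot(1, 2, 2)   # nrows=1, ncols=2, second sub-plot
plt.plot(x, x ** 2)
plt.title("Quadratic")

plt.suptitle("Two sub-plots in one figure")  # centralized title for the figure
```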
The above way of creating subplots becomes a bit tedious when we want many subplots in
our figure. A more convenient way is to use the subplots() method. Notice the extra 's'
in the method name. This method takes two arguments, nrows and ncols, as the number of rows
and the number of columns respectively. It creates two objects, figure and axes, which
we store in the variables fig and ax; these can be used to change the figure-level and axes-level
attributes respectively. Note that these variable names are chosen arbitrarily.
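A short sketch of the subplots() approach under the same assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for a headless run
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 5)

# subplots() returns a Figure and an array of Axes objects in one call
fig, ax = plt.subplots(nrows=2, ncols=1)
ax[0].plot(x, x)
ax[0].set_title("Linear")       # axes-level title for the first sub-plot
ax[1].plot(x, x ** 2)
ax[1].set_title("Quadratic")    # axes-level title for the second sub-plot
fig.suptitle("Created with subplots()")
```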
Program:
Conclusion:-
Signature
Department of Information Technology
Viva Question:
Practical No: 6
Aim: Implement the FIND-S algorithm for finding the most specific hypothesis based on a
given set of training data samples. Read the training data from a .CSV file.
In order to understand the Find-S algorithm, you need to have a basic idea of the following
concepts as well:
Concept Learning
General Hypothesis
Specific Hypothesis
1. Concept Learning
Let us understand concept learning with a real-life example. Most human learning is
based on past instances or experiences. For example, we are able to identify any type of
vehicle based on a certain set of features like make, model, etc., that are defined over a large
set of features.
These special features differentiate the set of cars, trucks, etc from the larger set of vehicles.
These features that define the set of cars, trucks, etc are known as concepts.
Similar to this, machines can also learn from concepts to identify whether an object belongs
to a specific category or not. Any algorithm that supports concept learning requires the
following:
• Training Data
• Target Concept
• Actual Data Objects
2. General Hypothesis
A hypothesis, in general, states the relationship between the major variables without pinning
down the details. For example, 'I want a burger' is a general hypothesis. The most general
hypothesis is represented as:
G = {'?', '?', '?', ..., '?'}
3. Specific Hypothesis
The specific hypothesis fills in all the important details about the variables given in the
general hypothesis. A more specific version of the example above would be: 'I want a
cheeseburger with a chicken pepperoni filling with a lot of lettuce.' The most specific
hypothesis is represented as:
S = {'Φ', 'Φ', 'Φ', ..., 'Φ'}
Now that we are done with the basic explanation of the Find-S algorithm, let us take a look at
how it works.
Looking at the data set, we have six attributes and a final attribute that defines the positive or
negative example. In this case, yes is a positive example, which means the person will go for
a walk.
We start with the most specific hypothesis as our initial hypothesis, and now we will consider
each example one by one, but only the positive examples.
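A minimal sketch of FIND-S on a hypothetical EnjoySport-style table (in the practical itself the rows would be read from the .CSV file, e.g. with csv.reader):

```python
# Hypothetical training data: six attributes plus a final Yes/No label
data = [
    ["Sunny", "Warm", "Normal", "Strong", "Warm", "Same", "Yes"],
    ["Sunny", "Warm", "High",   "Strong", "Warm", "Same", "Yes"],
    ["Rainy", "Cold", "High",   "Strong", "Warm", "Change", "No"],
    ["Sunny", "Warm", "High",   "Strong", "Cool", "Change", "Yes"],
]

# Start with the most specific hypothesis: every attribute is Φ
hypothesis = ["Φ"] * 6

for row in data:
    if row[-1] == "Yes":               # consider only the positive examples
        for i, value in enumerate(row[:-1]):
            if hypothesis[i] == "Φ":   # first positive example initialises h
                hypothesis[i] = value
            elif hypothesis[i] != value:
                hypothesis[i] = "?"    # generalise where attributes disagree

print(hypothesis)  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```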
Program:
Output:
Conclusion: -
Signature
Department of Information Technology
Viva Question:
Practical No: 7
Objective:
To understand the concept of simple linear regression
To apply simple linear regression on an actual dataset to do prediction
Theory:
Types of Learning
A machine is said to be learning from past experiences (data fed in) with respect to some
class of tasks if its performance in a given task improves with the experience.
1) Supervised Learning
How it works: This algorithm consists of a target/outcome variable (or dependent variable)
which is to be predicted from a given set of predictors (independent variables). Using this
set of variables, we generate a function that maps inputs to the desired outputs. The training
process continues until the model achieves a desired level of accuracy on the training data.
Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN,
Logistic Regression etc.
2) Unsupervised Learning
How it works: In this algorithm, we do not have any target or outcome variable to predict or
estimate. It is used for clustering a population into different groups, which is widely used for
segmenting customers into different groups for specific interventions. Examples of
Unsupervised Learning: Apriori algorithm, K-means.
3) Reinforcement Learning:
How it works: Using this algorithm, the machine is trained to make specific decisions. It
works this way: the machine is exposed to an environment where it trains itself continually
using trial and error. This machine learns from past experience and tries to capture the best
possible knowledge to make accurate business decisions. Example of Reinforcement
Learning: Markov Decision Process
4) Semi-Supervised Learning
In this type of learning, the algorithm is trained upon a combination of labeled and unlabeled
data. Typically, this combination will contain a very small amount of labeled data and a very
large amount of unlabeled data. The basic procedure involved is that first, the programmer
will cluster similar data using an unsupervised learning algorithm and then use the existing
labeled data to label the rest of the unlabeled data.
Example: Gmail classifies mails in more than one classes like social, promotions, updates,
forum.
For example, consider predicting wind speed: the output does not take discrete values but is
continuous within a particular range. The goal here is to predict a value as close to the actual
output value as our model can, and evaluation is then done by calculating the error value. The
smaller the error, the greater the accuracy of our regression model.
• Linear Regression
• Nearest Neighbor
• Gaussian Naive Bayes
• Decision Trees
• Support Vector Machine (SVM)
• Random Forest
Regression Analysis
Regression analysis is a powerful statistical method that allows you to examine the
relationship between two or more variables of interest. While there are many types of
regression analysis, at their core they all examine the influence of one or more independent
variables on a dependent variable.
In order to understand regression analysis fully, it's essential to comprehend the following
terms:
Dependent Variable: This is the main factor that you're trying to understand or predict.
Independent Variables: These are the factors that you hypothesize have an impact on your
dependent variable.
There are multiple benefits of using regression analysis. Among them, regression analysis
allows us to compare the effects of variables measured on different scales, such as the effect
of price changes and the number of promotional activities. These benefits help market
researchers / data analysts / data scientists to eliminate and evaluate the best set of variables
to be used for building predictive models.
There are various kinds of regression techniques available to make predictions. These
techniques are mostly driven by three metrics (number of independent variables, type of
dependent variables and shape of regression line).
Types of Regression:
• Linear Regression
• Logistic Regression
• Polynomial Regression
• Stepwise Regression
• Ridge Regression
Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually among
the first few topics which people pick while learning predictive modeling. In this technique,
the dependent variable is continuous, the independent variable(s) can be continuous or
discrete, and the nature of the regression line is linear.
Linear Regression establishes a relationship between dependent variable (Y) and one or more
independent variables (X) using a best fit straight line (also known as regression line).
It is used to estimate real values (cost of houses, number of calls, total sales etc.) based on
continuous variable(s). Here, we establish relationship between independent and dependent
variables by fitting a best line. This best fit line is known as regression line and represented
by a linear equation
Y = a*X + b
The best way to understand linear regression is to relive this experience of childhood. Let us
say, you ask a child in fifth grade to arrange people in his class by increasing order of weight,
without asking them their weights! What do you think the child will do? He / she would
likely look (visually analyze) at the height and build of people and arrange them using a
combination of these visible parameters. This is linear regression in real life! The child has
actually figured out that height and build would be correlated to the weight by a relationship,
which looks like the equation above.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
These coefficients a and b are derived by minimizing the sum of squared differences of the
distances between the data points and the regression line.
Look at the example below. Here we have identified the best fit line having the linear
equation y = 0.2811x + 13.9. Now, using this equation, we can find the weight, knowing the
height of a person.
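Plugging a height into that fitted equation gives the predicted weight, for example:

```python
# Predicting weight from height with the fitted line y = 0.2811*x + 13.9
def predict_weight(height_cm):
    return 0.2811 * height_cm + 13.9

# For a height of 160 cm the line predicts roughly 58.88 kg
print(round(predict_weight(160), 2))  # 58.88
```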
Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear
Regression. Simple Linear Regression is characterized by one independent variable, while
Multiple Linear Regression (as the name suggests) is characterized by multiple (more than
one) independent variables. While finding the best fit line, you can also fit a polynomial or
curvilinear relationship; these are known as polynomial or curvilinear regression.
The difference between simple linear regression and multiple linear regression is that,
multiple linear regression has (>1) independent variables, whereas simple linear regression
has only 1 independent variable.
In simple linear regression, each observation consists of two values. One value is for the
dependent variable and one value is for the independent variable.
Simple Linear Regression Analysis: The simplest form of a regression analysis uses one
dependent variable (y) and one independent variable (x). In this simple model, a straight line
approximates the relationship between the dependent variable and the independent variable.
The simple linear regression equation is represented as E(y) = β0 + β1x, and it is graphed as
a straight line.
A regression line can show a positive linear relationship, a negative linear relationship, or no
relationship. If the graphed line in a simple linear regression is flat (not sloped), there is no
relationship between the two variables. If the regression line slopes upward with the lower
end of the line at the y intercept (axis) of the graph, and the upper end of line extending
upward into the graph field, away from the x intercept (axis) a positive linear relationship
exists. If the regression line slopes downward with the upper end of the line at the y intercept
(axis) of the graph, and the lower end of line extending downward into the graph field,
toward the x intercept (axis) a negative linear relationship exists.
y = B0 + B1 * x
This is a line where y is the output variable we want to predict, x is the input variable we
know and B0 and B1 are coefficients that we need to estimate that move the line around.
Technically, B0 is called the intercept because it determines where the line intercepts the y-
axis. In machine learning we can call this the bias, because it is added to offset all predictions
that we make. The B1 term is called the slope because it defines the slope of the line or how x
translates into a y value before we add our bias.
The goal is to find the best estimates for the coefficients to minimize the errors in predicting
y from x.
Simple regression is great, because rather than having to search for values by trial and error
or calculate them analytically using more advanced linear algebra, we can estimate them
directly from our data.
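A small sketch of estimating the coefficients directly from data with the closed-form least-squares formulas (the sample points are made up so the answer is easy to check):

```python
# Hypothetical sample lying exactly on y = 2*x + 1, so the estimates are easy to verify
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# B1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
     sum((xi - mean_x) ** 2 for xi in x)
# B0 = mean_y - B1 * mean_x
b0 = mean_y - b1 * mean_x

print(b1, b0)  # 2.0 1.0
```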
Program:
Output:
Conclusion: -
Signature
Department of Information Technology
Viva Question:
1) What is linear regression?
2) What is the use of regularisation?
3) How to choose the value of the parameter learning rate (α)?
Practical No: 8
Aim: To create and evaluate a Naïve Bayes classification model for the Iris dataset.
Theory:
Classification Algorithms:
- Random Forest
- Decision Trees
- Naïve Bayes Classifier
- Logistic Regression
- Support Vector Machine
In this practical we will be focusing on Naïve Bayes classifier.
This classification algorithm is based on applying Bayes' theorem with strong (naïve)
independence assumptions between the features. Since it is based on Bayes' theorem, it is a
probabilistic classifier, which means that it predicts on the basis of the probability of an
object.
Naïve Bayes algorithm is one of the simplest and most effective classification algorithms that
can produce fast models with high accuracy.
In probability theory, Bayes theorem is used to describe the probability of an event based on
prior knowledge of conditions that might be related to the event. It depends on the concept of
conditional probability.
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is the Posterior Probability, or the probability of event A occurring given that event B is true.
P(B|A) is the Likelihood Probability, or the probability of event B occurring given that event A is true.
P(A) and P(B) are the probabilities of A and B individually, without any conditions.
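A quick numeric check of the theorem with made-up probabilities:

```python
# Hypothetical values: P(A) = 0.3, P(B|A) = 0.8, P(B) = 0.5
p_a = 0.3
p_b_given_a = 0.8
p_b = 0.5

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))  # 0.48
```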
Advantages of Naïve Bayes Classifier:
- It is one of the fastest and easiest Machine Learning models for classification
problems.
- It can be used for binary as well as multi-class classification.
Disadvantages of Naïve Bayes Classifier:
- It assumes that all features are independent, so it cannot learn the relationship
between features.
Types of Naïve Bayes Model:
- Gaussian: This model assumes that the features follow a Gaussian or normal
distribution.
- Multinomial: This is used when the data is multinomially distributed. It is primarily
used for document classification problems.
- Bernoulli: It works similar to the Multinomial classifier, but the predictor variables
are independent Boolean variables.
PROGRAM:
# importing all packages
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Loading the Iris dataset using the 'read_csv()' function from the pandas library
# The read_csv() function reads the contents of a dataset stored in .csv or
# comma separated values format
data = pd.read_csv("Iris.csv")
print(data.head())

# Separating the dependent variable and the independent variables using dataframe
# slicing techniques from the pandas library
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# The first parameter is the set of independent variables, the second parameter is the
# dependent variable
# The 'test_size' parameter determines the size of the testing dataset. Here we will be
# using 20% of the dataset for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Model creation
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_predictions = classifier.predict(X_test)
print(y_predictions)

# Evaluating the model on the test set
print(confusion_matrix(y_test, y_predictions))
print(accuracy_score(y_test, y_predictions))
OUTPUT:
CONCLUSION:
Signature
Department of Information Technology
Viva Questions