
Vidarbha Youth Welfare Society’s

Prof. Ram Meghe Institute of Technology & Research

Badnera, Amravati (M.S.) 444701

Practical Record

Semester VII
(Subject code: 7IT07)
Subject: Machine Learning Lab
(Academic Year: 2022-2023)

Name of Student: ______________________________________________

Roll No: ____________ __________ Section: _______________________

Department of Information Technology


Phone: (0721) - 2580402/2681246
Fax No. 0721 – 2681337
Website : www.mitra.ac.in

Sant Gadge Baba Amravati University


Vidarbha Youth Welfare Society’s

Prof. Ram Meghe Institute of Technology & Research


Badnera, Amravati (M.S.) 444701

CERTIFICATE

This is to certify that Mr. / Miss___________________________________________


Enrollment No. ____________________ Roll No.__________ Section: _____ of B.E.
(IT) Semester VII has satisfactorily completed the term work of the subject
Machine Learning Lab (7IT07) prescribed by Sant Gadge Baba Amravati University
during the academic term 2022-2023.

Date: Signature of the Faculty


Department of Information Technology
INDEX

Sr.No   Title

1       Vision and Mission of the Institute and Program
2       Program Educational Objectives and Program Outcomes
3       Syllabus, Course Learning Objectives and Course Outcomes
4       Assessment Format i.e., ACIPV (Guidelines)
5       Aim of the Lab Manual
6       List of Experiments
7       Experiment write-ups, each consisting of:
        a. Title/Aim
        b. Apparatus/Components used
        c. Theory
        d. Procedure/Steps
        e. Observations
        f. Output/Result
        g. Conclusion
        h. Viva Questions

1. Vision and Mission of the Institute and Program:


• Mission & Vision statement of the Institute:

VISION
To become a pace-setting centre of excellence believing in three universal values, namely Synergy, Trust and Passion, with zeal to serve the Nation in the global scenario.

MISSION
M1: To achieve the highest standard in technical education through state-of-the-art
pedagogy and enhanced industry-institute linkages.
M2: To inculcate the culture of research in core and emerging areas.
M3: To strive for overall development of students so as to nurture ingenious technocrats as
well as responsible citizens.

• Mission & Vision statement of the Department of Information Technology:

VISION
Meeting the growing needs of industry and society through
Information Technology with ethical values.
MISSION

• To become a leading education centre by inspiring the students to become competent IT
Engineers
• To make students more innovative and research oriented, and to improve their ability to
provide appropriate support for industry and society
• To train students to adopt life-long learning with ethical values.

• Quality Policy of the Institute

QUALITY POLICY
“Striving for Excellence in the Quality Engineering Education”


2. Program Educational Objectives & Program Outcomes (PEOs & POs):


• Program Educational Objectives (PEOs)
PEO1: To provide a sustainable foundation in mathematics, science and engineering to enable
students to become competent IT Engineers.
PEO2: To deliver a comprehensive education in Information Technology and related engineering
fields to ensure the core competency to be successful in Industry and in higher studies.
PEO3: To analyze, design and create solutions in alignment with industry and emerging trends in
multidisciplinary environments.
PEO4: To communicate effectively as a part of team for self-learning, lifelong learning and career
enhancement in industry and society.
• Program Outcomes (POs)
On completion of the course a graduate of Information Technology program will be able to:
PO1: Engineering Knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals and an engineering specialization to the solution of complex engineering problems.
PO2: Problem Analysis: Identify, formulate, review research literature, and analyse complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences and engineering sciences.
PO3: Design/Development of Solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
PO4: Conduct Investigations of Complex Problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of
the information to provide valid conclusions for complex problems.
PO5: Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modelling to complex engineering activities
with an understanding of the limitations.
PO6: The Engineer and Society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
the professional engineering practice.
PO7: Environment and Sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
PO9: Individual and Team Work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give and
receive clear instructions.
PO11: Project Management and Finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
PO12: Life-long Learning: Recognize the need for, and have the preparation and ability to engage
in independent and lifelong learning in the broadest context of technological change.
• Program Specific Outcomes (PSOs)
PSO1: Apply core aspects of Information Technology, Networking, Internet of Things and Security
to pursue successful career in IT industry.
PSO2: Develop, analyze and find IT solutions through programming paradigm, Web Designing and
Cloud Computing.


3. Syllabus (Theory + Practical)


Course Number and Title : Machine Learning (7IT04/7IT07)
Credits                 : 4 (Theory + Lab)
Name of Faculty         : Prof. S. N. Sarda
Course Type             : Theory + Lab
Compulsory/Elective     : Elective
Teaching Methods        : Lecture: 3 Hrs/week
                          Laboratory: 2 Hrs/week
Course Assessment       : Exams: 2 Class Tests
                          Semester-end examination by SGBAU
Grading Policy          : 20% - 2 class tests and 1 Improvement Test
                          80% - Semester-end examination
                          25 Internal Marks + 25 Marks External Viva

Unit wise Course Contents in brief:


UNIT 1: Machine Learning: The three different types of machine learning, Introduction to the
basic terminology and notations, A roadmap for building machine learning systems, Using
Python for machine learning, Training Simple Machine Learning Algorithms for
Classification, Artificial neurons – a brief glimpse into the early history of machine learning,
Implementing a perceptron learning algorithm in Python, Adaptive linear neurons and the
convergence of learning.
UNIT 2: A Tour of Machine Learning Classifiers Using scikit-learn, Choosing a classification
algorithm, First steps with scikit-learn – training a perceptron, Modeling class probabilities
via logistic regression, Maximum margin classification with support vector machines,
Solving nonlinear problems using a kernel SVM, Decision tree learning, K-nearest neighbors
– a lazy learning algorithm.
UNIT 3: Data Preprocessing, Hyperparameter Tuning: Building Good Training Sets, Dealing
with missing data, Handling categorical data, Partitioning a dataset into separate training and
test sets, Bringing features onto the same scale, Selecting meaningful features, Assessing
feature importance with random forests, Compressing Data via Dimensionality Reduction,
Unsupervised dimensionality reduction via principal component analysis, Supervised data
compression via linear discriminant analysis, Using kernel principal component analysis for
nonlinear mappings, Learning Best Practices for Model Evaluation and Hyperparameter
Tuning, Streamlining workflows with pipelines, Using k-fold cross-validation to assess
model performance.
UNIT 4: Regression Analysis: Predicting Continuous Target Variables, Introducing linear
regression, Exploring the Housing dataset, Implementing an ordinary least squares linear
regression model, Fitting a robust regression model using RANSAC, Evaluating the
performance of linear regression models, Using regularized methods for regression, Turning a
linear regression model into a curve – polynomial regression.
UNIT 5: Dealing with nonlinear relationships using random forests, Working with Unlabeled
Data – Clustering Analysis, Grouping objects by similarity using k-means, Organizing
clusters as a hierarchical tree, Locating regions of high density via DBSCAN.


UNIT 6: Multilayer Artificial Neural Network and Deep Learning: Modeling complex
functions with artificial neural networks, Classifying handwritten digits, Training an artificial
neural network, About the convergence in neural networks, A few last words about the neural
network implementation, Parallelizing Neural Network Training with TensorFlow, TensorFlow
and training performance.
Text Books:
1. Sebastian Raschka and Vahid Mirjalili, "Python Machine Learning: Machine Learning and
Deep Learning with Python, scikit-learn, and TensorFlow".
Reference Books:
1. Andriy Burkov, "The Hundred-Page Machine Learning Book".
2. Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems".
3. Andreas C. Müller and Sarah Guido, "Introduction to Machine Learning with Python: A
Guide for Data Scientists".
4. Chris Albon, "Machine Learning with Python Cookbook: Practical Solutions from
Preprocessing to Deep Learning".

Units covered in the course and contact hours per unit:
Unit 1: 6 hrs   Unit 2: 6 hrs   Unit 3: 6 hrs
Unit 4: 6 hrs   Unit 5: 6 hrs   Unit 6: 6 hrs


4. Assessment Format i.e., ACIPV (Guidelines):


Guidelines for Awarding Internal Marks for Practical
Each experiment/ practical carries 25 marks. The student shall be evaluated for 25 marks as
per the following guidelines. At the end of the semester, internal assessment marks for
practical shall be the average of marks scored in all the experiments.
a. Attendance (5 marks): These 5 marks are awarded for the regularity of a student. If the
student is present at the scheduled time, he/she will be awarded 5 marks; otherwise, 2.5
marks will be given for attendance.
b. Competency (5 marks): Here the basic aim is to check whether the student has
developed the skill to write code on his/her own and debug it. If a student executes the
practical on the scheduled day within the allocated time, he/she will be awarded full marks.
Otherwise, marks will be given according to the level of completion/execution of the practical.
c. Innovation (5 marks): Here the basic objective is to explore innovative ideas from
the students with respect to the corresponding practical, or how innovatively they interpret
the aim of the practical. Students are expected to be precisely aware of the scope of the
practical so that, in future, they can apply the task in various applications.
d. Performance/Participation in group activity (5 marks): These marks are given for how
actively a student participates in the group. If the student performs the practical as part of
a group, he/she must participate actively.
e. Viva-Voce (5 marks): These 5 marks will be totally based on the performance in the
viva-voce. There shall be a viva-voce on the completion of each practical in the same
practical session. The student shall get the record checked before the next practical.
Assessment Sheet

Regular Assessment of Experiment during the practical session (Sample Sheet)

Regular Assessment
Year: -
Branch: -

Roll no.   A          C          I          P          V          Total
           (5 marks)  (5 marks)  (5 marks)  (5 marks)  (5 marks)  (25 Marks)

(A-Attendance, C-Competency, I-Innovation, P-Performance, V-Viva)


5. Aim of the Lab Manual:


This lab manual is designed for the Final year (VII Semester) students of Information
Technology for the Machine Learning course (7IT04/7IT07). The main aim of this manual is to
provide guidelines to students for smooth execution of lab sessions. It has been prepared as
per the university curriculum and covers various aspects of Machine Learning.


6. List of Experiments

Department of Information Technology

Subject: Machine Learning Lab                              Semester: VII
University Code: 7IT07

LIST OF EXPERIMENTS

Sr. No.   Title                                                       Date    Remark

1         To install Anaconda and Python

2         To understand how to get datasets for Machine Learning

3         To study Data Preprocessing in Machine Learning

4         Write a Python program to compute
          Central tendency measures: Mean, Median, Mode
          Measures of Dispersion: Variance, Standard Deviation

5         To study Python libraries for ML applications such as
          Pandas and Matplotlib

6         Implement the FIND-S algorithm for finding the most
          specific hypothesis based on a given set of training data
          samples. Read the training data from a .CSV file.

7         Write a Python program to implement Simple Linear
          Regression.

8         To create and evaluate a Naïve Bayes classification model
          for the Iris dataset.


Practical No. 1

Aim: To install Anaconda and Python

Software Required: Python

Theory: In order to use Python for machine learning, we need to install it on our computer
system with a compatible Integrated Development Environment (IDE).

Anaconda distribution is a free and open-source platform for the Python/R programming
languages. It can be easily installed on any OS such as Windows, Linux, and macOS. It
provides more than 1500 Python/R data science packages which are suitable for developing
machine learning and deep learning models. Anaconda distribution provides installation of
Python with various IDEs such as Jupyter Notebook, Spyder, Anaconda Prompt, etc. Hence
it is a very convenient packaged solution which you can easily download and install on your
computer. It will automatically install Python and some basic IDEs and libraries with it.

Below some steps are given to show the downloading and installing process of Anaconda and
IDE:

Step-1: Download Anaconda Python:

To download Anaconda on your system, firstly open your favorite browser and type
Download Anaconda Python, and then click on the first link as given in the below image.
Alternatively, you can directly download it by clicking on this
link: https://www.anaconda.com/distribution/#download-section.

After clicking on the first link, you will reach the download page of Anaconda, as shown in
the below image:


Anaconda is available for Windows, Linux, and macOS, so you can download it as per your
OS type by clicking on the available options shown in the below image. It will provide you
with Python 2.7 and Python 3.7 versions; since the latest version is 3.7, we will download
the Python 3.7 version. After clicking on the download option, it will start downloading on
your computer.


Step-2: Install Anaconda Python (Python 3.7 version):

Once the downloading process gets completed, go to Downloads → double-click on the ".exe"
file (Anaconda3-2019.03-Windows-x86_64.exe) of Anaconda. It will open a setup window
for Anaconda installation as given in the below image; then click on Next.

It will open a License Agreement window; click on the "I Agree" option and move further.


In the next window, you will get two options for installations as given in the below image.
Select the first option (Just me) and click on Next

Now you will get a window for installing location, here, you can leave it as default or change
it by browsing a location, and then click on Next. Consider the below image:

Now select the second option, and click on install.


Once the installation gets completed, click on Next.

Now the installation is completed; tick the checkbox if you want to learn more about Anaconda
and Anaconda Cloud. Click on Finish to end the process.


Note: Here, we will use the Spyder IDE to run Python programs.

Step-3: Open Anaconda Navigator


o After successful installation of Anaconda, use Anaconda Navigator to launch a Python
IDE such as Spyder or Jupyter Notebook.
o To open Anaconda Navigator, press the Windows key, search for Anaconda Navigator,
and click on it. Consider the below image:

o After opening the Navigator, launch the Spyder IDE by clicking on the Launch button
given below Spyder. It will open the Spyder IDE on your system.


Run your Python program in the Spyder IDE:

Open the Spyder IDE; it will look like the below image.

Write your first program, and save it using the .py extension.

Run the program using the triangle Run button.

You can check the program's output in the console pane at the bottom right side.
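For reference, a minimal sketch of a first program of the kind intended here (the file name and values are illustrative; any short script will do):

# first_program.py - a minimal first Python program
print("Hello, Machine Learning Lab!")
a = 5
b = 7
print("Sum of", a, "and", b, "is", a + b)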


Conclusion:

Viva Question:

1) Why was Machine Learning introduced?

2) What is Machine Learning?
3) What is Anaconda?
4) What are the different types of Machine Learning algorithms?

Signature

Department of Information Technology


Practical No. 2

Aim: To understand how to get datasets for Machine Learning

Software required: Python

Theory:

A dataset is a collection of data in which data is arranged in some order. A dataset can
contain anything from a series or an array to a database table. The below table shows an
example of a dataset:

Country    Age    Salary    Purchased
India      38     48000     No
France     43     45000     Yes
Germany    30     54000     No
France     48     65000     No
Germany    40               Yes
India      35     58000     Yes

A tabular dataset can be understood as a database table or matrix, where each column
corresponds to a particular variable, and each row corresponds to a record of the
dataset. The most supported file type for a tabular dataset is "Comma-Separated
Values," or CSV. But to store "tree-like data," we can use the JSON file format more efficiently.

Types of data in datasets

o Numerical data: Such as house price, temperature, etc.
o Categorical data: Such as Yes/No, True/False, Blue/Green, etc.
o Ordinal data: These data are similar to categorical data but can be measured on the
basis of comparison.

Note: A real-world dataset is of huge size, which is difficult to manage and process at the
initial level. Therefore, to practice machine learning algorithms, we can use any dummy
dataset.


Need of Dataset

To work with machine learning projects, we need a huge amount of data, because, without
the data, one cannot train ML/AI models. Collecting and preparing the dataset is one of the
most crucial parts while creating an ML/AI project.

The technology behind any ML project cannot work properly if the dataset is not
well prepared and pre-processed.

During the development of the ML project, the developers completely rely on the datasets. In
building ML applications, datasets are divided into two parts:

o Training dataset
o Test dataset

Popular sources for Machine Learning datasets

Below is a list of dataset sources which are freely available for the public to work on:


1. Kaggle Datasets

Kaggle is one of the best sources for providing datasets for Data Scientists and Machine
Learners. It allows users to find, download, and publish datasets in an easy way. It also
provides the opportunity to work with other machine learning engineers and solve difficult
Data Science related tasks.

Kaggle provides a high-quality dataset in different formats that we can easily find and
download.

2. UCI Machine Learning Repository

UCI Machine learning repository is one of the great sources of machine learning datasets.
This repository contains databases, domain theories, and data generators that are widely used
by the machine learning community for the analysis of ML algorithms.


Since the year 1987, it has been widely used by students, professors, and researchers as a
primary source of machine learning datasets.

It classifies the datasets as per the problems and tasks of machine learning such
as Regression, Classification, Clustering, etc. It also contains some of the popular datasets
such as the Iris dataset, Car Evaluation dataset, Poker Hand dataset, etc.

3. Datasets via AWS

We can search, download, access, and share the datasets that are publicly available via AWS
resources. These datasets can be accessed through AWS resources but provided and
maintained by different government organizations, researches, businesses, or individuals.

Anyone can analyze and build various services using the shared data via AWS resources. The
shared datasets on the cloud help users to spend more time on data analysis rather than on
data acquisition.

This source provides the various types of datasets with examples and ways to use the dataset.
It also provides the search box using which we can search for the required dataset. Anyone
can add any dataset or example to the Registry of Open Data on AWS.


4. Google's Dataset Search Engine

The Google dataset search engine is a search engine launched by Google on September 5,
2018. This source helps researchers to get online datasets that are freely available for use.

5. Microsoft Datasets

Microsoft has launched the "Microsoft Research Open Data" repository with a
collection of free datasets in various areas such as natural language processing, computer
vision, and domain-specific sciences.

Using this resource, we can download the datasets to use on the current device, or we can
also directly use it on the cloud infrastructure.

6. Awesome Public Dataset Collection


The Awesome Public Dataset collection provides high-quality datasets that are arranged in a
well-organized list according to topics such as Agriculture, Biology, Climate, Complex
Networks, etc. Most of the datasets are available for free, but some may not be, so it is
better to check the license before downloading a dataset.

7. Government Datasets

There are different sources to get government-related data. Various countries publish
government data for public use, collected from their different departments.

The goal of providing these datasets is to increase the transparency of government work
among the people and to encourage the use of the data in innovative ways. Below are some
links to government datasets:

o Indian Government dataset(https://data.gov.in/)


o US Government Dataset(https://www.data.gov/)
o Northern Ireland Public Sector Datasets
o European Union Open Data Portal


8. Computer Vision Datasets

Visual Data provides multiple great datasets that are specific to computer vision tasks
such as Image Classification, Video Classification, Image Segmentation, etc.
Therefore, if you want to build a project on deep learning or image processing, then you can
refer to this source.

9. Scikit-learn dataset

Scikit-learn is a great source for machine learning enthusiasts. This source provides both toy
and real-world datasets. These datasets can be obtained from sklearn.datasets package and
using general dataset API.


The toy datasets available on scikit-learn can be loaded using predefined functions such
as load_boston([return_X_y]), load_iris([return_X_y]), etc., rather than importing any file
from external sources. But these datasets are not suitable for real-world projects.
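For example, the Iris toy dataset can be loaded directly from the sklearn.datasets package; a minimal sketch:

# Loading the Iris toy dataset from scikit-learn
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 samples, 4 features
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']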

Conclusion:

Viva Question:

1) What is a dataset?

2) What is CSV?
3) How are datasets created?
4) What is the need of a dataset?
5) What is the use of scikit-learn?

Signature

Department of Information Technology


Practical No. 3
Aim: To study Data Preprocessing in Machine Learning

Software required: Python, Anaconda

Theory:

Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.

When creating a machine learning project, it is not always the case that we come across
clean and formatted data. And while doing any operation with data, it is mandatory to clean it
and put it in a formatted way. So for this, we use the data preprocessing task.

Why do we need Data Preprocessing?

Real-world data generally contains noise and missing values, and may be in an unusable
format which cannot be directly used for machine learning models. Data preprocessing is a
required task for cleaning the data and making it suitable for a machine learning model,
which also increases the accuracy and efficiency of the model.

It involves the below steps:

o Getting the dataset


o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling

1) Get the Dataset

To create a machine learning model, the first thing we required is a dataset as a machine
learning model completely works on data. The collected data for a particular problem in a
proper format is known as the dataset.

Datasets may be of different formats for different purposes; for example, if we want to create a
machine learning model for a business purpose, the dataset will be different from the dataset
required for a liver patient. So each dataset is different from another dataset. To use the
dataset in our code, we usually put it into a CSV file. However, sometimes, we may also need
to use an HTML or xlsx file.


What is a CSV File?

CSV stands for "Comma-Separated Values"; it is a file format which allows us to save
tabular data, such as spreadsheets. It is useful for huge datasets, and we can use these
datasets in programs.

2) Importing Libraries

In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data preprocessing, which are:

• Numpy: The Numpy Python library is used for including any type of mathematical
operation in the code. It is the fundamental package for scientific calculation in
Python. It also supports large, multidimensional arrays and matrices. So, in
Python, we can import it as:

import numpy as np

Here we have used np, which is a short name for Numpy, and it will be used in the whole
program.

• Matplotlib: The second library is Matplotlib, which is a Python 2D plotting library,
and with this library, we need to import the sub-library pyplot. This library is used to
plot any type of chart in Python. It will be imported as below:

import matplotlib.pyplot as mpt

Here we have used mpt as a short name for this library.

• Pandas: The last library is the Pandas library, which is one of the most famous
Python libraries and is used for importing and managing datasets. It is an open-
source data manipulation and analysis library. It will be imported as below:

import pandas as pd

Here, we have used pd as a short name for this library. Consider the below image:

3) Importing the Datasets


Now we need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as a
working directory. To set a working directory in Spyder IDE, we need to follow the
below steps:

1. Save your Python file in the directory which contains the dataset.

2. Go to the File Explorer option in Spyder IDE, and select the required directory.
3. Click the F5 button or the Run option to execute the file.

Note: We can set any directory as a working directory, but it must contain the required
dataset.

Here, in the below image, we can see the Python file along with required dataset.
Now, the current folder is set as a working directory.

read_csv() function:

Now to import the dataset, we will use the read_csv() function of the pandas library, which is
used to read a CSV file and perform various operations on it. Using this function, we can read
a CSV file locally as well as through a URL.

We can use read_csv function as below:

data_set= pd.read_csv('Dataset.csv')


Here, data_set is the name of the variable to store our dataset, and inside the function we have
passed the name of our dataset file. Once we execute the above line of code, it will successfully
import the dataset into our code. We can also check the imported dataset by clicking on the
Variable Explorer section and then double-clicking on data_set. Consider the below image:

As in the above image, indexing starts from 0, which is the default indexing in
Python. We can also change the format of our dataset by clicking on the Format
option.

• Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent
variables) and the dependent variable from the dataset. In our dataset, there are three
independent variables, Country, Age, and Salary, and one dependent variable,
Purchased.

• Extracting independent variable:

To extract the independent variables, we will use the iloc[ ] method of the Pandas
library. It is used to extract the required rows and columns from the dataset.

x= data_set.iloc[:,:-1].values

In the above code, the first colon (:) is used to take all the rows, and the second colon (:) is
for all the columns. Here we have used :-1 because we don't want to take the last column,
as it contains the dependent variable. By doing this, we will get the matrix of features.


By executing the above code, we will get the matrix of features x as output.
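For completeness, the dependent variable (the last column, Purchased) can be extracted the same way; a minimal sketch assuming the same data_set as above:

# Extracting the dependent variable (column index 3, Purchased)
y = data_set.iloc[:, 3].values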

4) Handling Missing data:

The next step of data preprocessing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning
model. Hence it is necessary to handle missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: The first way is commonly used to deal with null values. In
this way, we just delete the specific row or column which consists of null values. But this
way is not very efficient, and removing data may lead to loss of information which will not
give an accurate output.

By calculating the mean: In this way, we calculate the mean of the column or row which
contains a missing value and put it in place of the missing value. This strategy is useful for
features which have numeric data, such as age, salary, year, etc. Here, we will use this
approach.


To handle missing values, we will use the Scikit-learn library in our code, which contains
various tools for building machine learning models. Here we will use the Imputer class
of the sklearn.preprocessing library. Below is the code for it:

# Handling missing data (replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# Fitting the imputer object to the independent variables x
imputer = imputer.fit(x[:, 1:3])
# Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])

Output:

array([['India', 38.0, 68000.0],
       ['France', 43.0, 45000.0],
       ['Germany', 30.0, 54000.0],
       ['France', 48.0, 65000.0],
       ['Germany', 40.0, 65222.22222222222],
       ['India', 35.0, 58000.0],
       ['Germany', 41.111111111111114, 53000.0],
       ['France', 49.0, 79000.0],
       ['India', 50.0, 88000.0],
       ['France', 37.0, 77000.0]], dtype=object)
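Note that in recent versions of scikit-learn (0.22 and later) the Imputer class has been removed in favour of SimpleImputer; an equivalent sketch for newer versions:

import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer replaces NaN entries with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])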

5) Encoding Categorical data:

Categorical data is data which has some categories; in our dataset, there are two
categorical variables, Country and Purchased.

Since a machine learning model works completely on mathematics and numbers, a
categorical variable in the dataset may create trouble while building the model. So it is
necessary to encode these categorical variables into numbers.


For Country variable:

Firstly, we will convert the Country variable into categorical data. To do this, we will
use the LabelEncoder() class from the preprocessing library.
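A minimal sketch of this step, assuming x is the feature matrix extracted earlier with Country in column 0:

# Encoding the Country column into numbers
from sklearn.preprocessing import LabelEncoder

label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
# e.g. 'France' -> 0, 'Germany' -> 1, 'India' -> 2 (alphabetical order)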


6) Splitting the Dataset into the Training set and Test set

In machine learning data preprocessing, we divide our dataset into a training set and a test set.
This is one of the crucial steps of data preprocessing, as by doing this we can enhance the
performance of our machine learning model.
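A typical way to perform this split, assuming x and y are the arrays extracted above (the 80/20 ratio and random_state value are illustrative choices):

# Splitting the dataset: 80% training, 20% test
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)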

Conclusion: -

Viva Question:

1) What is a CSV file?

2) How do we import a dataset?

3) What is the use of Matplotlib?

4) What is the use of NumPy?

5) How do we extract the independent variables?

Signature

Department of Information Technology


Practical No. 4

Aim: Write a Python program to compute

a) Central tendency measures: Mean, Median, Mode


b) Measure of Dispersion: Variance, Standard Deviation

Objective:
• To understand the programming constructs in Python
• To understand basic statistics concepts and implement their formulae using
Python

Software required: Python

Theory:

Numeric Types:

Integers:
In Python 3, there is effectively no limit to how long an integer value can be. Of course, it is
constrained by the amount of memory your system has.

>>> print(10)
10
>>> type(10)
<class 'int'>


Floating Point Numbers:


The float type in Python designates a floating-point number. float values are specified with a
decimal point. Optionally, the character e or E followed by a positive or negative integer may
be appended to specify scientific notation

>>> 4.2
4.2
>>> .4e7
4000000.0
Complex Numbers

>>> 2+3j
(2+3j)
>>> type(2+3j)
<class 'complex'>

Complex numbers are specified as <real part>+<imaginary part>j.

Strings:
Strings are sequences of character data. The string type in Python is called str. String literals
may be delimited using either single or double quotes. All the characters between the opening
delimiter and matching closing delimiter are part of the string.

A string in Python can contain as many characters as you wish. The only limit is your
machine's memory resources. A string can also be empty.

>>> print("I am a string.")
I am a string.
>>> type("I am a string.")
<class 'str'>


A raw string literal is preceded by r or R, which specifies that escape sequences in the
associated string are not translated. The backslash character is left in the string.

>>> print('foo\nbar')
foo
bar
>>> print(r'foo\nbar')
foo\nbar
>>> print('foo\\bar')
foo\bar

Boolean Type:

Python 3 provides a Boolean data type. Objects of Boolean type may have one of two values,
True or False.

>>> type(True)
<class 'bool'>
>>> type(False)
<class 'bool'>

Python List:

List is an ordered sequence of items. It is one of the most used data types in Python and is very
flexible. All the items in a list do not need to be of the same type. Declaring a list is pretty
straightforward. Items separated by commas are enclosed within brackets [ ]. We can use the
slicing operator [ ] to extract an item or a range of items from a list. Index starts from 0 in
Python.

Lists are mutable, meaning the value of elements of a list can be altered.

>>> a = [1, 2.2, 'python']


Python Tuple:

Tuple is an ordered sequence of items, same as list. The only difference is that tuples are
immutable. Tuples once created cannot be modified. Tuples are used to write-protect data and
are usually faster than lists as they cannot change dynamically. A tuple is defined within
parentheses () where items are separated by commas. We can use the slicing operator [] to
extract items, but we cannot change its value.

>>> t = (5, 'program', 1+3j)

Python Set:

Set is an unordered collection of unique items. Set is defined by values separated by commas
inside braces { }. Items in a set are not ordered. We can perform set operations like union and
intersection on two sets. Sets have unique values; they eliminate duplicates. Since sets are
unordered collections, indexing has no meaning. Hence the slicing operator [] does not work.

>>> a = {1, 2, 2, 3, 3, 3}
>>> a
{1, 2, 3}

Python Dictionary:

Dictionary is an unordered collection of key-value pairs. It is generally used when we have a
huge amount of data. Dictionaries are optimized for retrieving data. We must know the key to
retrieve the value. In Python, dictionaries are defined within braces {} with each item being a
pair in the form key:value. Key and value can be of any type. We use the key to retrieve the
respective value, but not the other way around.

>>> d = {1: 'value', 'key': 2}
>>> type(d)
<class 'dict'>

Operators in Python

Operators are used to perform operations on variables and values. Python divides the
operators into the following groups:

1. Arithmetic operators
Arithmetic operators are used to perform mathematical operations like addition, subtraction,
multiplication and division.


+    Addition: adds two operands                                        x + y
-    Subtraction: subtracts two operands                                x - y
*    Multiplication: multiplies two operands                            x * y
/    Division (float): divides the first operand by the second          x / y
//   Division (floor): divides the first operand by the second          x // y
%    Modulus: returns the remainder when the first operand is
     divided by the second                                              x % y

2. Comparison/Relational operators
Relational operators compare values. They return either True or False
according to the condition.
> Greater than: True if left operand is greater than the right x>y

< Less than: True if left operand is less than the right x < y

== Equal to: True if both operands are equal x == y

!= Not equal to - True if operands are not equal x != y

>= Greater than or equal to: True if left operand is greater than or equal to the
right x >= y

<= Less than or equal to: True if left operand is less than or equal to the right x<=
y

3. Logical operators
Logical operators perform Logical AND, Logical OR and Logical NOT operations.

and Logical AND: True if both the operands are true x and y
or Logical OR: True if either of the operands is true x or y
not Logical NOT: True if operand is false not x
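A few of these operators in action at the Python prompt:

>>> 7 / 2
3.5
>>> 7 // 2
3
>>> 7 % 2
1
>>> (7 > 2) and (7 % 2 == 1)
True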
Control Statements in Python

1. Python Decision Making Statements


2. Python Loop Statements

3. Loop Control Statements
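A small combined sketch showing a decision-making statement, a loop, and a loop-control statement (the list of numbers is illustrative):

numbers = [3, 8, 1, 9, 4]
for n in numbers:              # loop statement
    if n % 2 == 0:             # decision-making statement
        print(n, "is even")
    else:
        print(n, "is odd")
    if n == 9:
        break                  # loop control: stop once 9 is seen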

Executing Python Programs

This section describes the environment in which Python programs are executed: the
runtime behavior of the interpreter, including program startup, configuration, and program
termination.

Anaconda Navigator – Jupyter Notebook

Anaconda is a free and open-source distribution of the Python and R programming languages
for scientific computing, machine learning and data science that aims to simplify package
management and deployment.

The notebook extends the console-based approach to interactive computing in a qualitatively
new direction, providing a web-based application suitable for capturing the whole
computation process: developing, documenting, and executing code, as well as
communicating the results. A notebook kernel is a "computational engine" that executes
the code contained in a notebook document. The IPython kernel, referenced in this guide,
executes Python code.

The Jupyter notebook combines two components:

A web application: a browser-based tool for interactive authoring of documents which
combine explanatory text, mathematics, computations and their rich media output.


Notebook documents: a representation of all content visible in the web application, including
inputs and outputs of the computations, explanatory text, mathematics, images, and rich
media representations of objects.

PyCharm IDE
PyCharm is an integrated development environment (IDE) used in computer programming,
specifically for the Python language. It is developed by JetBrains. It provides code
analysis, a graphical debugger, an integrated unit tester, integration with version control
systems, and supports web development with Django as well as Data Science with Anaconda.

Google COLAB

Colaboratory is a research tool for machine learning education and research. It's a Jupyter
notebook environment that requires no setup to use. Google Colab is a free cloud service,
and it supports a free GPU! You can use it to improve your Python programming skills. To
start working with Colab, you first need to log in to your Google account, then go to
this link: https://colab.research.google.com.

Central Tendency Measures

A measure of central tendency (also referred to as a measure of centre or central location) is a
summary measure that attempts to describe a whole set of data with a single value that
represents the middle or centre of its distribution.

There are three main measures of central tendency: the mode, the median and the mean. Each
of these measures describes a different indication of the typical or central value in the
distribution.

Mean

The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average. Looking at the retirement age
distribution again:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The mean is calculated by adding together all the values
(54+54+54+55+56+57+57+58+58+60+60 = 623) and dividing by the number of observations
(11), which equals 56.6 years.

Median

The median is the middle value in distribution when the values are arranged in ascending or
descending order.


In a distribution with an odd number of observations, the median value is the middle value.

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

Looking at the retirement age distribution (which has 11 observations), the median is the
middle value, which is 57 years.

When the distribution has an even number of observations, the median value is the mean of
the two middle values. In the following distribution,

52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The middle two values are 56 and 57; therefore the median equals 56.5 years.

Mode

The mode is the most commonly occurring value in a distribution. Consider this dataset
showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.

Measure of Dispersion

Measures of spread describe how similar or varied the set of observed values are for a
particular variable (data item). Measures of spread include the range, quartiles and the
interquartile range, variance and standard deviation. The spread of the values can be
measured for quantitative data, as the variables are numeric and can be arranged into a logical
order with a low end value and a high end value.

Variance and Standard Deviation

The variance and the standard deviation are measures of the spread of the data around the
mean. They summarise how close each observed data value is to the mean value.

In datasets with a small spread all values are very close to the mean, resulting in a small
variance and standard deviation. Where a dataset is more dispersed, values are spread further
away from the mean, leading to a larger variance and standard deviation.

The smaller the variance and standard deviation, the more the mean value is indicative of the
whole dataset. Therefore, if all values of a dataset are the same, the standard deviation and
variance are zero.


Programming Steps:

1. Take input from the user (list elements)
2. Find the size of the list
3. Sort the given list in ascending or descending order
4. Calculate the mean, median and mode of the list
5. Calculate the variance and standard deviation

Program:
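A minimal sketch of such a program using the built-in statistics module (the input format and variable names are illustrative):

# Central tendency and dispersion of a user-entered list
import statistics

values = [float(v) for v in input("Enter numbers separated by spaces: ").split()]
values.sort()  # sort in ascending order

print("Size of list :", len(values))
print("Sorted list  :", values)
print("Mean         :", statistics.mean(values))
print("Median       :", statistics.median(values))
print("Mode         :", statistics.mode(values))
print("Variance     :", statistics.variance(values))   # sample variance
print("Std deviation:", statistics.stdev(values))      # sample standard deviation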


Conclusion: -

Signature
Department of Information Technology

Viva Question:

1) What are the measures of central tendency?

2) What is the mode?

3) What is the median?

4) What is the mean?

5) Which is the best measure of central tendency?


Practical No. 5

Aim: To study Python libraries for ML applications such as Pandas and Matplotlib.

Objective:
• To understand data preprocessing and analysis using the Pandas library
• To understand data visualization in the form of 2D graphs and plots using the
Matplotlib library

Software Required: Python 3


Theory: Important Python libraries for ML are listed below.

Python Libraries for Machine Learning

Numpy

Scipy

Scikit-learn

Theano

TensorFlow

Keras

PyTorch

Pandas

Matplotlib

Importance of Pandas library

Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data
structures and data analysis tools for the Python programming language.

Pandas makes importing, analyzing, and visualizing data much easier. It builds on packages
like NumPy and Matplotlib to give you a single, convenient place to do most of your data
analysis and visualization work.


Advantages of Pandas Library

There are many benefits of Python Pandas library, listing them all would probably take more
time than what it takes to learn the library. Therefore, these are the core advantages of using
the Pandas library:

1) Data representation

Pandas provide extremely streamlined forms of data representation. This helps to analyze and
understand data better. Simpler data representation facilitates better results for data science
projects.

2) Less writing and more work done

It is one of the best advantages of Pandas. What would have taken multiple lines in Python
without any support libraries, can simply be achieved through 1-2 lines with the use of
Pandas.

Thus, using Pandas helps to shorten the procedure of handling data. With the time saved, we
can focus more on data analysis algorithms.

3) An extensive set of features

Pandas is really powerful. It provides you with a huge set of important commands and
features which can be used to easily analyze your data. We can use Pandas to perform various
tasks like filtering the data according to certain conditions, or segmenting and segregating
the data according to preference, etc.

4) Efficiently handles large data

Wes McKinney, the creator of Pandas, made the Python library mainly to handle large
datasets efficiently. Pandas helps to save a lot of time by importing large amounts of data very
fast.

5) Makes data flexible and customizable

Pandas provide a huge feature set to apply on the data you have so that you can customize,
edit and pivot it according to your own will and desire. This helps to bring the most out of
your data.

6) Made for Python

Python programming has become one of the most sought after programming languages in the
world, with its extensive amount of features and the sheer amount of productivity it provides.
Therefore, being able to code Pandas in Python enables you to tap into the power of the
various other features and libraries which can be used with Python, such as NumPy, SciPy,
and Matplotlib.

Pandas Library


The two primary components of pandas are the Series and the DataFrame.

A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a
collection of Series.

DataFrames and Series are quite similar in that many operations that you can do with one you
can do with the other, such as filling in null values and calculating the mean.
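A small sketch showing both components (the column values are made up for illustration):

import pandas as pd

# A Series is a single labelled column of values
ages = pd.Series([38, 43, 30], name="Age")

# A DataFrame is a collection of Series sharing the same index
df = pd.DataFrame({
    "Country": ["India", "France", "Germany"],
    "Age": [38, 43, 30],
})
print(df.head())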

Reading data from CSVs

With CSV files all you need is a single line to load in the data:

df = pd.read_csv('purchases.csv')
df

Let's load in the IMDB movies dataset to begin:

movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")

We're loading this dataset from a CSV and designating the movie titles to be our index.

Viewing your data

The first thing to do when opening a new dataset is print out a few rows to keep as a visual
reference. We accomplish this with .head():

movies_df.head()

Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):

movies_df.shape

Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So we
have 1000 rows and 11 columns in our movies DataFrame.

You'll be going to .shape a lot when cleaning and transforming data. For example, you might
filter some rows based on some criteria and then want to know quickly how many rows were
removed.


Handling duplicates

This dataset does not have duplicate rows, but it is always important to verify you aren't
aggregating duplicate rows.

To demonstrate, let's simply just double up our movies DataFrame by appending it to itself:

temp_df = movies_df.append(movies_df)
temp_df.shape

Out: (2000, 11)

Using append() will return a copy without affecting the original DataFrame. We are capturing
this copy in temp_df so we aren't working with the real data.

Notice that calling .shape quickly proves our DataFrame rows have doubled. Now we can try
dropping duplicates:

temp_df = temp_df.drop_duplicates()
temp_df.shape

Out: (1000, 11)

Just like append(), the drop_duplicates() method will also return a copy of your DataFrame,
but this time with duplicates removed. Calling .shape confirms we're back to the 1000 rows of
our original dataset.

It's a little verbose to keep assigning DataFrames to the same variable like in this example.
For this reason, pandas has the inplace keyword argument on many of its methods. Using
inplace=True will modify the DataFrame object in place:

temp_df.drop_duplicates(inplace=True)

Now our temp_df will have the transformed data automatically.

Another important argument for drop_duplicates() is keep, which has three possible options:

first: (default) Drop duplicates except for the first occurrence.

last: Drop duplicates except for the last occurrence.

False: Drop all duplicates.


Since we didn't define the keep argument in the previous example, it defaulted to first.
This means that if two rows are the same, pandas will drop the second row and keep the first
row. Using last has the opposite effect: the first row is dropped.

keep=False, on the other hand, will drop all duplicates. If two rows are the same then both will be
dropped. Watch what happens to temp_df:

temp_df = movies_df.append(movies_df) # make a new copy


temp_df.drop_duplicates(inplace=True, keep=False)

temp_df.shape

Out:
(0, 11)

Since all rows were duplicates, keep=False dropped them all resulting in zero rows being left
over. If you're wondering why you would want to do this, one reason is that it allows you to
locate all duplicates in your dataset. When conditional selections are shown below you'll see
how to do that.

Column cleanup

Many times datasets will have verbose column names with symbols, upper and lowercase
words, spaces, and typos. To make selecting data by column name easier we can spend a little
time cleaning up their names.

Here's how to print the column names of our dataset:

movies_df.columns

Out:

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime (Minutes)', 'Rating',


'Votes', 'Revenue (Millions)', 'Metascore'],

dtype='object')

Not only does .columns come in handy if you want to rename columns by allowing for simple
copy and paste, it's also useful if you need to understand why you are receiving a KeyError
when selecting data by column.

We can use the .rename() method to rename certain or all columns via a dict. We don't want
parentheses, so let's rename those:

movies_df.rename(columns={


'Runtime (Minutes)': 'Runtime',

'Revenue (Millions)': 'Revenue_millions'

}, inplace=True)

movies_df.columns

Out:

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',

'Rating', 'Votes', 'Revenue_millions', 'Metascore'], dtype='object')

Excellent. But what if we want to lowercase all names? Instead of using .rename() we could
also set a list of names to the columns like so:

movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',

'rating', 'votes', 'revenue_millions', 'metascore']

movies_df.columns

Out:

Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',

'rating', 'votes', 'revenue_millions', 'metascore'], dtype='object')

But that's too much work. Instead of just renaming each column manually we can do a list
comprehension:

movies_df.columns = [col.lower() for col in movies_df]
movies_df.columns

Out:

Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',

'rating', 'votes', 'revenue_millions', 'metascore'], dtype='object')

List (and dict) comprehensions come in handy a lot when working with pandas and data in general.

It's a good idea to lowercase, remove special characters, and replace spaces with underscores
if you'll be working with a dataset for some time.
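As a sketch of that full cleanup (assuming the original column names with spaces and parentheses), the vectorized string methods on the columns Index do all three steps at once:

# Lowercase, drop parentheses, and replace spaces with underscores
movies_df.columns = (movies_df.columns
                     .str.lower()
                     .str.replace('(', '', regex=False)
                     .str.replace(')', '', regex=False)
                     .str.replace(' ', '_'))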

Importance of Matplotlib library

To make the necessary statistical inferences, it becomes necessary to visualize your data, and Matplotlib is one such solution for Python users. It is a very powerful plotting library useful for those working with Python and NumPy. The most used module of Matplotlib is Pyplot, which provides a MATLAB-like interface and is open source.

General Concepts:

A Matplotlib figure can be categorized into several parts as below:

Figure: It is a whole figure which may contain one or more than one axes (plots). You can
think of a Figure as a canvas which contains plots.

Axes: It is what we generally think of as a plot. A Figure can contain many Axes. It contains
two or three (in the case of 3D) Axis objects. Each Axes has a title, an x-label and a y-label.

Axis: They are the number-line-like objects that take care of generating the graph limits.

Artist: Everything which one can see on the figure is an artist like Text objects, Line2D
objects, collection objects. Most Artists are tied to Axes.

Matplotlib Library

Pyplot is a module of Matplotlib which provides simple functions to add plot elements like
lines, images, text, etc. to the current axes in the current figure.

Make a simple plot

import matplotlib.pyplot as plt
import numpy as np

Here is a list of all the methods, in the order they appear:

plot(x-axis values, y-axis values) — plots a simple line graph with x-axis values against y-
axis values

show() — displays the graph

title("string") — set the title of the plot as specified by the string

xlabel("string") — set the label for the x-axis as specified by the string

ylabel("string") — set the label for the y-axis as specified by the string

figure() — used to control figure-level attributes

subplot(nrows, ncols, index) — add a subplot to the current figure

suptitle("string") — adds a common title to the figure, specified by the string

subplots(nrows, ncols, figsize) — a convenient way to create a figure and subplots in a single call. It returns a tuple of a Figure and its Axes object(s).

set_title("string") — an axes-level method used to set the title of subplots in a figure


bar(categorical variables, values, color) — used to create vertical bar graphs

barh(categorical variables, values, color) — used to create horizontal bar graphs

legend(loc) — used to make legend of the graph

xticks(index, categorical variables) — Get or set the current tick locations and labels of the x-
axis

pie(value, categorical variables) — used to create a pie chart

hist(values, number of bins) — used to create a histogram

xlim(start value, end value) — used to set the limit of values of the x-axis

ylim(start value, end value) — used to set the limit of values of the y-axis

scatter(x-axis values, y-axis values) — plots a scatter plot with x-axis values against y-axis
values

axes() — adds an axes to the current figure

set_xlabel("string") — axes-level method used to set the x-label of the plot, specified as a string

set_ylabel("string") — axes-level method used to set the y-label of the plot, specified as a string

scatter3D(x-axis values, y-axis values) — plots a three-dimensional scatter plot with x-axis
values against y-axis values

plot3D(x-axis values, y-axis values) — plots a three-dimensional line graph with x- axis
values against y-axis values

Here we import Matplotlib's Pyplot module and the NumPy library, as most of the data we will be working with will be in the form of arrays.


We pass two arrays as input arguments to Pyplot's plot() method and use the show() method to display the plot. Note that the first array appears on the x-axis and the second array appears on the y-axis. Once our first plot is ready, we can add a title and name the x-axis and y-axis using the methods title(), xlabel() and ylabel() respectively.
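A minimal sketch of such a plot, continuing from the imports above and using made-up sample arrays:

x = np.array([1, 2, 3, 4, 5])     # appears on the x-axis
y = np.array([2, 4, 6, 8, 10])    # appears on the y-axis

plt.plot(x, y)                    # draw the line graph
plt.title("Simple Line Plot")     # set the plot title
plt.xlabel("x values")            # label the x-axis
plt.ylabel("y values")            # label the y-axis
plt.show()                        # display the figure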

We can also specify the size of the figure using the figure() method, passing a tuple of width and height (in inches) to the figsize argument, as in the sketch below.
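For example (sizes chosen arbitrarily, reusing x and y from the sketch above):

plt.figure(figsize=(8, 4))        # width 8 inches, height 4 inches
plt.plot(x, y)
plt.show()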


With every X and Y argument, you can also pass an optional third argument in the form of a string which indicates the colour and line type of the plot. The default format is 'b-', which means a solid blue line. Below we use 'go', which means green circles. Likewise, we can make many such combinations to format our plot.

We can also plot multiple sets of data by passing in multiple pairs of X and Y arguments to the plot() method, as shown in the sketch below.
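A short sketch of both ideas, with arbitrary data:

x = np.arange(1, 6)

plt.plot(x, x * 2, 'go')          # 'go': green circle markers
plt.show()

plt.plot(x, x * 2, 'b-', x, x ** 2, 'r--')   # two datasets in one call
plt.show()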


Multiple plots in one figure:

We can use the subplot() method to add more than one plot to a figure. Here we use it to separate two graphs that we previously plotted on the same axes. The subplot() method takes three arguments: nrows, ncols and index, indicating the number of rows, the number of columns and the index number of the sub-plot. For instance, to create two sub-plots in one figure arranged in one row and two columns, we pass the arguments (1, 2, 1) and (1, 2, 2) to the subplot() method. Note that we use the title() method separately for both subplots, and the suptitle() method to make a centralized title for the figure.

If we want our sub-plots in two rows and a single column, we can pass the arguments (2, 1, 1) and (2, 1, 2) instead; the sketch below shows the one-row, two-column case.
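A minimal sketch of the one-row, two-column layout described above (data arbitrary):

x = np.arange(1, 6)

plt.subplot(1, 2, 1)              # 1 row, 2 columns, first sub-plot
plt.plot(x, x * 2)
plt.title("Doubled")

plt.subplot(1, 2, 2)              # 1 row, 2 columns, second sub-plot
plt.plot(x, x ** 2)
plt.title("Squared")

plt.suptitle("Two Sub-plots in One Figure")
plt.show()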


The above way of creating subplots becomes a bit tedious when we want many subplots in our figure. A more convenient way is to use the subplots() method; notice the extra 's' in the method name. This method takes two arguments, nrows and ncols, for the number of rows and columns respectively. It creates and returns two objects, a figure and axes, which we store in the variables fig and ax and use to change figure-level and axes-level attributes respectively. Note that these variable names are chosen arbitrarily. A sketch follows:
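A sketch of the same layout with subplots() (again with arbitrary data):

x = np.arange(1, 6)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
ax[0].plot(x, x * 2)              # axes-level plotting on the first Axes
ax[0].set_title("Doubled")
ax[1].plot(x, x ** 2)
ax[1].set_title("Squared")
fig.suptitle("Two Sub-plots with subplots()")
plt.show()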


Program:


Conclusion:-

Signature
Department of Information Technology

Viva Question:

1) What are the different Python libraries?

2) What is the use of SciPy?

3) What is the difference between lists and tuples in Python?

4) What type of language is Python?

5) What are local and global variables in Python?


Practical No: 6

Aim: Implement the FIND-S algorithm for finding the most specific hypothesis based on a
given set of training data samples. Read the training data from a .CSV file.

Create a CSV file named Weather.csv and save it in the same path as the program:

Time Weather Temperature Company Humidity Wind Goes

Morning Sunny Warm Yes Mild Strong Yes

Evening Rainy Cold No Mild Normal No

Morning Sunny Moderate Yes Normal Normal Yes

Evening Sunny Cold Yes High Strong Yes

Software required: Python 3.0, Anaconda.

Theory: In Machine Learning, concept learning can be termed as "a problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples" – Tom Mitchell. Here we will go through one such concept learning algorithm known as the Find-S algorithm. The Find-S algorithm is a basic concept learning algorithm in machine learning. It finds the most specific hypothesis that fits all the positive examples; note that the algorithm considers only the positive training examples. The Find-S algorithm starts with the most specific hypothesis and generalizes it each time it fails to classify an observed positive training example. Hence, the Find-S algorithm moves from the most specific hypothesis to the most general hypothesis.

In order to understand Find-S algorithm, you need to have a basic idea of the following
concepts as well:

Concept Learning

General Hypothesis

Specific Hypothesis

1. Concept Learning

Let us understand concept learning with a real-life example. Most human learning is based on past instances or experiences. For example, we are able to identify any type of vehicle based on a certain set of features like make, model, etc., that are defined over a large set of features.


These special features differentiate the set of cars, trucks, etc from the larger set of vehicles.
These features that define the set of cars, trucks, etc are known as concepts.

Similar to this, machines can also learn from concepts to identify whether an object belongs
to a specific category or not. Any algorithm that supports concept learning requires the
following:

• Training Data
• Target Concept
• Actual Data Objects

2. General Hypothesis

A hypothesis, in general, is an explanation for something. The general hypothesis basically states the general relationship between the major variables. For example, a general hypothesis for ordering food would be "I want a burger".

G = { ‘?’, ‘?’, ‘?’, …..’?’}

3. Specific Hypothesis

The specific hypothesis fills in all the important details about the variables given in the general hypothesis. A more specific version of the example given above would be "I want a cheeseburger with chicken pepperoni filling and a lot of lettuce".

S = {‘Φ’,’Φ’,’Φ’, ……,’Φ’}

The Find-S algorithm follows the steps written below:

1. Initialize 'h' to the most specific hypothesis.

2. The Find-S algorithm considers only the positive examples and eliminates the negative ones. For each positive example, the algorithm checks each attribute in the example. If the attribute value is the same as the hypothesis value, the algorithm moves on without any changes. But if the attribute value is different from the hypothesis value, the algorithm changes it to '?'.

Now that we are done with the basic explanation of the Find-S algorithm, let us take a look at
how it works.

Looking at the data set, we have six attributes and a final attribute that defines the positive or
negative example. In this case, yes is a positive example, which means the person will go for
a walk.

So now, the initial hypothesis, taken from the first positive example, is:

h0 = {'Morning', 'Sunny', 'Warm', 'Yes', 'Mild', 'Strong'}


This is our starting hypothesis; now we consider each remaining positive example one by one:

h1= {‘Morning’, ‘Sunny’, ‘?’, ‘Yes’, ‘?’, ‘?’}

h2 = {‘?’, ‘Sunny’, ‘?’, ‘Yes’, ‘?’, ‘?’}

Program:
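The program page is left for the student's own implementation. A minimal sketch, assuming the Weather.csv file shown above (with its header row, and the last column Goes as the target), could look like this:

import csv

# Read the training data from the CSV file
with open('Weather.csv') as f:
    reader = csv.reader(f)
    header = next(reader)                  # skip the header row
    rows = list(reader)

hypothesis = None
for row in rows:
    attributes, label = row[:-1], row[-1]
    if label.strip().lower() != 'yes':     # Find-S ignores negative examples
        continue
    if hypothesis is None:
        hypothesis = list(attributes)      # initialize with the first positive example
    else:
        # Generalize: keep matching attribute values, replace mismatches with '?'
        hypothesis = [h if h == a else '?'
                      for h, a in zip(hypothesis, attributes)]

print("Most specific hypothesis:", hypothesis)

On the sample data this prints ['?', 'Sunny', '?', 'Yes', '?', '?'], which matches h2 derived above.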


Output:


Conclusion: -

Signature
Department of Information Technology

Viva Question:

1) What is the purpose of the Find-S algorithm?

2) What is a hypothesis?

3) What is the basic idea of the Find-S algorithm?

4) Explain the general hypothesis.

5) Explain the specific hypothesis.


Practical No: 7

Aim: Write a Python program to implement Simple Linear Regression

Objective:
To understand the concept of simple linear regression
To apply simple linear regression on actual dataset to do prediction

Software required: Python 3.0, Anaconda

Theory:

Types of Learning

A machine is said to be learning from past experiences (data fed in) with respect to some class of tasks, if its performance in a given task improves with the experience.

1) Supervised Learning

How it works: This algorithm consists of a target/outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.

2) Unsupervised Learning

How it works: In this algorithm, we do not have any target or outcome variable to predict/estimate. It is used for clustering a population into different groups, which is widely used for segmenting customers into different groups for specific interventions. Examples of Unsupervised Learning: Apriori algorithm, K-means.


3) Reinforcement Learning:
How it works: Using this algorithm, the machine is trained to make specific decisions. It
works this way: the machine is exposed to an environment where it trains itself continually
using trial and error. This machine learns from past experience and tries to capture the best
possible knowledge to make accurate business decisions. Example of Reinforcement
Learning: Markov Decision Process

4) Semi-Supervised Learning

In this type of learning, the algorithm is trained upon a combination of labeled and unlabeled
data. Typically, this combination will contain a very small amount of labeled data and a very
large amount of unlabeled data. The basic procedure involved is that first, the programmer
will cluster similar data using an unsupervised learning algorithm and then use the existing
labeled data to label the rest of the unlabeled data.

o Supervised Learning – Train Me!
o Unsupervised Learning – I am self-sufficient in learning
o Reinforcement Learning – My life, My rules! (Hit & Trial)


Types of Supervised Learning

Classification: It is a supervised learning task where the output has defined labels (discrete values). For example, an output such as Purchased has defined labels, i.e. 0 or 1; 1 means the customer will purchase and 0 means the customer won't purchase. The goal here is to predict discrete values belonging to a particular class and evaluate on the basis of accuracy. Classification can be either binary or multi-class. In binary classification, the model predicts either 0 or 1 (yes or no), but in the case of multi-class classification, the model predicts more than one class.

Example: Gmail classifies mails into more than one class, like social, promotions, updates and forum.

Regression: It is a supervised learning task where the output has a continuous value. For example, an output such as Wind Speed does not have discrete values but is continuous within a particular range. The goal here is to predict a value as close to the actual output value as our model can, and evaluation is done by calculating the error value. The smaller the error, the greater the accuracy of our regression model.

Example of Supervised Learning Algorithms:

• Linear Regression
• Nearest Neighbor
• Gaussian Naive Bayes
• Decision Trees
• Support Vector Machine (SVM)
• Random Forest

Regression Analysis

Regression analysis is a powerful statistical method that allows you to examine the
relationship between two or more variables of interest. While there are many types of
regression analysis, at their core they all examine the influence of one or more independent
variables on a dependent variable.


In order to understand regression analysis fully, it's essential to comprehend the following terms:

Dependent Variable: This is the main factor that you're trying to understand or predict.

Independent Variables: These are the factors that you hypothesize have an impact on your
dependent variable.

Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent (target) variable and independent (predictor) variable(s). This technique is used for forecasting, time series modelling and finding the causal-effect relationship between variables. For example, the relationship between rash driving and the number of road accidents by a driver is best studied through regression. Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve/line to the data points in such a manner that the differences between the distances of the data points from the curve or line are minimized.

There are multiple benefits of using regression analysis. They are as follows:

It indicates the significant relationships between the dependent variable and the independent variables.

It indicates the strength of impact of multiple independent variables on a dependent variable.

Regression analysis also allows us to compare the effects of variables measured on different
scales, such as the effect of price changes and the number of promotional activities. These
benefits help market researchers / data analysts / data scientists to eliminate and evaluate the
best set of variables to be used for building predictive models.

There are various kinds of regression techniques available to make predictions. These
techniques are mostly driven by three metrics (number of independent variables, type of
dependent variables and shape of regression line).


Types of Regression:

• Linear Regression
• Logistic Regression
• Polynomial Regression
• Stepwise Regression
• Ridge Regression

Linear Regression

It is one of the most widely known modeling techniques. Linear regression is usually among the first few topics people pick while learning predictive modeling. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression line is linear.

Linear Regression establishes a relationship between dependent variable (Y) and one or more
independent variables (X) using a best fit straight line (also known as regression line).

It is represented by an equation Y = a + b*X + e, where a is the intercept, b is the slope of the line and e is the error term. This equation can be used to predict the value of the target variable based on the given predictor variable(s).

It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on continuous variable(s). Here, we establish the relationship between the independent and dependent variables by fitting a best-fit line. This best-fit line is known as the regression line and is represented by the linear equation

Y = a*X + b.

The best way to understand linear regression is to relive this experience of childhood. Let us
say, you ask a child in fifth grade to arrange people in his class by increasing order of weight,
without asking them their weights! What do you think the child will do? He / she would
likely look (visually analyze) at the height and build of people and arrange them using a
combination of these visible parameters. This is linear regression in real life! The child has
actually figured out that height and build would be correlated to the weight by a relationship,
which looks like the equation above.

In this equation:

Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept

The coefficients a and b are derived by minimizing the sum of the squared differences between the data points and the regression line.


Consider an example where we have identified the best-fit line with the linear equation y = 0.2811x + 13.9. Using this equation, we can find the weight of a person if we know their height.
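For instance, plugging in a height of 160 gives a predicted weight of 0.2811 × 160 + 13.9 ≈ 58.9.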

Types of Linear Regression

Linear regression is mainly of two types: simple linear regression and multiple linear regression. Simple linear regression is characterized by one independent variable, whereas multiple linear regression (as the name suggests) is characterized by multiple (more than one) independent variables. While finding the best-fit line, you can also fit a polynomial or curvilinear function; this is known as polynomial or curvilinear regression.

Simple Linear Regression

In simple linear regression, each observation consists of two values: one for the dependent variable and one for the independent variable.

Simple Linear Regression Analysis: The simplest form of regression analysis uses one dependent variable (y) and one independent variable (x). In this simple model, a straight line approximates the relationship between the dependent variable and the independent variable.

The simple linear regression equation is represented like this: E(y) = β0 + β1x. It is graphed as a straight line. (β0 is the y-intercept of the regression line; β1 is the slope.) E(y) is the mean or expected value of y for a given value of x.

A regression line can show a positive linear relationship, a negative linear relationship, or no relationship. If the graphed line in a simple linear regression is flat (not sloped), there is no relationship between the two variables. If the line slopes upward, with its lower end at the y-intercept and its upper end extending up and away from the x-axis, a positive linear relationship exists. If the line slopes downward, with its upper end at the y-intercept and its lower end extending down toward the x-axis, a negative linear relationship exists.

Formulae to calculate the B0 and B1 coefficients

With simple linear regression we want to model our data as follows:

y = B0 + B1 * x

This is a line where y is the output variable we want to predict, x is the input variable we
know and B0 and B1 are coefficients that we need to estimate that move the line around.

Technically, B0 is called the intercept because it determines where the line intercepts the y-
axis. In machine learning we can call this the bias, because it is added to offset all predictions
that we make. The B1 term is called the slope because it defines the slope of the line or how x
translates into a y value before we add our bias.

The goal is to find the best estimates for the coefficients to minimize the errors in predicting
y from x.

Simple regression is great, because rather than having to search for values by trial and error
or calculate them analytically using more advanced linear algebra, we can estimate them
directly from our data.
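Using this notation, the standard least-squares estimates (the formulae promised above) are:

B1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

B0 = ȳ − B1 × x̄

where x̄ and ȳ are the means of the x and y values respectively.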

Program:
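The program page is left for the student's own implementation. A minimal sketch, using a made-up toy dataset and scikit-learn's LinearRegression (the actual lab dataset may differ), could look like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: hours studied vs. exam score (illustrative values only)
X = np.array([[1], [2], [3], [4], [5]])    # independent variable, 2-D as sklearn expects
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])   # dependent variable

model = LinearRegression()
model.fit(X, y)                            # estimates B0 and B1 by least squares

print("Slope (B1):", model.coef_[0])
print("Intercept (B0):", model.intercept_)
print("Prediction for x = 6:", model.predict([[6]])[0])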


Output:

Conclusion: -

Signature
Department of Information Technology

Viva Question:
1) What is linear regression?
2) What is the use of regularisation?
3) How to choose the value of the parameter learning rate (α)?


Practical No: 8

Aim: To create and evaluate a Naïve Bayes classification model for the Iris dataset.

Software required: Python 3.0, Anaconda, Jupyter

Theory:

Classification Algorithms:

A classification algorithm is a type of supervised machine learning technique. Classification algorithms are used to predict the category of new data based on the training data; they classify new observations into different groups. In other words, the output variable must be divisible into two or more classes, like Yes/No, Male/Female, Spam/Not Spam, etc.

Various kinds of classification algorithms are:

- Random Forest
- Decision Trees
- Naïve Bayes Classifier
- Logistic Regression
- Support Vector Machine
In this practical we will be focusing on the Naïve Bayes classifier.

Naïve Bayes Classification:

This classification algorithm is based on applying Bayes' Theorem with strong (naïve) independence assumptions between the features. Since it is based on Bayes' theorem, it is a probabilistic classifier, which means that it predicts on the basis of the probability of an object.

Naïve Bayes algorithm is one of the simplest and most effective classification algorithms that
can produce fast models with high accuracy.

If we break down the name of the model:

1) Naive in statistics means assuming that the occurrence of a certain feature is independent of the occurrence of other features.
2) Bayes refers to the principles of Bayes' Theorem.
Bayes Theorem:

In probability theory, Bayes theorem is used to describe the probability of an event based on
prior knowledge of conditions that might be related to the event. It depends on the concept of
conditional probability.

The formula is given by:

Department of Information Technology, PRMIT&R, Badnera Page 67


Machine Learning Lab [7IT07]

P(A|B) = P(B|A) × P(A) / P(B)

Where,

P(A|B) is Posterior Probability or probability of event A occurring when event B is true.

P(B|A) is Likelihood Probability or probability of event B occurring given the event A is true.

P(A) and P(B) are the probabilities of A and B individually without any conditions.
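As a quick worked example (with invented numbers): suppose 20% of emails are spam, so P(A) = 0.2; the word "offer" appears in 60% of spam emails, so P(B|A) = 0.6; and "offer" appears in 15% of all emails, so P(B) = 0.15. Then P(spam | "offer") = (0.6 × 0.2) / 0.15 = 0.8.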

Advantages of Naïve Bayes Classifier:

- It is one of the fastest and easiest machine learning models for classification problems.
- It can be used for binary as well as multi-class classification.
Disadvantages of Naïve Bayes Classifier:

- It disregards the relations between features, as it considers each feature to be independent.
Applications of Naïve Bayes Classifier:

- Text classification like: Sentiment Analysis and Spam Filtering


- Medical data classification
Types of Naïve Bayes Model:

- Gaussian: This model assumes that the features follow a Gaussian or normal distribution.
- Multinomial: This is used when the data is multinomially distributed. It is primarily used for document classification problems, etc.
- Bernoulli: It works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables.
PROGRAM:

#importing all packages
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

#Loading the Iris dataset using the 'read_csv()' function from the pandas library


#The read_csv() function reads the contents of a dataset stored in .csv (comma separated values) format

data = pd.read_csv("Iris.csv")

#Checking if the dataset was read properly

print(data.head())

#Our prediction target or dependent variable is the 'Species' column

#Separating the dependent variable and the independent variables using dataframe slicing
techniques from pandas library

X = data.iloc[:,1:5].values #Set of independent variables

y = data['Species'].values #Dependent variable

#Splitting the dataset into Training Set and Testing Set

#Using the 'train_test_split' function from the sklearn library

#The first parameter is the set of independent variables, the second parameter is the
dependent variable

#The 'test_size' parameter determines the size of the testing dataset. Here we will be using
20% of the dataset for testing

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

#Model creation

#Creating an instance for Naive Bayes Classifier model

classifier = GaussianNB()

#fitting the training set into the model

classifier.fit(X_train, y_train)


#Predicting test results

y_predictions = classifier.predict(X_test)

#Printing the prediction results

print(y_predictions)

#Creating a confusion matrix to check the accuracy of our model

confusion_mat = confusion_matrix(y_test, y_predictions)

#Evaluating the accuracy score

accuracy = accuracy_score(y_test, y_predictions)

#Printing the accuracy score and confusion matrix

print("Accuracy: ", accuracy)

print("Confusion matrix:\n", confusion_mat)

OUTPUT:


CONCLUSION:

Signature
Department of Information Technology

Viva Questions

1) What mathematical concept is Naive Bayes based on?

2) What are the different types of Naive Bayes classifiers?

3) Is Naive Bayes a classification algorithm or a regression algorithm?

4) What are some benefits of Naive Bayes?

5) What are the cons of the Naive Bayes classifier?
