7IT07
Practical Record
Semester VII
(Subject code: 7IT07)
Subject: Machine Learning Lab
(Academic Year: 2022-2023)
CERTIFICATE
6. List of Experiments
a. Title/Aim
b. Apparatus/Components used
c. Theory
d. Procedure/Steps
e. Observations
f. Output/Result
g. Conclusion
h. Viva Questions
Machine Learning Lab [7IT07]
VISION
To become a pace-setting
Centre of excellence believing in three
Universal values namely
Synergy, Trust and Passion,
with zeal to serve the Nation
in the global scenario
MISSION
M1: To achieve the highest standard in technical education through state-of-the-art
pedagogy and enhanced industry-institute linkages.
M2: To inculcate the culture of research in core and emerging areas.
M3: To strive for overall development of students so as to nurture ingenious technocrats as
well as responsible citizens.
VISION
Attaining growing needs of industry and society through
Information Technology with ethical values.
MISSION
QUALITY POLICY
“Striving for Excellence in the Quality Engineering Education”
UNIT 6: Multilayer Artificial Neural Network and Deep Learning: Modeling complex
functions with artificial neural networks, Classifying handwritten digits, Training an artificial
neural network, About the convergence in neural networks, A few last words about the neural
network implementation, Parallelizing Neural Network Training with TensorFlow, TensorFlow
and training performance.
Text Books
1. Sebastian Raschka and Vahid Mirjalili, "Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow".
Reference books:
1. Andriy Burkov, "The Hundred-Page Machine Learning Book".
2. Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems".
3. Andreas C. Müller and Sarah Guido, "Introduction to Machine Learning with Python: A Guide for Data Scientists".
4. Chris Albon, "Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning".
Roll no. | A (5 marks) | C (5 marks) | I (5 marks) | P (5 marks) | V (5 marks) | Total (25 marks)
6. List of Experiments
Practical No. 1
Aim: To install Python with the Anaconda distribution and a compatible IDE.
Theory: In order to use Python for machine learning, we need to install it on our computer
system along with a compatible Integrated Development Environment (IDE).
The steps below show how to download and install Anaconda and the IDE:
To download Anaconda, first open your favorite browser, search for "Download Anaconda
Python", and click on the first link, as shown in the image below. Alternatively, you can
download it directly from https://www.anaconda.com/distribution/#download-section.
After clicking on the first link, you will reach the Anaconda download page, as shown in
the image below:
Since Anaconda is available for Windows, Linux, and macOS, you can download it as per
your OS type by clicking on the available options shown in the image below. It offers both
Python 2.7 and Python 3.7 versions; since 3.7 is the newer of the two, we will download the
Python 3.7 version. After clicking on the download option, the download will start on your
computer.
Once the download completes, go to Downloads and double-click on the Anaconda ".exe"
file (Anaconda3-2019.03-Windows-x86_64.exe). It will open a setup window for the
Anaconda installation, as shown in the image below; then click Next.
A License Agreement window will open; click "I Agree" and move further.
In the next window, you will get two installation options, as shown in the image below.
Select the first option (Just Me) and click Next.
Now you will get a window for the installation location; you can leave it as the default or
change it by browsing to a location, and then click Next. Consider the image below:
Once the installation is complete, tick the checkbox if you want to learn more about
Anaconda and Anaconda Cloud, and click Finish to end the process.
Note: Here, we will use the Spyder IDE to run Python programs.
o After opening Anaconda Navigator, launch the Spyder IDE by clicking on the Launch
button given below Spyder. (On first use, Navigator may need to install Spyder before it
can be launched.)
Write your first program and save it with the .py extension.
You can check the program's output in the console pane at the bottom right.
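To verify the setup, a minimal first script might look like the sketch below (the file name hello.py is just an example):

```python
# hello.py - a minimal first program to check that Python runs correctly
message = "Hello, Machine Learning Lab!"
print(message)
print(2 + 3)  # the interpreter evaluates expressions: prints 5
```

Running this from Spyder should print the greeting and the number 5 in the console pane.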
Conclusion:
Viva Question:
Signature
Practical No. 2
Aim: To study datasets and the sources of freely available datasets for machine learning.
Theory:
A dataset is a collection of data in which the data is arranged in some order. A dataset can
contain anything from an array to a database table. The table below shows an example of a
dataset (the blank cell is a missing value):

Country | Age | Salary | Purchased
India   | 38  | 48000  | No
Germany | 30  | 54000  | No
France  | 48  | 65000  | No
Germany | 40  |        | Yes
A tabular dataset can be understood as a database table or matrix, where each column
corresponds to a particular variable and each row corresponds to a record of the
dataset. The most widely supported file type for a tabular dataset is the "Comma-Separated
Values" file, or CSV. But to store tree-like data, a JSON file is more efficient.
Note: A real-world dataset is of huge size, which is difficult to manage and process at the
initial level. Therefore, to practice machine learning algorithms, we can use any dummy
dataset.
Need of Dataset
To work with machine learning projects, we need a huge amount of data, because, without
the data, one cannot train ML/AI models. Collecting and preparing the dataset is one of the
most crucial parts while creating an ML/AI project.
The technology applied behind any ML project cannot work properly if the dataset is not
well prepared and pre-processed.
During the development of the ML project, the developers completely rely on the datasets. In
building ML applications, datasets are divided into two parts:
o Training dataset: the part of the data used to fit (train) the model.
o Test dataset: the part of the data held back to evaluate how well the trained model
generalizes to unseen examples.
Below is a list of datasets which are freely available for the public to work on:
1. Kaggle Datasets
Kaggle is one of the best sources for providing datasets for Data Scientists and Machine
Learners. It allows users to find, download, and publish datasets in an easy way. It also
provides the opportunity to work with other machine learning engineers and solve difficult
Data Science related tasks.
Kaggle provides high-quality datasets in different formats that we can easily find and
download.
2. UCI Machine Learning Repository
The UCI Machine Learning Repository is one of the great sources of machine learning datasets.
This repository contains databases, domain theories, and data generators that are widely used
by the machine learning community for the analysis of ML algorithms.
Since 1987, it has been widely used by students, professors, and researchers as a primary
source of machine learning datasets.
It classifies the datasets as per the problems and tasks of machine learning such
as Regression, Classification, Clustering, etc. It also contains some of the popular datasets
such as the Iris dataset, Car Evaluation dataset, Poker Hand dataset, etc.
3. Datasets via AWS
We can search, download, access, and share the datasets that are publicly available via AWS
resources. These datasets can be accessed through AWS resources but are provided and
maintained by different government organizations, researchers, businesses, or individuals.
Anyone can analyze and build various services using shared data via AWS resources. The
shared datasets on the cloud help users to spend more time on data analysis rather than on
data acquisition.
This source provides the various types of datasets with examples and ways to use the dataset.
It also provides the search box using which we can search for the required dataset. Anyone
can add any dataset or example to the Registry of Open Data on AWS.
5. Microsoft Datasets
Microsoft has launched the "Microsoft Research Open Data" repository with a
collection of free datasets in various areas such as natural language processing, computer
vision, and domain-specific sciences.
Using this resource, we can download the datasets to use on the current device, or we can
also directly use it on the cloud infrastructure.
6. Awesome Public Dataset Collection
The Awesome public dataset collection provides high-quality datasets that are arranged in a
well-organized list according to topics such as Agriculture, Biology, Climate, Complex
networks, etc. Most of the datasets are available for free, but some may not be, so it is
better to check the license before downloading a dataset.
7. Government Datasets
There are different sources to get government-related data. Various countries publish
government data, collected from their different departments, for public use.
The goal of providing these datasets is to increase transparency of government work among
the people and to use the data in an innovative approach. Below are some links of
government datasets:
8. Visual Data
Visual Data provides a number of great datasets that are specific to computer vision tasks
such as Image Classification, Video Classification, Image Segmentation, etc.
Therefore, if you want to build a project on deep learning or image processing, then you can
refer to this source.
9. Scikit-learn dataset
Scikit-learn is a great source for machine learning enthusiasts. This source provides both toy
and real-world datasets. These datasets can be obtained from the sklearn.datasets package
and accessed using the general dataset API.
The toy datasets available in scikit-learn can be loaded using predefined functions such
as load_boston([return_X_y]), load_iris([return_X_y]), etc., rather than importing files
from external sources. But these datasets are not suitable for real-world projects.
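A minimal sketch of loading one of these toy datasets (assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris

# load_iris with return_X_y=True returns the feature matrix and target vector
X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4): 150 flowers, 4 measurements each
print(y.shape)  # (150,): one class label (0, 1, or 2) per flower
```

The same pattern works for the other loaders in sklearn.datasets.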
Conclusion:
Viva Question:
Signature
Practical No. 3
Aim: To study Data Preprocessing in Machine Learning
Theory:
Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.
When creating a machine learning project, we do not always come across clean and
formatted data. And while doing any operation with data, it is mandatory to clean it and put
it in a formatted way. For this, we use the data preprocessing task.
Real-world data generally contains noise and missing values, and may be in an unusable
format which cannot be directly used for machine learning models. Data preprocessing is
the required task for cleaning the data and making it suitable for a machine learning model,
which also increases the accuracy and efficiency of the model.
To create a machine learning model, the first thing we require is a dataset, as a machine
learning model completely works on data. The collected data for a particular problem in a
proper format is known as the dataset.
Datasets may be of different formats for different purposes; for example, the dataset for a
business-purpose model will be different from the dataset required for a liver-patient model.
So each dataset is different from another dataset. To use the dataset in our code, we usually
put it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.
CSV stands for "Comma-Separated Values"; it is a file format which allows us to save
tabular data, such as spreadsheets. It is useful for huge datasets, and such datasets can be
used directly in programs.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data preprocessing, which are:
Numpy: The Numpy Python library is used for including any type of mathematical
operation in the code. It is the fundamental package for scientific computation in
Python. It also adds support for large, multidimensional arrays and matrices. In
Python, we can import it as:
import numpy as np
Here we have used np, which is a short name for Numpy, and it will be used in the whole
program.
Pandas: The Pandas library is one of the most famous Python libraries, used for
importing and managing datasets. It is an open-source data manipulation and
analysis library. It is imported as below:
import pandas as pd
Here, we have used pd as a short name for this library. Consider the below image:
Now we need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as a
working directory. To set a working directory in Spyder IDE, we need to follow the
below steps:
Note: We can set any directory as a working directory, but it must contain the required
dataset.
Here, in the below image, we can see the Python file along with the required dataset.
Now, the current folder is set as a working directory.
read_csv() function:
Now to import the dataset, we will use the read_csv() function of the pandas library, which
is used to read a CSV file and perform various operations on it. Using this function, we can
read a CSV file locally as well as through a URL.
data_set= pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable used to store our dataset, and inside the function, we have
passed the name of our dataset. Once we execute the above line of code, it will successfully
import the dataset in our code. We can also check the imported dataset by clicking on the
section variable explorer, and then double click on data_set. Consider the below image:
As in the above image, indexing is started from 0, which is the default indexing in
Python. We can also change the format of our dataset by clicking on the format
option.
x= data_set.iloc[:,:-1].values
In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is
for all the columns. Here we have used :-1, because we don't want to take the last column
as it contains the dependent variable. So by doing this, we will get the matrix of features.
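As a sketch, the same extraction can be tried on a small hypothetical stand-in for Dataset.csv (the column names mirror the sample table from Practical No. 2); for completeness it also pulls out the last column as the dependent variable:

```python
import pandas as pd

# Hypothetical stand-in for Dataset.csv, mirroring the sample table
data_set = pd.DataFrame({
    "Country":   ["India", "Germany", "France", "Germany"],
    "Age":       [38, 30, 48, 40],
    "Salary":    [48000, 54000, 65000, None],   # one missing value
    "Purchased": ["No", "No", "No", "Yes"],
})

# All rows, every column except the last -> matrix of independent features
x = data_set.iloc[:, :-1].values
# All rows, only the last column -> dependent variable
y = data_set.iloc[:, -1].values
print(x.shape)   # (4, 3)
print(list(y))   # ['No', 'No', 'No', 'Yes']
```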
The next step of data preprocessing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning
model. Hence it is necessary to handle missing values present in the dataset.
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values. In
this way, we just delete the specific row or column which consists of null values. But this
way is not so efficient, and removing data may lead to loss of information, which will not
give an accurate output.
By calculating the mean: In this way, we will calculate the mean of that column or row
which contains any missing value and will put it on the place of missing value. This strategy
is useful for the features which have numeric data such as age, salary, year, etc. Here, we will
use this approach.
To handle missing values, we will use Scikit-learn library in our code, which contains
various libraries for building machine learning models. Here we will use Imputer class
of sklearn.preprocessing library. Below is the code for it:
#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
#Fitting imputer object to the independent variables x
imputer = imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
Output:
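In recent scikit-learn versions the Imputer class shown above was removed in favor of SimpleImputer from sklearn.impute; a minimal equivalent sketch, using hypothetical Age/Salary values matching the sample table:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age/Salary columns with one missing salary, as in the sample dataset
x = np.array([[38.0, 48000.0],
              [30.0, 54000.0],
              [48.0, 65000.0],
              [40.0, np.nan]])

# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
x = imputer.fit_transform(x)
print(x[3, 1])  # the missing salary becomes (48000 + 54000 + 65000) / 3
```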
Categorical data is data which has some categories; in our dataset, there are two
categorical variables, Country and Purchased.
Since a machine learning model completely works on mathematics and numbers, a
categorical variable in the dataset may create trouble while building the model. So it is
necessary to encode these categorical variables into numbers.
Firstly, we will convert the Country variable into numeric data. To do this, we will
use the LabelEncoder() class from the sklearn.preprocessing library.
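A minimal sketch of this encoding step, using the country values from the sample dataset:

```python
from sklearn.preprocessing import LabelEncoder

countries = ["India", "Germany", "France", "Germany"]

# LabelEncoder assigns an integer to each category (sorted alphabetically:
# France -> 0, Germany -> 1, India -> 2)
encoder = LabelEncoder()
encoded = encoder.fit_transform(countries)
print(list(encoded))  # [2, 1, 0, 1]
```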
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and test set.
This is one of the crucial steps of data preprocessing as by doing this, we can enhance the
performance of our machine learning model.
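The split is usually done with scikit-learn's train_test_split function; a minimal sketch with hypothetical toy data:

```python
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels (hypothetical values, for illustration only)
X = [[i] for i in range(10)]
y = [0, 1] * 5

# Hold back 20% of the rows as the test set; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))  # 8 2
```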
Conclusion: -
Viva Question:
Signature
Practical No. 4
Objective:
• To understand the programming constructs in python
• To understand basic statistics concepts and implement its formulae using
python
Theory:
Numeric Types:
Integers:
In Python 3, there is effectively no limit to how long an integer value
can be. Of course, it is constrained by the amount of memory your
system has.
>>> print(10)
10
>>> type(10)
<class 'int'>
Floating-Point Numbers:
>>> 4.2
4.2
>>> .4e7
4000000.0
Complex Numbers
Complex numbers are specified as <real part>+<imaginary part>j.
>>> 2+3j
(2+3j)
>>> type(2+3j)
<class 'complex'>
Strings:
Strings are sequences of character data. The string type in Python is called str. String literals
may be delimited using either single or double quotes. All the characters between the opening
delimiter and matching closing delimiter are part of the string.
A string in Python can contain as many characters as you wish. The only limit is your
machine's memory resources.
>>> print("I am a string.")
I am a string.
>>> type("I am a string.")
<class 'str'>
A raw string literal is preceded by r or R, which specifies that escape sequences in the
associated string are not translated; the backslash character is left in the string.
>>> print('foo\nbar')
foo
bar
>>> print(r'foo\nbar')
foo\nbar
>>> print('foo\\bar')
foo\bar
Boolean Type:
Python 3 provides a Boolean data type. Objects of Boolean type may have one of two values,
True or False.
>>> type(True)
<class 'bool'>
>>> type(False)
<class 'bool'>
Python List:
A list is an ordered sequence of items. It is one of the most used datatypes in Python and is
very flexible. All the items in a list do not need to be of the same type. Declaring a list is
pretty straightforward: items separated by commas are enclosed within brackets [ ]. We can
use the slicing operator [ ] to extract an item or a range of items from a list. Indexing starts
from 0 in Python.
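The points above can be tried in a short sketch:

```python
# A list may mix item types; indexing starts from 0
a = [1, 2.2, "python"]
print(a[0])    # 1
print(a[-1])   # python (negative indices count from the end)

b = [5, 10, 15, 20, 25, 30]
print(b[1:4])  # [10, 15, 20] - from index 1 up to, not including, index 4
b[1] = 99      # lists are mutable, so items can be reassigned
print(b[:2])   # [5, 99]
```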
Python Tuple:
A tuple is an ordered sequence of items, same as a list. The only difference is that tuples are
immutable: tuples once created cannot be modified. Tuples are used to write-protect data and
are usually faster than lists as they cannot change dynamically. A tuple is defined within
parentheses () where items are separated by commas. We can use the slicing operator [] to
extract items, but we cannot change their values.
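A short sketch of tuple behaviour:

```python
t = (5, "program", 1 + 3j)

# Slicing works exactly as with lists
print(t[1])     # program
print(t[0:2])   # (5, 'program')

# But item assignment raises TypeError, because tuples are immutable
try:
    t[0] = 10
except TypeError:
    print("cannot modify a tuple")
```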
Python Set:
A set is an unordered collection of unique items. A set is defined by values separated by
commas inside braces { }. Items in a set are not ordered. We can perform set operations like
union and intersection on two sets. Sets have unique values; they eliminate duplicates. Since
a set is an unordered collection, indexing has no meaning; hence the slicing operator []
does not work.
>>> a = {1,2,2,3,3,3}
>>> a
{1, 2, 3}
Python Dictionary:
A dictionary is an unordered collection of key-value pairs, defined within braces { } with
each item being a pair of the form key: value. Values are retrieved by key rather than by
position.
>>> d = {'a': 1, 'b': 2}
>>> d['a']
1
Operators in Python
Operators are used to perform operations on variables and values. Python divides the
operators in the following groups:
1. Arithmetic operators
Arithmetic operators are used to perform mathematical operations like addition, subtraction,
multiplication and division.
2. Comparison/Relational operators
Relational operators compare values. They return either True or False according to the
condition.
> Greater than: True if the left operand is greater than the right (x > y)
< Less than: True if the left operand is less than the right (x < y)
>= Greater than or equal to: True if the left operand is greater than or equal to the right (x >= y)
<= Less than or equal to: True if the left operand is less than or equal to the right (x <= y)
== Equal to: True if both operands are equal (x == y)
!= Not equal to: True if the operands are not equal (x != y)
3. Logical operators
Logical operators perform Logical AND, Logical OR and Logical NOT operations.
and Logical AND: True if both the operands are true x and y
or Logical OR: True if either of the operands is true x or y
not Logical NOT: True if operand is false not x
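The three operator groups can be demonstrated in a short sketch:

```python
x, y = 10, 3

# Arithmetic operators
print(x + y, x - y, x * y)    # 13 7 30
print(x / y)                  # 3.333... (division always yields a float)
print(x // y, x % y, x ** y)  # 3 1 1000 (floor division, modulus, power)

# Relational operators return True or False
print(x > y, x <= y)          # True False

# Logical operators combine boolean expressions
print(x > 5 and y > 5)        # False (both must be true)
print(x > 5 or y > 5)         # True  (either may be true)
print(not x > 5)              # False
```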
Control Statements in Python
Control statements determine the order in which the statements of a program are executed.
Python provides conditional statements (if, elif, else) and loop statements (for, while),
along with break and continue to alter loop execution.
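A minimal sketch of the main control statements (conditionals and loops):

```python
# if/elif/else: choose a branch based on a condition
n = 7
if n % 2 == 0:
    parity = "even"
else:
    parity = "odd"
print(n, "is", parity)   # 7 is odd

# for loop: iterate over a sequence of values
total = 0
for i in range(1, 6):    # 1, 2, 3, 4, 5
    total += i
print(total)             # 15

# while loop: repeat as long as the condition stays true
count = 3
while count > 0:
    count -= 1
print(count)             # 0
```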
Anaconda is a free and open-source distribution of the Python and R programming languages
for scientific computing, machine learning and data science that aims to simplify package
management and deployment.
Jupyter Notebook
Notebook documents are a representation of all content visible in the web application,
including inputs and outputs of the computations, explanatory text, mathematics, images,
and rich media representations of objects.
PyCharm IDE
PyCharm is an integrated development environment (IDE) used in computer programming,
specifically for the Python language. It is developed by JetBrains. It provides code
analysis, a graphical debugger, an integrated unit tester, integration with version control
systems, and supports web development with Django as well as data science with Anaconda.
Google COLAB
Colaboratory is a research tool for machine learning education and research. It‘s a Jupyter
notebook environment that requires no setup to use. Google Colab is a free cloud service and
now it supports free GPU! You can: improve your Python programming language coding
skills. To start working with Colab you first need to log in to your google account, then go to
this link https://colab.research.google.com.
There are three main measures of central tendency: the mode, the median and the mean. Each
of these measures describes a different indication of the typical or central value in the
distribution.
Mean
The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average. Looking at the retirement age
distribution again:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The mean is (54 + 54 + 54 + 55 + 56 + 57 + 57 + 58 + 58 + 60 + 60) / 11 = 623 / 11 ≈ 56.6 years.
Median
The median is the middle value in distribution when the values are arranged in ascending or
descending order.
In a distribution with an odd number of observations, the median value is the middle value.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
Looking at the retirement age distribution (which has 11 observations), the median is the
middle value, which is 57 years.
When the distribution has an even number of observations, the median value is the mean of
the two middle values. In the following distribution,
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The middle two values are 56 and 57; therefore the median equals 56.5 years.
Mode
The mode is the most commonly occurring value in a distribution. Consider this dataset
showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
Measure of Dispersion
Measures of spread describe how similar or varied the set of observed values are for a
particular variable (data item). Measures of spread include the range, quartiles and the
interquartile range, variance and standard deviation. The spread of the values can be
measured for quantitative data, as the variables are numeric and can be arranged into a logical
order with a low end value and a high end value.
The variance and the standard deviation are measures of the spread of the data around the
mean. They summarise how close each observed data value is to the mean value.
In datasets with a small spread all values are very close to the mean, resulting in a small
variance and standard deviation. Where a dataset is more dispersed, values are spread further
away from the mean, leading to a larger variance and standard deviation.
The smaller the variance and standard deviation, the more the mean value is indicative of the
whole dataset. Therefore, if all values of a dataset are the same, the standard deviation and
variance are zero.
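The measures above can be computed with Python's built-in statistics module; a sketch using the retirement-age data:

```python
import statistics

# Retirement ages from the example distribution above
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

mean = statistics.mean(ages)           # arithmetic average
median = statistics.median(ages)       # middle value of the sorted data
mode = statistics.mode(ages)           # most frequently occurring value
variance = statistics.pvariance(ages)  # population variance
stdev = statistics.pstdev(ages)        # population standard deviation

print(round(mean, 1), median, mode)    # 56.6 57 54
```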
Programming Steps:
Program:
Conclusion: -
Viva Question:
Signature
Department of Information Technology
Practical No. 5
Aim: To study Python libraries for ML applications such as Pandas and Matplotlib.
Objective:
To understand data preprocessing and analysis using Pandas library
To understand data visualization in the form of 2D graphs and plots using
Matplotlib library
Python libraries commonly used for machine learning include:
Numpy
Scipy
Scikit-learn
Theano
TensorFlow
Keras
PyTorch
Pandas
Matplotlib
Pandas makes importing, analyzing, and visualizing data much easier. It builds on packages
like NumPy and Matplotlib to give you a single, convenient place to do most of your data
analysis and visualization work.
There are many benefits of Python Pandas library, listing them all would probably take more
time than what it takes to learn the library. Therefore, these are the core advantages of using
the Pandas library:
1) Data representation
Pandas provides extremely streamlined forms of data representation. This helps to analyze
and understand data better. Simpler data representation facilitates better results for data
science projects.
It is one of the best advantages of Pandas. What would have taken multiple lines in Python
without any support libraries can simply be achieved through 1-2 lines with the use of
Pandas.
Thus, using Pandas helps to shorten the procedure of handling data. With the time saved, we
can focus more on data analysis algorithms.
Pandas is really powerful. It provides you with a huge set of important commands and
features which are used to easily analyze your data. We can use Pandas to perform various
tasks like filtering the data according to certain conditions, or segmenting and segregating
the data according to preference, etc.
Wes McKinney, the creator of Pandas, made the Python library mainly to handle large
datasets efficiently. Pandas helps to save a lot of time by importing large amounts of data
very fast.
Pandas provides a huge feature set to apply to the data you have so that you can customize,
edit and pivot it according to your own will and desire. This helps to bring the most out of
your data.
Python programming has become one of the most sought after programming languages in the
world, with its extensive amount of features and the sheer amount of productivity it provides.
Therefore, being able to code with Pandas in Python enables you to tap into the power of the
various other features and libraries which you can use with Python. Some of these libraries
are NumPy, SciPy, Matplotlib, etc.
Pandas Library
The primary two components of pandas are the Series and DataFrame.
DataFrames and Series are quite similar in that many operations that you can do with one you
can do with the other, such as filling in null values and calculating the mean.
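A minimal sketch of the two components, with hypothetical values:

```python
import pandas as pd

# A Series is a one-dimensional labelled array
s = pd.Series([10, 20, 30], name="sales")

# A DataFrame is a table whose columns are themselves Series
df = pd.DataFrame({
    "apples":  [3, 2, 0],
    "oranges": [0, 3, 7],
})

# Many operations work identically on both
print(s.mean())                    # 20.0
print(df["apples"].mean())         # 1.666...
print(s.fillna(0).tolist())        # filling null values works on either
```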
With CSV files, all you need is a single line to load in the data:
df = pd.read_csv('purchases.csv')
df
For the movies dataset used below, the movie titles are additionally designated as the index
(via the index_col argument of read_csv).
The first thing to do when opening a new dataset is print out a few rows to keep as a visual
reference. We accomplish this with .head():
movies_df.head()
Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):
movies_df.shape
Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So we
have 1000 rows and 11 columns in our movies DataFrame.
You'll be going to .shape a lot when cleaning and transforming data. For example, you might
filter some rows based on some criteria and then want to know quickly how many rows were
removed.
Handling duplicates
This dataset does not have duplicate rows, but it is always important to verify you aren't
aggregating duplicate rows.
To demonstrate, let's simply just double up our movies DataFrame by appending it to itself:
temp_df = movies_df.append(movies_df)
temp_df.shape
Out: (2000, 11)
Using append() will return a copy without affecting the original DataFrame. We are capturing
this copy in temp_df so we aren't working with the real data.
Notice that calling .shape quickly proves our DataFrame rows have doubled. Now we can try
dropping duplicates:
dropping duplicates:
temp_df = temp_df.drop_duplicates()
temp_df.shape
Out:
(1000, 11)
Just like append(), the drop_duplicates() method will also return a copy of your DataFrame,
but this time with duplicates removed. Calling .shape confirms we're back to the 1000 rows
of our original dataset.
It's a little verbose to keep assigning DataFrames to the same variable like in this example.
For this reason, pandas has the inplace keyword argument on many of its methods. Using
inplace=True will modify the DataFrame object in place:
temp_df.drop_duplicates(inplace=True)
Another important argument for drop_duplicates() is keep, which has three possible options:
'first', 'last', and False.
Since we didn't define the keep argument in the previous example, it defaulted to 'first'.
This means that if two rows are the same, pandas will drop the second row and keep the first
row. Using 'last' has the opposite effect: the first row is dropped.
keep=False, on the other hand, will drop all duplicates. If two rows are the same, then both
will be dropped. Watch what happens to temp_df:
temp_df.shape
Out:
(0, 11)
Since all rows were duplicates, keep=False dropped them all resulting in zero rows being left
over. If you're wondering why you would want to do this, one reason is that it allows you to
locate all duplicates in your dataset. When conditional selections are shown below you'll see
how to do that.
Column cleanup
Many times datasets will have verbose column names with symbols, upper and lowercase
words, spaces, and typos. To make selecting data by column name easier we can spend a little
time cleaning up their names.
movies_df.columns
Out:
Index([...], dtype='object')
Not only does .columns come in handy if you want to rename columns by allowing for simple
copy and paste, it's also useful if you need to understand why you are receiving a KeyError
when selecting data by column.
We can use the .rename() method to rename certain or all columns via a dict. We don't want
parentheses in the column names, so let's rename those:
movies_df.rename(columns={
}, inplace=True)
movies_df.columns
Out:
Excellent. But what if we want to lowercase all names? Instead of using .rename() we could
also set a list of names to the columns like so:
movies_df.columns
Out:
But that's too much work. Instead of just renaming each column manually we can do a list
comprehension:
Out:
list (and dict) comprehensions come in handy a lot when working with pandas and data in
general.
It's a good idea to lowercase, remove special characters, and replace spaces with underscores
if you'll be working with a dataset for some time.
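A sketch of such a list comprehension, on a hypothetical DataFrame with messy column names:

```python
import pandas as pd

# A hypothetical DataFrame with messy column names
df = pd.DataFrame({"Rank": [1], "Runtime (Minutes)": [120]})

# One list comprehension lowercases every name and replaces spaces with underscores
df.columns = [col.lower().replace(" ", "_") for col in df.columns]
print(list(df.columns))  # ['rank', 'runtime_(minutes)']
```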
To make necessary statistical inferences, it becomes necessary to visualize your data, and
Matplotlib is one such solution for Python users. It is a very powerful plotting library
useful for those working with Python and NumPy. The most used module of Matplotlib is
Pyplot, which provides an interface like MATLAB but uses Python and is open source.
General Concepts:
Figure: It is a whole figure which may contain one or more than one axes (plots). You can
think of a Figure as a canvas which contains plots.
Axes: It is what we generally think of as a plot. A Figure can contain many Axes. It contains
two or three (in the case of 3D) Axis objects. Each Axes has a title, an x-label and a y-label.
Axis: They are the number line like objects and take care of generating the graph limits.
Artist: Everything which one can see on the figure is an artist like Text objects, Line2D
objects, collection objects. Most Artists are tied to Axes.
Matplotlib Library
Pyplot is a module of Matplotlib which provides simple functions to add plot elements like
lines, images, text, etc. to the current axes in the current figure.
plot(x-axis values, y-axis values) — plots a simple line graph with x-axis values against y-
axis values
set_title(―string‖) — an axes level method used to set the title of subplots in a figure
xticks(index, categorical variables) — Get or set the current tick locations and labels of the x-
axis
xlim(start value, end value) — used to set the limit of values of the x-axis
ylim(start value, end value) — used to set the limit of values of the y-axis
scatter(x-axis values, y-axis values) — plots a scatter plot with x-axis values against y-axis
values
set_xlabel("string") — axes level method used to set the x-label of the plot specified as a string
set_ylabel("string") — axes level method used to set the y-label of the plot specified as a string
scatter3D(x-axis values, y-axis values) — plots a three-dimensional scatter plot with x-axis
values against y-axis values
plot3D(x-axis values, y-axis values) — plots a three-dimensional line graph with x- axis
values against y-axis values
Here we import Matplotlib's Pyplot module and the NumPy library, as most of the data that we
will be working with will be in the form of arrays.
We pass two arrays as our input arguments to Pyplot's plot() method and use the show() method
to invoke the required plot. Note that the first array appears on the x-axis and the second
array appears on the y-axis of the plot. Now that our first plot is ready, let us add the title and
name the x-axis and y-axis using the methods title(), xlabel() and ylabel() respectively.
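A minimal sketch of these steps (the array values here are just illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a lab session show() opens a window
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([1, 4, 9, 16])

plt.plot(x, y)            # first array goes on the x-axis, second on the y-axis
plt.title("Squares")      # title of the plot
plt.xlabel("x")           # x-axis name
plt.ylabel("x squared")   # y-axis name
# plt.show() would display the figure in an interactive session
```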
We can also specify the size of the figure using the method figure() and passing the values as a
tuple of the length of rows and columns to the argument figsize.
With every X and Y argument, you can also pass an optional third argument in the form of a
string which indicates the colour and line type of the plot. The default format is 'b-', which
means a solid blue line. In the figure below we use 'go', which means green circles. Likewise,
we can make many such combinations to format our plot.
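A small sketch of the format-string idea (the data is made up):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for a headless run
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 5)
# Third argument 'go' = green circles; the default 'b-' is a solid blue line
line, = plt.plot(x, x ** 2, "go")
print(line.get_color())   # 'g'
print(line.get_marker())  # 'o'
```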
We can also plot multiple sets of data by passing in multiple sets of arguments of X and Y
axis in the plot()method as shown.
We can use the subplot() method to add more than one plot in one figure. In the image below,
we used this method to separate the two graphs which we plotted on the same axes in the
previous example. The subplot() method takes three arguments: nrows, ncols and
index. They indicate the number of rows, the number of columns and the index number of the
sub-plot. For instance, in our example, we want to create two sub-plots in one figure such that
they come in one row and two columns, and hence we pass arguments (1,2,1) and (1,2,2) in the
subplot() method. Note that we have separately used the title() method for both the subplots. We
use the suptitle() method to make a centralized title for the figure.
If we want our sub-plots in two rows and single column, we can pass arguments (2,1,1)
and (2,1,2)
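The one-row, two-column layout described above can be sketched like this (sample data is made up):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for a headless run
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 5)

plt.subplot(1, 2, 1)   # nrows=1, ncols=2, first sub-plot
plt.plot(x, x)
plt.title("Linear")

plt.subplot(1, 2, 2)   # nrows=1, ncols=2, second sub-plot
plt.plot(x, x ** 2)
plt.title("Quadratic")

plt.suptitle("Two sub-plots in one figure")  # centralized title for the figure
```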
The above way of creating subplots becomes a bit tedious when we want many subplots in
our figure. A more convenient way is to use the subplots() method. Notice the extra 's'
in the method name. This method takes two arguments, nrows and ncols, as the number of rows
and the number of columns respectively. It creates two objects, figure and axes, which
we store in the variables fig and ax; these can be used to change the figure-level and axes-level
attributes respectively. Note that these variable names are chosen arbitrarily.
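A short sketch of the subplots() approach under the same assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for a headless run
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 5)

# subplots() returns a Figure and an array of Axes objects in one call
fig, ax = plt.subplots(nrows=2, ncols=1)
ax[0].plot(x, x)
ax[0].set_title("Linear")       # axes-level title for the first sub-plot
ax[1].plot(x, x ** 2)
ax[1].set_title("Quadratic")    # axes-level title for the second sub-plot
fig.suptitle("Created with subplots()")
```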
Program:
Conclusion:-
Signature
Department of Information Technology
Viva Question:
Practical No: 6
Aim: Implement the FIND-S algorithm for finding the most specific hypothesis based on a
given set of training data samples. Read the training data from a .CSV file.
In order to understand the Find-S algorithm, you need to have a basic idea of the following
concepts as well:
Concept Learning
General Hypothesis
Specific Hypothesis
1. Concept Learning
Let us understand concept learning with a real-life example. Most human learning is
based on past instances or experiences. For example, we are able to identify any type of
vehicle based on a certain set of features like make, model, etc., that are defined over a large
set of features.
These special features differentiate the set of cars, trucks, etc from the larger set of vehicles.
These features that define the set of cars, trucks, etc are known as concepts.
Similar to this, machines can also learn from concepts to identify whether an object belongs
to a specific category or not. Any algorithm that supports concept learning requires the
following:
• Training Data
• Target Concept
• Actual Data Objects
2. General Hypothesis
A hypothesis, in general, states the relationship between the major variables without pinning
down the details. For example, 'I want a burger' is a general hypothesis. The most general
hypothesis is represented as:
G = {'?', '?', '?', ..., '?'}
3. Specific Hypothesis
The specific hypothesis fills in all the important details about the variables given in the
general hypothesis. A more specific version of the example above would be: 'I want a
cheeseburger with a chicken pepperoni filling with a lot of lettuce.' The most specific
hypothesis is represented as:
S = {'Φ', 'Φ', 'Φ', ..., 'Φ'}
Now that we are done with the basic explanation of the Find-S algorithm, let us take a look at
how it works.
Looking at the data set, we have six attributes and a final attribute that defines the positive or
negative example. In this case, yes is a positive example, which means the person will go for
a walk.
We start with the most specific hypothesis as our initial hypothesis, and now we will consider
each example one by one, but only the positive examples.
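A minimal sketch of FIND-S on a hypothetical EnjoySport-style table (in the practical itself the rows would be read from the .CSV file, e.g. with csv.reader):

```python
# Hypothetical training data: six attributes plus a final Yes/No label
data = [
    ["Sunny", "Warm", "Normal", "Strong", "Warm", "Same", "Yes"],
    ["Sunny", "Warm", "High",   "Strong", "Warm", "Same", "Yes"],
    ["Rainy", "Cold", "High",   "Strong", "Warm", "Change", "No"],
    ["Sunny", "Warm", "High",   "Strong", "Cool", "Change", "Yes"],
]

# Start with the most specific hypothesis: every attribute is Φ
hypothesis = ["Φ"] * 6

for row in data:
    if row[-1] == "Yes":               # consider only the positive examples
        for i, value in enumerate(row[:-1]):
            if hypothesis[i] == "Φ":   # first positive example initialises h
                hypothesis[i] = value
            elif hypothesis[i] != value:
                hypothesis[i] = "?"    # generalise where attributes disagree

print(hypothesis)  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```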
Program:
Output:
Conclusion: -
Signature
Department of Information Technology
Viva Question:
Practical No: 7
Objective:
To understand the concept of simple linear regression
To apply simple linear regression on an actual dataset to do prediction
Theory:
Types of Learning
A machine is said to be learning from past experiences (data fed in) with respect to some
class of tasks if its performance in a given task improves with the experience.
1) Supervised Learning
How it works: This algorithm consists of a target/outcome variable (or dependent variable)
which is to be predicted from a given set of predictors (independent variables). Using this
set of variables, we generate a function that maps inputs to the desired outputs. The training
process continues until the model achieves a desired level of accuracy on the training data.
Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN,
Logistic Regression etc.
2) Unsupervised Learning
How it works: In this algorithm, we do not have any target or outcome variable to predict or
estimate. It is used for clustering a population into different groups, which is widely used for
segmenting customers into different groups for specific interventions. Examples of
Unsupervised Learning: Apriori algorithm, K-means.
3) Reinforcement Learning:
How it works: Using this algorithm, the machine is trained to make specific decisions. It
works this way: the machine is exposed to an environment where it trains itself continually
using trial and error. This machine learns from past experience and tries to capture the best
possible knowledge to make accurate business decisions. Example of Reinforcement
Learning: Markov Decision Process
4) Semi-Supervised Learning
In this type of learning, the algorithm is trained upon a combination of labeled and unlabeled
data. Typically, this combination will contain a very small amount of labeled data and a very
large amount of unlabeled data. The basic procedure involved is that first, the programmer
will cluster similar data using an unsupervised learning algorithm and then use the existing
labeled data to label the rest of the unlabeled data.
Example: Gmail classifies mails in more than one classes like social, promotions, updates,
forum.
For example, consider predicting wind speed: the output does not take discrete values but is
continuous within a particular range. The goal here is to predict a value as close to the actual
output value as our model can, and evaluation is then done by calculating the error value. The
smaller the error, the greater the accuracy of our regression model.
• Linear Regression
• Nearest Neighbor
• Gaussian Naive Bayes
• Decision Trees
• Support Vector Machine (SVM)
• Random Forest
Regression Analysis
Regression analysis is a powerful statistical method that allows you to examine the
relationship between two or more variables of interest. While there are many types of
regression analysis, at their core they all examine the influence of one or more independent
variables on a dependent variable.
In order to understand regression analysis fully, it's essential to comprehend the following
terms:
Dependent Variable: This is the main factor that you're trying to understand or predict.
Independent Variables: These are the factors that you hypothesize have an impact on your
dependent variable.
There are multiple benefits of using regression analysis. Among them, regression analysis
allows us to compare the effects of variables measured on different scales, such as the effect
of price changes and the number of promotional activities. These benefits help market
researchers / data analysts / data scientists to eliminate and evaluate the best set of variables
to be used for building predictive models.
There are various kinds of regression techniques available to make predictions. These
techniques are mostly driven by three metrics (number of independent variables, type of
dependent variables and shape of regression line).
Types of Regression:
• Linear Regression
• Logistic Regression
• Polynomial Regression
• Stepwise Regression
• Ridge Regression
Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually among
the first few topics which people pick while learning predictive modeling. In this technique,
the dependent variable is continuous, the independent variable(s) can be continuous or
discrete, and the nature of the regression line is linear.
Linear Regression establishes a relationship between dependent variable (Y) and one or more
independent variables (X) using a best fit straight line (also known as regression line).
It is used to estimate real values (cost of houses, number of calls, total sales etc.) based on
continuous variable(s). Here, we establish relationship between independent and dependent
variables by fitting a best line. This best fit line is known as regression line and represented
by a linear equation
Y = a*X + b
The best way to understand linear regression is to relive this experience of childhood. Let us
say, you ask a child in fifth grade to arrange people in his class by increasing order of weight,
without asking them their weights! What do you think the child will do? He / she would
likely look (visually analyze) at the height and build of people and arrange them using a
combination of these visible parameters. This is linear regression in real life! The child has
actually figured out that height and build would be correlated to the weight by a relationship,
which looks like the equation above.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
These coefficients a and b are derived by minimizing the sum of squared differences of the
distances between the data points and the regression line.
Look at the example below. Here we have identified the best fit line having the linear
equation y = 0.2811x + 13.9. Now, using this equation, we can find the weight, knowing the
height of a person.
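Plugging a height into that fitted equation gives the predicted weight, for example:

```python
# Predicting weight from height with the fitted line y = 0.2811*x + 13.9
def predict_weight(height_cm):
    return 0.2811 * height_cm + 13.9

# For a height of 160 cm the line predicts roughly 58.88 kg
print(round(predict_weight(160), 2))  # 58.88
```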
Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear
Regression. Simple Linear Regression is characterized by one independent variable, while
Multiple Linear Regression (as the name suggests) is characterized by multiple (more than
one) independent variables. While finding the best fit line, you can also fit a polynomial or
curvilinear relationship; these are known as polynomial or curvilinear regression.
The difference between simple linear regression and multiple linear regression is that,
multiple linear regression has (>1) independent variables, whereas simple linear regression
has only 1 independent variable.
In simple linear regression, each observation consists of two values. One value is for the
dependent variable and one value is for the independent variable.
Simple Linear Regression Analysis: The simplest form of a regression analysis uses one
dependent variable (y) and one independent variable (x). In this simple model, a straight line
approximates the relationship between the dependent variable and the independent variable.
The simple linear regression equation is represented as E(y) = β0 + β1x, and it is graphed as
a straight line.
A regression line can show a positive linear relationship, a negative linear relationship, or no
relationship. If the graphed line in a simple linear regression is flat (not sloped), there is no
relationship between the two variables. If the regression line slopes upward with the lower
end of the line at the y intercept (axis) of the graph, and the upper end of line extending
upward into the graph field, away from the x intercept (axis) a positive linear relationship
exists. If the regression line slopes downward with the upper end of the line at the y intercept
(axis) of the graph, and the lower end of line extending downward into the graph field,
toward the x intercept (axis) a negative linear relationship exists.
y = B0 + B1 * x
This is a line where y is the output variable we want to predict, x is the input variable we
know and B0 and B1 are coefficients that we need to estimate that move the line around.
Technically, B0 is called the intercept because it determines where the line intercepts the y-
axis. In machine learning we can call this the bias, because it is added to offset all predictions
that we make. The B1 term is called the slope because it defines the slope of the line or how x
translates into a y value before we add our bias.
The goal is to find the best estimates for the coefficients to minimize the errors in predicting
y from x.
Simple regression is great, because rather than having to search for values by trial and error
or calculate them analytically using more advanced linear algebra, we can estimate them
directly from our data.
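A small sketch of estimating the coefficients directly from data with the closed-form least-squares formulas (the sample points are made up so the answer is easy to check):

```python
# Hypothetical sample lying exactly on y = 2*x + 1, so the estimates are easy to verify
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# B1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
     sum((xi - mean_x) ** 2 for xi in x)
# B0 = mean_y - B1 * mean_x
b0 = mean_y - b1 * mean_x

print(b1, b0)  # 2.0 1.0
```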
Program:
Output:
Conclusion: -
Signature
Department of Information Technology
Viva Question:
1) What is linear regression?
2) What is the use of regularisation?
3) How to choose the value of the parameter learning rate (α)?
Practical No: 8
Aim: To create and evaluate a Naïve Bayes classification model for the Iris dataset.
Theory:
Classification Algorithms:
- Random Forest
- Decision Trees
- Naïve Bayes Classifier
- Logistic Regression
- Support Vector Machine
In this practical we will be focusing on Naïve Bayes classifier.
This classification algorithm is based on applying Bayes' theorem with strong (naïve)
independence assumptions between the features. Since it is based on Bayes' theorem, it is a
probabilistic classifier, which means that it predicts on the basis of the probability of an
object.
Naïve Bayes algorithm is one of the simplest and most effective classification algorithms that
can produce fast models with high accuracy.
In probability theory, Bayes theorem is used to describe the probability of an event based on
prior knowledge of conditions that might be related to the event. It depends on the concept of
conditional probability.
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is the Posterior Probability, or the probability of event A occurring given that event B is true.
P(B|A) is the Likelihood Probability, or the probability of event B occurring given that event A is true.
P(A) and P(B) are the probabilities of A and B individually, without any conditions.
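A quick numeric check of the theorem with made-up probabilities:

```python
# Hypothetical values: P(A) = 0.3, P(B|A) = 0.8, P(B) = 0.5
p_a = 0.3
p_b_given_a = 0.8
p_b = 0.5

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))  # 0.48
```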
Advantages of Naïve Bayes Classifier:
- It is one of the fastest and easiest Machine Learning models for classification
problems.
- It can be used for binary as well as multi-class classification.
Disadvantages of Naïve Bayes Classifier:
- It assumes that all features are independent, so it cannot learn the relationship
between features.
Types of Naïve Bayes Model:
- Gaussian: This model assumes that the features follow a Gaussian or normal
distribution.
- Multinomial: This is used when the data is multinomially distributed. It is primarily
used for document classification problems.
- Bernoulli: It works similar to the Multinomial classifier, but the predictor variables
are independent Boolean variables.
PROGRAM:
# importing all packages
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Loading the Iris dataset using the 'read_csv()' function from the pandas library
# The read_csv() function reads the contents of a dataset stored in .csv or
# comma separated values format
data = pd.read_csv("Iris.csv")
print(data.head())

# Separating the dependent variable and the independent variables using dataframe
# slicing techniques from the pandas library
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# The first parameter is the set of independent variables, the second parameter is the
# dependent variable
# The 'test_size' parameter determines the size of the testing dataset. Here we will be
# using 20% of the dataset for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Model creation
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_predictions = classifier.predict(X_test)
print(y_predictions)

# Evaluating the model on the test set
print(confusion_matrix(y_test, y_predictions))
print(accuracy_score(y_test, y_predictions))
OUTPUT:
CONCLUSION:
Signature
Department of Information Technology
Viva Questions