22CS503 Machine Learning - Unit - III
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document contains
proprietary information and is intended only to the respective group / learning
community as intended. If you are not the addressee you should not disseminate,
distribute or copy through e-mail. Please notify the sender immediately by e-mail
if you have received this document by mistake and delete this document from your
system. If you are not the intended recipient you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents of this
information is strictly prohibited.
22CS503 MACHINE LEARNING
(LAB INTEGRATED)
Department: CSE
Batch/Year: 2022-2026 / III
Created by: Mrs. D.M. KALAI SELVI & Mrs. V. SHARMILA
Date: 01.09.2024
1. TABLE OF CONTENTS
S.NO.  CONTENTS
1      CONTENTS
2      COURSE OBJECTIVES
OBJECTIVE:
To discuss the basics of Machine Learning and model evaluation.
To study dimensionality reduction techniques.
To understand the various classification algorithms.
To elaborate on unsupervised learning techniques.
To discuss the basics of neural networks and various types of learning.
TEXT BOOKS:
1. Saikat Dutt, Subramanian Chandramouli, Amit Kumar Das, “Machine Learning”, Pearson, 2019. (Unit 1 – Chap 1,2,3 / Unit 2 – Chap 4 / Unit 4 – Chap 9 / Unit 5 – Chap 10, 11)
2. Ethem Alpaydin, “Introduction to Machine Learning, Adaptive Computation and Machine Learning Series”, Third Edition, MIT Press, 2014. (Unit 2 – Chap 6 / Unit 4 – Chap 8.2.3 / Unit 5 – Chap 18)
REFERENCES:
1. Anuradha Srinivasaraghavan, Vincy Joseph, “Machine Learning”, First Edition, Wiley, 2019. (Unit 3 – Chap 7,8,9,10,11 / Unit 4 – Chap 13, 11.4, 11.5, 12)
2. Peter Harrington, “Machine Learning in Action”, Manning Publications, 2012.
3. Stephen Marsland, “Machine Learning – An Algorithmic Perspective”, Second Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014.
4. Tom M Mitchell, “Machine Learning”, First Edition, McGraw Hill Education, 2013.
5. Christoph Molnar, “Interpretable Machine Learning - A Guide for Making Black Box Models Explainable”, Creative Commons License, 2020.
6. NPTEL Courses: Introduction to Machine Learning - https://onlinecourses.nptel.ac.in/noc23_cs18/preview
Course Outcomes
5. COURSE OUTCOMES
(Course Code, Course Outcome Statement, and Cognitive/Affective Level of the Course Outcome)

CO–PO/PSO Mapping (Programme Outcomes PO1–PO12 and Programme Specific Outcomes PSO1–PSO3):
Course Outcome   Level   Mapping (PO1 … PO12, PSO1 … PSO3)
22CS503.1        K2      3, 3, 1, -, -, -, -, -, -, -, -, 2, 2, 2
22CS503.2        K3      3, 2, 1, -, -, -, -, -, -, -, -, 2, 2, 2
22CS503.3        K3      3, 2, 1, -, -, -, -, -, -, -, -, 2, 2, 2
22CS503.4        K3      3, 3, 2, -, -, -, -, -, -, -, -, 2, 2, 2
22CS503.5        K2      3, 2, 2, -, -, -, -, -, -, -, -, 2, 2, 2
UNIT III
SUPERVISED LEARNING
LECTURE PLAN – UNIT III

Sl. No | Topic | No. of Periods | Proposed Lecture Period | Actual Lecture Period | Pertaining CO(s) | Taxonomy Level | Mode of Delivery
7  | Decision Tree – Implementation of the Decision Tree algorithm based on the ID3 algorithm | 1 | 10.09.24 | 10.09.24 | CO3 | K3 | MD1, MD5
8  | Issues – Rule-based Classification | 1 | 11.09.24 | 11.09.24 | CO3 | K3 | MD1, MD5
9  | Pruning the Rule Set | 1 | 11.09.24 | 11.09.24 | CO3 | K3 | MD1, MD5
10 | Support Vector Machines | 1 | 12.09.24 | 12.09.24 | CO3 | K3 | MD1, MD5
11 | Linear Support Vector Machine – Optimal Hyperplane | 1 | 12.09.24 | 12.09.24 | CO3 | K3 | MD1, MD5
12 | Build Support Vector Machines using a dataset | 1 | 16.09.24 | 16.09.24 | CO3 | K3 | MD1, MD5
13 | Bayesian Belief Networks | 1 | 16.09.24 | 16.09.24 | CO3 | K3 | MD1, MD5
14 | – | 1 | 18.09.24 | 18.09.24 | CO3 | K3 | MD1, MD5
ACTIVITY 1:
Down
1. A post-prediction adjustment, typically to account for prediction bias.
2. A TensorFlow programming environment in which operations run immediately.
4. Obtaining an understanding of data by considering samples, measurement, and visualization.
5. An ensemble approach to finding the decision tree that best fits the training data.
7. State-action value function.
8. Loss function based on the absolute value of the difference between the values that a model is predicting and the actual values of the labels.

Across
3. In machine learning, a mechanism for bucketing categorical data.
6. The primary algorithm for performing gradient descent on neural networks.
9. Abbreviation for independently and identically distributed.
10. A metric that your algorithm is trying to optimize.
11. The recommended format for saving and recovering TensorFlow models.
12. The more common label in a class-imbalanced dataset.
13. Applying a constraint to an algorithm to ensure one or more definitions of fairness are all satisfied.
14. A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival.
15. When one number in your model becomes a NaN during training, which causes many or all other numbers in your model to eventually become a NaN.
16. In reinforcement learning, implementing Q-learning by using a table to store the Q-functions.
17. A popular Python machine learning API.
18. A process used, as part of training, to evaluate the quality of a machine learning model using the validation set.
19. A coefficient for a feature in a linear model, or an edge in a deep network.
20. A column-oriented data analysis API.
21. Abbreviation for generative adversarial network.
ACTIVITY BASED LEARNING
(MODEL BUILDING/PROTOTYPE)
Work Sheet
8. ACTIVITY BASED LEARNING : UNIT – III
Binary classification refers to those classification tasks that have two class labels.
Examples include email spam detection (spam or not spam), churn prediction (churn or not churn) and conversion prediction (buy or not buy).
The class for the normal state is assigned the class label 0 and the class with the
abnormal state is assigned the class label 1. It is common to model a binary
classification task with a model that predicts a Bernoulli probability distribution for
each example.
The Bernoulli distribution is a discrete probability distribution that covers a case where
an event will have a binary outcome as either a 0 or 1. For classification, this means
that the model predicts a probability of an example belonging to class 1, or the
abnormal state.
Some algorithms are specifically designed for binary classification and do not natively
support more than two classes; examples include Logistic Regression and Support
Vector Machines.
Next, let’s take a closer look at a dataset to develop an intuition for binary
classification problems.
We can use the make_blobs() function to generate a synthetic binary classification
dataset.
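As an illustration, a minimal sketch of this step (assuming scikit-learn is installed; the parameter values are only illustrative):

# Sketch: generate and inspect a synthetic binary classification dataset
from collections import Counter
from sklearn.datasets import make_blobs

# 1000 examples, 2 input features, 2 classes (labels 0 and 1)
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1)

print(X.shape, y.shape)      # (1000, 2) (1000,)
print(Counter(y))            # roughly 500 examples per class
for i in range(5):           # peek at the first few examples
    print(X[i], y[i])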
9. LECTURE NOTES : UNIT – III
SUPERVISED LEARNING
Syllabus:
Linear regression performs the task to predict a dependent variable value (y) based on a given
independent variable (x). So, this regression technique finds out a linear relationship between
x (input) and y (output); hence the name Linear Regression. In the figure above, X (input) is the work experience and Y (output) is the salary of a person. The regression line is the best-fit line for our model.
By finding the best-fit regression line, the model aims to predict y such that the error between the predicted value and the true value is minimum. So it is very important to update the θ1 and θ2 values to reach the values that minimize the error between the predicted y value (pred) and the true y value (y).
The cost function (J) of Linear Regression is the Root Mean Squared Error (RMSE) between the predicted y value (pred) and the true y value (y):
J = sqrt( (1/n) · Σ (pred_i − y_i)² )
Gradient Descent:
To update the θ1 and θ2 values in order to reduce the cost function (minimizing the RMSE value) and achieve the best-fit line, the model uses Gradient Descent. The idea is to start with random θ1 and θ2 values and then iteratively update them until the minimum cost is reached.
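A minimal NumPy sketch of this idea is given below (the data and learning rate are made up for illustration; θ1 is the intercept and θ2 the slope, and the gradient step is taken on the MSE, which minimizes the RMSE as well):

# Sketch: fitting y = theta1 + theta2 * x by gradient descent
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # e.g. years of experience
y = np.array([30.0, 35.0, 42.0, 48.0, 55.0])   # e.g. salary (in thousands)

theta1, theta2 = 0.0, 0.0                      # initial parameter values
lr, epochs = 0.01, 5000                        # learning rate and iterations
n = len(x)

for _ in range(epochs):
    pred = theta1 + theta2 * x                 # current predictions
    error = pred - y
    grad1 = (2.0 / n) * error.sum()            # d(MSE)/d(theta1)
    grad2 = (2.0 / n) * (error * x).sum()      # d(MSE)/d(theta2)
    theta1 -= lr * grad1                       # step against the gradient
    theta2 -= lr * grad2

rmse = np.sqrt(np.mean((theta1 + theta2 * x - y) ** 2))
print(theta1, theta2, rmse)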
This type of statistical model (also known as logit model) is often used for classification and
predictive analytics. Logistic regression estimates the probability of an event occurring, such
as voted or didn’t vote, based on a given dataset of independent variables. Since the outcome
is a probability, the dependent variable is bounded between 0 and 1. In logistic regression, a
logit transformation is applied on the odds—that is, the probability of success divided by the
probability of failure. This is also commonly known as the log odds, or the natural logarithm
of odds, and this logistic function is represented by the following formulas:
logit(pi) = ln( pi / (1 − pi) ) = β0 + β1·x, and equivalently pi = 1 / (1 + exp(−(β0 + β1·x)))
In this logistic regression equation, logit(pi) is the dependent or response variable and x is the
independent variable. The beta parameter, or coefficient, in this model is commonly estimated
via maximum likelihood estimation (MLE). This method tests different values of beta through
multiple iterations to optimize for the best fit of log odds. All of these iterations produce the
log likelihood function, and logistic regression seeks to maximize this function to find the best
parameter estimate. Once the optimal coefficient (or coefficients if there is more than one
independent variable) is found, the conditional probabilities for each observation can be
calculated, logged, and summed together to yield a predicted probability. For binary
classification, a probability less than .5 will predict 0 while a probability greater than .5 will predict 1. After the model has been computed, it’s best practice to evaluate how well the
model predicts the dependent variable, which is called goodness of fit. The Hosmer–Lemeshow
test is a popular method to assess model fit.
Log odds can be difficult to make sense of within a logistic regression data analysis. As a result,
exponentiating the beta estimates is common to transform the results into an odds ratio (OR),
easing the interpretation of results. The OR represents the odds that an outcome will occur
given a particular event, compared to the odds of the outcome occurring in the
absence of that event. If the OR is greater than 1, then the event is associated with higher odds of generating a specific outcome. Conversely, if the OR is less than 1, then the event is associated with lower odds of that outcome occurring. Based on the equation from above, the interpretation of an odds ratio can be stated as follows: the odds of success change by a factor of exp(c·β1) for every c-unit increase in x. To use an example, let’s say that
we were to estimate the odds of survival on the Titanic given that the person was male, and
the odds ratio for males was .0810. We’d interpret the odds ratio as the odds of survival of
males decreased by a factor of .0810 when compared to females, holding all other variables
constant.
There are three types of logistic regression models, which are defined based on the categorical response: binary logistic regression (two possible outcomes), multinomial logistic regression (three or more unordered categories) and ordinal logistic regression (three or more ordered categories).
Logistic regression is commonly used for prediction and classification problems. Some of these
use cases include:
Fraud detection: Logistic regression models can help teams identify data anomalies,
which are predictive of fraud. Certain behaviors or characteristics may have a higher
association with fraudulent activities, which is particularly helpful to banking and other
financial institutions in protecting their clients. SaaS-based companies have also started
to adopt these practices to eliminate fake user accounts from their datasets when
conducting data analysis around business performance.
Disease prediction: In medicine, this analytics approach can be used to predict the
likelihood of disease or illness for a given population. Healthcare organizations can set
up preventative care for individuals that show higher propensity for specific illnesses.
Churn prediction: Specific behaviors may be indicative of churn in different functions
of an organization. For example, human resources and management teams may want
to know if there are high performers within the company who are at risk of leaving the
organization; this type of insight can prompt conversations to understand problem
areas within the company, such as culture or compensation. Alternatively, the sales
organization may want to learn which of their clients are at risk of taking their business
elsewhere. This can prompt teams to set up a retention strategy to avoid lost revenue.
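As a rough sketch of the ideas above (assuming scikit-learn is available; the data here is synthetic, so the features have no real-world meaning), a logistic regression can be fitted and its coefficients exponentiated into odds ratios like this:

# Sketch: logistic regression on a synthetic binary problem, with odds ratios
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# fit the betas (sklearn adds L2 regularization by default)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = model.predict_proba(X_te)[:, 1]     # P(class = 1) for each test example
pred = (proba > 0.5).astype(int)            # threshold at 0.5
print("accuracy:", (pred == y_te).mean())

odds_ratios = np.exp(model.coef_[0])        # exponentiated betas -> odds ratios
print("odds ratios per feature:", odds_ratios)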
Below are some assumptions that we made while using decision tree:
At the beginning, we consider the whole training set as the root.
Feature values are preferred to be categorical. If the values are continuous then
they are discretized prior to building the model.
On the basis of attribute values records are distributed recursively.
We use statistical methods for ordering attributes as root or the internal node.
As you can see from the above image, a Decision Tree works on the Sum of Products form, which is also known as Disjunctive Normal Form. In the above image, we are predicting the use of a computer in the daily life of people. In a Decision Tree, the major challenge is the identification of the attribute for the root node at each level.
This process is known as attribute selection. We have two popular attribute selection
measures:
1. Information Gain
2. Gini Index
1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller
subsets the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values(A) is the set of all possible values of A; then
Gain(S, A) = Entropy(S) − Σ over v in Values(A) of ( |Sv| / |S| ) · Entropy(Sv)
Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the higher the information content.
Definition: If S is a set of instances and pi is the proportion of instances in S belonging to class i, then
Entropy(S) = − Σ over classes i of pi · log2(pi)
Example:
For the set X = {a, a, a, b, b, b, b, b}
Total instances: 8; instances of a: 3; instances of b: 5
Entropy(X) = − (3/8)·log2(3/8) − (5/8)·log2(5/8)
= − (−0.531 − 0.424)
= 0.954
The essentials:
Start with all training instances associated with the root node
Use info gain to choose which attribute to label each node with
Note: No root-to-leaf path should contain the same discrete attribute twice
Recursively construct each subtree on the subset of training instances that would
be classified down that path in the tree.
The border cases:
If all positive or all negative training instances remain, label that node “yes” or “no”
accordingly
If no attributes remain, label with a majority vote of training instances left at that
node
If no instances remain, label with a majority vote of the parent’s training instances
Example:
Now, lets draw a Decision Tree for the following data using Information gain.
Training set: 3 features and 2 classes
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II
Split on feature X
Split on feature Y
Split on feature Z
From the above splits we can see that the information gain is maximum when we split on feature Y. So the best-suited feature for the root node is Y. When the dataset is split by feature Y, each child contains a pure subset of the target variable, so we do not need to split the dataset any further.
The final tree for this dataset therefore has a single test on Y at the root: Y = 1 leads to class I and Y = 0 leads to class II.
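A small sketch that reproduces these calculations for the toy dataset (pure Python, no extra libraries needed):

# Sketch: entropy and information gain for the toy dataset (features X, Y, Z; class C)
from math import log2
from collections import Counter

data = [  # (X, Y, Z, C)
    (1, 1, 1, 'I'),
    (1, 1, 0, 'I'),
    (0, 0, 1, 'II'),
    (1, 0, 0, 'II'),
]

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(rows, feature_index):
    total = len(rows)
    gain = entropy(rows)
    for value in set(r[feature_index] for r in rows):
        subset = [r for r in rows if r[feature_index] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

for name, idx in (('X', 0), ('Y', 1), ('Z', 2)):
    print(name, round(info_gain(data, idx), 3))
# Y gives the largest gain (1.0), so Y is chosen as the root split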
2. Gini Index
Gini Index is a metric to measure how often a randomly chosen element would be
incorrectly identified.
This means an attribute with a lower Gini index should be preferred.
Sklearn supports the “gini” criterion for the Gini Index, and it is the default value of the criterion parameter.
The formula for the calculation of the Gini Index is given below:
Gini = 1 − Σ over classes i of (pi)²
Example:
Let us consider the dataset below (the table itself is not reproduced here) and draw a decision tree using the Gini index.
In this dataset there are 5 attributes (A, B, C, D, E), of which attribute E is the predicted feature containing 2 classes (Positive and Negative), present in equal proportion. In the Gini Index approach, we have to choose threshold values to categorize each of the attributes A, B, C and D.
For attribute A, with the threshold < 5:
          Positive   Negative
A < 5        3          1
Gini Index of A = 0.45825
Using the same approach we can calculate the Gini index for the C and D attributes.
For the split < 3.0:
          Positive   Negative
< 3.0        0          4
For the split < 4.2:
          Positive   Negative
< 4.2        8          2
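A small sketch showing how such a weighted Gini index of a split can be computed. The branch counts below assume 8 positive and 8 negative examples in total (an assumption, but it is consistent with the quoted value 0.45825 for attribute A):

# Sketch: weighted Gini index of a binary split, given class counts in each branch
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_of_split(branches):
    total = sum(sum(b) for b in branches)
    return sum((sum(b) / total) * gini(b) for b in branches)

# A < 5 : 3 positive, 1 negative ; A >= 5 : the remaining 5 positive, 7 negative
left = (3, 1)
right = (5, 7)
print(round(gini_of_split([left, right]), 5))   # ~0.458, matching the value above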
Rule-based classifiers are just another type of classifier which make the class decision by using various “if…else” rules. These rules are easily interpretable and thus these
classifiers are generally used to generate descriptive models. The condition used with “if” is
called the antecedent and the predicted class of each rule is called the consequent.
Properties of rule-based classifiers:
Coverage: The percentage of records which satisfy the antecedent conditions of a
particular rule.
The rules generated by the rule-based classifiers are generally not mutually
exclusive, i.e. many rules can cover the same record.
The rules generated by the rule-based classifiers may not be exhaustive, i.e. there
may be some records which are not covered by any of the rules.
The decision boundaries created by them are linear, but these can be much more complex than those of a decision tree because many rules may be triggered for the same record.
An obvious question that comes to mind, knowing that the rules are not mutually exclusive, is how the class would be decided in case different rules with different consequents cover the same record.
Either rules can be ordered, i.e. the class corresponding to the highest priority rule
triggered is taken as the final class.
Otherwise, we can assign votes for each class depending on their weights,
i.e. the rules remain unordered.
Example:
Below is the dataset to classify mushrooms as edible or poisonous:
Class    Cap Shape   Cap Surface   Bruises   Odour    Stalk Shape   Population   Habitat
Edible   convex      scaly         yes       almond   tapering      scattered    meadows
Edible   flat        fibrous       yes       anise    enlarging     several      woods
Edible   flat        fibrous       no        none     enlarging     several      urban
3.4.1 Rules:
The algorithm given below generates a model with unordered rules and ordered classes, i.e. we can decide which class to give priority to while generating the rules.

A  <- set of attributes
T  <- set of training records
Y  <- set of classes
Y' <- Y ordered according to relevance
R  <- set of rules generated, initially an empty list

for each class y in Y'
    while the majority of class y records are not covered
        generate a new rule for class y, using the methods given above
        add this rule to R
        remove the records covered by this rule from T
    end while
end for
add the default rule {} -> y', where y' is the default class
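A toy Python sketch of this covering loop is given below. It uses a deliberately simple rule learner that only considers single attribute = value tests covering records of one class, so it illustrates the control flow rather than the full method; the tiny dataset is made up:

# Sketch: sequential covering with a trivial one-condition rule learner
def learn_one_rule(records, target_class):
    # pick the (attribute, value) test covering the most target-class records
    # while covering no record of any other class
    best = None
    for attr in records[0]['features']:
        for value in {r['features'][attr] for r in records}:
            covered = [r for r in records if r['features'][attr] == value]
            if covered and all(r['label'] == target_class for r in covered):
                if best is None or len(covered) > len(best[2]):
                    best = (attr, value, covered)
    return best

def sequential_covering(records, ordered_classes):
    rules, remaining = [], list(records)
    for y in ordered_classes:
        while any(r['label'] == y for r in remaining):
            rule = learn_one_rule(remaining, y)
            if rule is None:                     # no clean rule left for this class
                break
            attr, value, covered = rule
            rules.append((attr, value, y))       # "if attr == value then class y"
            remaining = [r for r in remaining if r not in covered]
    rules.append(('default', None, ordered_classes[-1]))   # default rule {} -> y'
    return rules

data = [
    {'features': {'odour': 'almond', 'bruises': 'yes'}, 'label': 'edible'},
    {'features': {'odour': 'anise',  'bruises': 'yes'}, 'label': 'edible'},
    {'features': {'odour': 'foul',   'bruises': 'no'},  'label': 'poisonous'},
]
print(sequential_covering(data, ['poisonous', 'edible']))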
Note: The rule set can also be created indirectly by pruning (simplifying) other already generated models, such as a decision tree.
1. Rule Generation
Once a decision tree has been constructed, it is a simple matter to convert it into an equivalent
set of rules.
Converting a decision tree to rules before pruning has three main advantages: the pruning decision can be made separately for each context (path) in which a decision node is used; the distinction between attribute tests near the root and those near the leaves is removed; and the resulting rules are often easier for people to read and understand.
To generate rules, trace each path in the decision tree, from root node to leaf node, recording
the test outcomes as antecedents and the leaf-node classification as the consequent.
                 C1                C2                Marginal Sums
R1               x11               x12               R1T = x11 + x12
R2               x21               x22               R2T = x21 + x22
Marginal Sums    CT1 = x11 + x21   CT2 = x12 + x22   T = x11 + x12 + x21 + x22
The marginal sums and T, the total frequency of the table, are used to calculate expected
cell values in step 3 of the test for independence.
The general formula for obtaining the expected frequency of any cell xij, 1 <= i <= r, 1 <= j <= c, in a contingency table is given by:
Eij = (RiT × CTj) / T
where RiT and CTj are the row total for the ith row and the column total for the jth column.
If (where m is the smallest expected frequency):    then use:
m >= 10                                             Chi-Square Test
5 <= m < 10                                         Yates' Correction for Continuity
m < 5                                               Fisher's Exact Test
The degrees of freedom are df = (r − 1)(c − 1).
7. Use a chi-square table with α and df to determine whether the conclusion is independent of the antecedent at the selected level of significance, α.
   o If the computed chi-square value is greater than or equal to the tabulated value:
     Reject the null hypothesis of independence and accept the alternate hypothesis of dependence.
     We keep the antecedents because the conclusions are dependent upon them.
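A small sketch of this independence test on a 2×2 contingency table using scipy (assumed available); the cell counts are made up for illustration, and scipy applies Yates' correction to 2×2 tables by default:

# Sketch: chi-square test of independence between a rule antecedent and its conclusion
from scipy.stats import chi2_contingency

# rows: antecedent satisfied / not satisfied ; columns: class C1 / class C2
table = [[30, 10],
         [12, 28]]

chi2, p_value, df, expected = chi2_contingency(table)
print("chi2 =", round(chi2, 3), "df =", df, "p =", round(p_value, 4))
print("expected frequencies:", expected)

alpha = 0.05
if p_value < alpha:
    print("Reject independence: keep the antecedent, the conclusion depends on it")
else:
    print("Independence not rejected: the antecedent is a candidate for pruning")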
· Chi-Square Test
Decision Lists
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified
using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn about the different features of cats and dogs, and then we test it with this strange creature. The support vector machine creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors), so it will look at the extreme cases of cat and dog. On the basis of the support vectors, it will classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
1. Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data and the classifier used is called a Non-linear SVM classifier.
The dimension of the hyperplane depends on the number of features present in the dataset: if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum
distance between the data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have
a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We
want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Consider the below image:
As it is a 2-d space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both the classes. These points are called support vectors.
The distance between the vectors and the hyperplane is called as margin. And the goal of
SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal
hyperplane.
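A minimal sketch of a linear SVM on a two-feature dataset (scikit-learn assumed; the blob data is synthetic):

# Sketch: linear SVM finding a maximum-margin hyperplane on two blobs
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=6)

clf = SVC(kernel='linear', C=1.0).fit(X, y)

print("weights w:", clf.coef_[0])                  # orientation of the hyperplane
print("bias b:", clf.intercept_[0])                # offset of the hyperplane
print("support vectors:", clf.support_vectors_)    # the points that fix the margin
print("prediction:", clf.predict([[0.0, 3.0]]))    # classify a new point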
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z.
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-d space, it looks like a plane parallel to the x-axis. If we convert it back to 2-d space with z = 1, the boundary becomes a circle of radius 1 in the x–y plane:
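The same lifting trick can be sketched directly: adding the feature z = x² + y² makes concentric rings linearly separable, which is what a non-linear (RBF) kernel achieves implicitly (NumPy and scikit-learn assumed):

# Sketch: making circular data separable by adding the feature z = x^2 + y^2
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)    # the extra dimension
X3 = np.hstack([X, z])                              # features (x1, x2, z)

acc_2d  = SVC(kernel='linear').fit(X,  y).score(X,  y)  # poor: not separable in 2-d
acc_3d  = SVC(kernel='linear').fit(X3, y).score(X3, y)  # good: a plane separates in 3-d
acc_rbf = SVC(kernel='rbf').fit(X, y).score(X, y)       # kernel lifts the data implicitly

print(round(acc_2d, 2), round(acc_3d, 2), round(acc_rbf, 2))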
A Radial basis function is a function whose value depends only on the distance from the origin.
In effect, the function must contain only real values. Alternative forms of radial basis functions
are defined as the distance from another point denoted C, called a center.
A Radial basis function works by defining itself by the distance from its origin or center. This is done by incorporating the absolute value of the distance. Absolute values are defined as the value without its associated sign (positive or negative); for example, the absolute value of -4 is 4. Accordingly, a radial basis function is a function whose value depends only on that distance:
f(x) = f(||x - c||)
The Gaussian variation of the Radial Basis Function, often applied in Radial Basis Function Networks, is a popular choice. The formula for a Gaussian with a one-dimensional input is:
f(x) = exp(-Beta * (x - c)^2)
The Gaussian function can be plotted out with various values for Beta:
Radial basis functions make up the core of the Radial Basis Function Network, or RBFN. This
particular type of neural network is useful in cases where data may need to be classified in a
non-linear way. RBFNs work by incorporating the Radial basis function as a neuron and using
it as a way of comparing input data to training data. An input vector is processed by multiple
Radial basis function neurons, with varying weights, and the sum total of the neurons produce
a value of similarity. If input vectors match the training data, they will have a high similarity
value. Alternatively, if they do not match the training data, they will not be assigned a high
similarity value. Comparing similarity values with different classifications of data allows for
non-linear classification.
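A small sketch of the Gaussian RBF itself, used as an RBFN-style similarity score (NumPy assumed; the center and beta are illustrative values, not learned ones):

# Sketch: Gaussian radial basis function as a similarity measure
import numpy as np

def gaussian_rbf(x, center, beta=1.0):
    # value depends only on the distance ||x - center||: 1 at the center, -> 0 far away
    return np.exp(-beta * np.linalg.norm(x - center) ** 2)

center = np.array([1.0, 2.0])                        # prototype taken from training data
print(gaussian_rbf(np.array([1.0, 2.0]), center))    # 1.0: identical to the prototype
print(gaussian_rbf(np.array([1.5, 2.5]), center))    # high similarity (close)
print(gaussian_rbf(np.array([5.0, 9.0]), center))    # near 0 (far from the prototype)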
What is a classifier?
A classifier is a machine learning model that is used to discriminate different objects based
on certain features.
3.8.1 Principle of Naive Bayes Classifier:
A Naive Bayes classifier is a probabilistic machine learning model that’s used for classification tasks. The crux of the classifier is based on Bayes’ theorem.
Bayes Theorem:
P(A | B) = P(B | A) · P(A) / P(B)
Using Bayes’ theorem, we can find the probability of A happening, given that B has occurred.
Here, B is the evidence and A is the hypothesis. The assumption made here is that the
predictors/features are independent. That is presence of one particular feature does not affect
the other. Hence it is called naive. Example: Let us take an example to get a better intuition, the classic ‘play golf’ dataset. Looking at the first row of the dataset, we can observe that it is not suitable for playing golf if the outlook is
rainy, temperature is hot, humidity is high and it is not windy. We make two assumptions here,
one as stated above we consider that these predictors are independent. That is, if the
temperature is hot, it does not necessarily mean that the humidity is high. Another assumption
made here is that all the predictors have an equal effect on the outcome. That is, the day
being windy does not have more importance in deciding to play golf or not.
The variable y is the class variable(play golf), which represents if it is suitable to play golf or
not given the conditions. Variable X represent the parameters/features.
X is given as
X = (x_1, x_2, …, x_n)
Here x_1, x_2, …, x_n represent the features, i.e. they can be mapped to outlook, temperature, humidity and windy. By substituting for X and expanding using the chain rule we get
P(y | x_1, …, x_n) = P(x_1 | y) · P(x_2 | y) · … · P(x_n | y) · P(y) / ( P(x_1) · P(x_2) · … · P(x_n) )
Now, you can obtain the values for each term by looking at the dataset and substitute them into the equation. For all entries in the dataset, the denominator does not change; it remains static, so it can be removed and a proportionality introduced. In our case, the class variable (y) has only two outcomes, yes or no, but there could be cases where the classification is multiclass. Therefore, we need to find the class y with maximum probability:
y = argmax over y of P(y) · P(x_1 | y) · P(x_2 | y) · … · P(x_n | y)
Using the above function, we can obtain the class, given the predictors.
Since the way the values are presented in the dataset changes (for example, continuous features), the formula for the conditional probability changes; for continuous features a Gaussian distribution is commonly assumed (Gaussian Naive Bayes).
3.8.3 Conclusion: Naive Bayes algorithms are mostly used in sentiment analysis, spam
filtering, recommendation systems, etc. They are fast and easy to implement, but their biggest disadvantage is the requirement that the predictors be independent. In most real-life cases the predictors are dependent, and this hinders the performance of the classifier.
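A minimal sketch of a categorical Naive Bayes on a tiny ‘play golf’ style table (scikit-learn assumed; the five rows below are illustrative, not the full dataset referred to above):

# Sketch: Naive Bayes on a tiny categorical "play golf" style dataset
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# columns: outlook, temperature, humidity, windy
X_raw = [
    ['rainy',    'hot',  'high',   'false'],
    ['rainy',    'hot',  'high',   'true'],
    ['overcast', 'hot',  'high',   'false'],
    ['sunny',    'mild', 'high',   'false'],
    ['sunny',    'cool', 'normal', 'false'],
]
y = ['no', 'no', 'yes', 'yes', 'yes']

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)               # map categories to integer codes

model = CategoricalNB().fit(X, y)

query = enc.transform([['sunny', 'cool', 'high', 'true']])
print(model.predict(query))                # predicted class for the new day
print(model.predict_proba(query))          # P(no), P(yes)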
Upon the instance of burglary and fire, ‘P1’ and ‘P2’ call person ‘gfg’, respectively.
But there are a few drawbacks in this case: sometimes ‘P1’ may forget to call the person ‘gfg’ even after hearing the alarm, as he has a tendency to forget things quickly. Similarly, ‘P2’ sometimes fails to call the person ‘gfg’, as he is only able to hear the alarm from a certain distance.
Example Problem:
Q)Find the probability that ‘P1’ is true (P1 has called ‘gfg’), ‘P2’ is true (P2 has called ‘gfg’)
when the alarm ‘A’ rang, but no burglary ‘B’ and fire ‘F’ has occurred.
=> P ( P1, P2, A, ~B, ~F) [ where- P1, P2 & A are ‘true’ events and ‘~B’ & ‘~F’ are ‘false’
events]
[ Note: The values mentioned below are neither calculated nor computed; they are observed values. ]
Burglary ‘B’ –
P (B=T) = 0.001 (‘B’ is true i.e burglary has occurred)
P (B=F) = 0.999 (‘B’ is false i.e burglary has not occurred) Fire
‘F’ –
P (F=T) = 0.002 (‘F’ is true i.e fire has occurred)
P (F=F) = 0.998 (‘F’ is false i.e fire has not occurred)
Alarm ‘A’ –
B F P (A=T) P (A=F)
T T 0.95 0.05
T F 0.94 0.06
F T 0.29 0.71
F F 0.001 0.999
The alarm ‘A’ node can be ‘true’ or ‘false’ ( i.e may have rung or may not have
rung). It has two parent nodes burglary ‘B’ and fire ‘F’ which can be ‘true’ or ‘false’
(i.e may have occurred or may not have occurred) depending upon different
conditions.
Person ‘P1’ –
A P (P1=T) P (P1=F)
T 0.95 0.05
F 0.05 0.95
The person ‘P1’ node can be ‘true’ or ‘false’ (i.e may have called the person ‘gfg’
or not) . It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e may
have rung or may not have rung ,upon burglary ‘B’ or fire ‘F’).
Person ‘P2’ –
A P (P2=T) P (P2=F)
T 0.80 0.20
F 0.01 0.99
The person ‘P2’ node can be ‘true’ or false’ (i.e may have called the person ‘gfg’ or
not). It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e may have
rung or may not have rung, upon burglary ‘B’ or fire ‘F’).
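Because the joint probability factorizes over the network (each node conditioned only on its parents), the queried value follows directly from the tables above; a few lines of Python reproduce the arithmetic:

# Sketch: joint probability from the Bayesian belief network above
p_P1_given_A   = 0.95    # P(P1 = T | A = T)
p_P2_given_A   = 0.80    # P(P2 = T | A = T)
p_A_given_nBnF = 0.001   # P(A = T | B = F, F = F)
p_not_B        = 0.999   # P(B = F)
p_not_F        = 0.998   # P(F = F)

# P(P1, P2, A, ~B, ~F) = P(P1|A) * P(P2|A) * P(A|~B,~F) * P(~B) * P(~F)
joint = p_P1_given_A * p_P2_given_A * p_A_given_nBnF * p_not_B * p_not_F
print(joint)             # ~0.00076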
Category I
Implement the non-parametric Locally Weighted Regression algorithm in order to fit
data points. Select the appropriate data set for your experiment and draw graphs.
(K4,CO3)
Category II
The rules are easily interpretable and thus these classifiers are generally used to
generate descriptive models. The condition used with “if” is called the antecedent
and the predicted class of each rule is called the consequent. Properties of rule-
based classifiers. Apply rule-based classification to the dataset to classify Weather
prediction. (K4,CO3)
The rules are easily interpretable and thus these classifiers are generally used to
generate descriptive models. The condition used with “if” is called the antecedent
and the predicted class of each rule is called the consequent. Properties of rule-
based classifiers. Apply rule-based classification to the dataset to classify Health
care. (K4,CO3)
Category III
Construct a Decision tree using the attribute selection measure Information Gain. (K3,CO3)
Construct a Decision tree using the attribute selection measure Gini Index. (K3,CO3)
Category IV
Elaborate in detail about Linear Regression and hypothesis function (K2,CO3)
Category V
Elucidate in detail about Radial basis functions. (K2,CO3)
Classification: It uses algorithms to assign the test data into specific categories.
Common classification algorithms are linear classifiers, support vector machines
(SVM), decision trees, k-nearest neighbor, and random forest.
Regression: It is used to understand the relationship between dependent
and independent variables. Linear regression, logistical regression, and
polynomial regression are popular regression algorithms.
Support vector machines (SVMs) are a set of supervised learning methods used
for classification, regression and outliers detection. The objective of the support
vector machine algorithm is to find a hyperplane in an N-dimensional space(N — the
number of features) that distinctly classifies the data points.
Support vector machines focus only on the points that are the most difficult to tell
apart, whereas other classifiers pay attention to all of the points.
The intuition behind the support vector machine approach is that if a classifier is good
at the most challenging comparisons (the points in B and A that are closest to each
other), then the classifier will be even better at the easy comparisons (comparing
points in B and A that are far away from each other).
o You get a bunch of photos with information about what is on them and
then you train a model to recognize new photos.
o You have a bunch of molecules and information about which are
drugs and you train a model to answer whether a new molecule is also
a drug.
o Based on past information about spams, filtering out a new incoming
email into Inbox (normal) or Junk folder (Spam)
o Cortana or any speech automated system in your mobile phone trains
your voice and then starts working based on this training.
o Train your handwriting to OCR system and once trained, it will be able
to convert your hand-writing images into text (till some accuracy
obviously)
Similar points tend to lie close to each other; KNN hinges on this assumption being true enough for the algorithm to be useful.
There are many different ways of calculating the distance between the points,
however, the straight line distance (Euclidean distance) is a popular and familiar
choice.
The goal of any supervised machine learning algorithm is to best estimate the mapping
function (f) for the output variable (Y) given the input data (X). The mapping function
is often called the target function because it is the function that a given supervised
machine learning algorithm aims to approximate.
Bias refers to the simplifying assumptions made by a model to make the target function easier to learn. Generally, linear algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible.
Answer
Correlation measures how strongly two or more variables are related to each
other. Its values are between -1 to 1. Correlation measures both the strength
and direction of the linear relationship between two variables. Correlation is a
function of the covariance.
In a supervised logistic regression, features are mapped onto the output. The
output is usually a categorical value (which means that it is mapped with one-
hot vectors or binary numbers).
Since the logistic (sigmoid) function always outputs a value between 0 and 1, it gives the probability of the outcome.
8. What are some challenges faced when using a Supervised Regression Model?
(CO3,K2)
For example, Naive Bayes works best when the training set is large. Models
with low bias and high variance tend to perform better as they work fine with
complex relationships.
10. When Will You Use Classification over Regression? (CO3,K2)
Classification is used when your target is categorical, while regression is used
when your target variable is continuous. Both classification and regression
belong to the category of supervised machine learning algorithms.
Examples of classification problems include:
Predicting yes or no
Estimating gender
Breed of an animal
Type of color
(CO3,K3)
2. Explain in detail about Logistic regression and its types with use-cases?
(CO3,K3)
3. Elucidate the Decision tree algorithm and explain the construction of Decision tree?
(CO3,K3)
4. Explain in detail about the two attributes and types of decision tree algorithm? (CO3,K3)
5. Explain about construction of decision tree and pruning made to the branches? (CO3,K3)
6. Explain the conditions how the pruning rules sets are considered? (CO3,K3)
9. What is the major condition to be followed for selecting the optimal hyperplane? Explain.
13. Supportive online Certification courses
2. COURSERA- https://www.coursera.org/learn/machine-learning
14. REAL TIME APPLICATIONS IN DAY-TO-DAY LIFE AND
TO INDUSTRY
1   IAT 1    22-08-2024    Units 1 & 2
2   IAT 2    30-09-2024    Units 3 & 4
3   Model    26-10-2024    All 5 Units
16. PRESCRIBED TEXT BOOKS & REFERENCE BOOKS
TEXT BOOKS:
1. Saikat Dutt, Subramanian Chandramouli, Amit Kumar Das, Machine Learning,
Pearson, 2019. (Unit 1 – chap 1,2,3/ Unit 2 – Chap 4 / Unit 4 – 9 / Unit 5 –
Chap 10, 11)
2. Ethem Alpaydin, Introduction to Machine Learning, Adaptive Computation and
Machine Learning Series, Third Edition, MIT Press, 2014. (Unit 2 – Chap 6 / Unit
4 – chap 8.2.3 / Unit 5 – Chap 18)
REFERENCES:
1. Anuradha Srinivasaraghavan, Vincy Joseph, Machine Learning, First Edition,
Wiley, 2019. (Unit 3 – Chap 7,8,9,10,11 / Unit 4 – 13, 11.4, 11.5,12)
2. Peter Harrington, Machine Learning in Action, Manning Publications, 2012.
3. Stephen Marsland, Machine Learning – An Algorithmic Perspective, Second Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014.
4. Tom M Mitchell, Machine Learning, First Edition, McGraw Hill Education, 2013.
5. Christoph Molnar, “Interpretable Machine Learning - A Guide for Making Black
Box Models Explainable”, Creative Commons License, 2020.
17. MINI PROJECT SUGGESTION
Category I
Almost everyone today uses technology to stream movies and television shows. While
figuring out what to stream next can be daunting, recommendations are often made
based on a viewer’s history and preferences. This is done through machine learning
and can be a fun and easy project for beginners to take on. New programmers can
practice by coding in either Python or R languages and with data from the Movielens
Dataset. Generated by more than 6,000 users, Movielens currently includes more than
1 million movie ratings of 3,900 films.
Category II
This is one of the interesting machine learning project ideas. Although most of us use
social media platforms to convey our personal feelings and opinions for the world to
see, one of the biggest challenges lies in understanding the ‘sentiments’ behind social
media posts.
Social media is thriving with tons of user-generated content. By creating an ML system
that could analyze the sentiment behind texts, or a post, it would become so much
easier for organizations to understand consumer behaviour. This, in turn, would allow
them to improve their customer service, thereby providing the scope for optimal
consumer satisfaction.
You can try to mine the data from Twitter or Reddit to get started off with your
sentiment analyzing machine learning project. This might be one of those rare cases
of deep learning projects which can help you in other aspects as well.
Category III
Category IV
Spam email classification using Support Vector Machine
You will use an SVM to classify emails into spam or non-spam categories, and report the classification accuracy for various SVM parameters and kernel functions.
Data Set Description: An email is represented by various features like frequency of
occurrences of certain keywords, length of capitalized words etc.
A data set containing about 4601 instances is available at this link (data folder):
https://archive.ics.uci.edu/ml/datasets/Spambase.
You have to randomly pick 70% of the data set as training data and the remaining as
test data.
Assignment Tasks: In this assignment you can use any SVM package to classify the
above data set.
You should use one of the following languages: C/C++/Java/Python. You have to
study performance of the SVM algorithms.
You have to submit a report in pdf format.
The report should contain the following sections:
1. Methodology: Details of the SVM package used.
2. Experimental Results:
i. You have to use each of the following three kernel functions (a) Linear, (b)
Quadratic, (c) RBF.
ii. For each of the kernels, you have to report training and test set classification
accuracy for the best value of generalization constant C.
The best C value is the one which provides the best test set accuracy that you have
found out by trial of different values of C. Report accuracies in the form of a
comparison table, along with the values of C.
Category V