Introduction to Machine Learning
Week 1
Prof. B. Ravindran, IIT Madras
1. (2 Marks) Which of the following are supervised learning problems (Multiple Correct)?
(a) Clustering Spotify users based on their listening history
(b) Weather forecast using data collected by a satellite
(c) Predicting tuberculosis using patient’s chest X-Ray
(d) Training a humanoid to walk using a reward system
Sol. b and c
2. (2 Marks) Which of the following are regression tasks (Multiple Correct)?
(a) Predicting the outcome of an election
(b) Predicting the weight of a giraffe based on its height
(c) Predicting the emotion conveyed by a sentence
(d) Identifying abnormal data points
Sol. b
3. (2 Marks) Which of the following are classification tasks (Multiple Correct)?
(a) Predicting the outcome of an election
(b) Predicting the weight of a giraffe based on its height
(c) Predicting the emotion conveyed by a sentence
(d) Identifying abnormal data points
Sol. a and c
4. (1 Mark) Which of the two functions overfits the training data?
Sol. c
5. (1 Mark) Which of the following 2 functions will yield higher training error?
(a) Function F1
(b) Function F2
(c) Both functions F1 & F2 will have the same training error
(d) Can not be determined
Sol. a
6. (1 Mark) What does the term ‘policy’ refer to in reinforcement learning?
Sol. d
7. (1 Mark) Given the following dataset, for k = 3, use KNN regression to find the prediction for
a new data-point (2,3) (Use Euclidean distance measure for finding closest points)
X1 X2 Y
2 5 3.4
5 5 5
3 3 3
6 3 4.5
2 2 2
4 1 2.8
(a) 2.0
(b) 2.6
(c) 2.8
(d) 3.2
Sol. c: The closest k points are (2,5), (3,3) and (2,2). Their corresponding labels averaged is
(3.4 + 3 + 2) / 3 = 2.8
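As a quick sanity check, the same computation can be reproduced with a few lines of NumPy (a sketch; the arrays simply restate the table above):

import numpy as np

# Dataset from the table: columns X1, X2 and target Y
X = np.array([[2, 5], [5, 5], [3, 3], [6, 3], [2, 2], [4, 1]])
y = np.array([3.4, 5, 3, 4.5, 2, 2.8])
query = np.array([2, 3])

# Euclidean distances from the query point (2, 3)
dist = np.linalg.norm(X - query, axis=1)

# Indices of the k = 3 nearest neighbours
nearest = np.argsort(dist)[:3]

# KNN regression prediction: the average of the neighbours' labels
print(y[nearest].mean())  # 2.8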
8. (1 Mark) For any given dataset, comment on the bias of K-nearest neighbor classifiers upon increasing the value of K.
(a) The bias of the classifier decreases
(b) The bias of the classifier does not change
(c) The bias of the classifier increases
(d) Can not be determined
Sol. c
10. (1 Mark) Which of the following statements are FALSE regarding bias and variance?
Sol. a and c
Introduction to Machine Learning
Week 2
Prof. B. Ravindran, IIT Madras
1. (1 Mark) State True or False: Typically, linear regression tends to underperform compared to k-nearest neighbor algorithms when dealing with high-dimensional input spaces.
(a) True
(b) False
Sol. b
2. (2 Marks) Given the following dataset, find the uni-variate regression function that best fits
the dataset.
X Y
2 5.5
3 6.5
4 9
10 18.5
(a) f(x) = 1 × x + 4
(b) f(x) = 1 × x + 5
(c) f(x) = 1.5 × x + 3
(d) f(x) = 2 × x + 1
Soln. C
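To see why (c) fits best, one can compare the sum of squared errors of the four candidate functions on the table's points; a minimal NumPy sketch:

import numpy as np

x = np.array([2, 3, 4, 10])
y = np.array([5.5, 6.5, 9, 18.5])

# Sum of squared errors for each candidate regression function
candidates = {"(a) x + 4": x + 4,
              "(b) x + 5": x + 5,
              "(c) 1.5x + 3": 1.5 * x + 3,
              "(d) 2x + 1": 2 * x + 1}
for name, pred in candidates.items():
    print(name, ((y - pred) ** 2).sum())

# (c) yields the smallest SSE (1.5), so it is the best fit among the options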
3. (1 Mark) Given a training data set of 500 instances, with each input instance having 6 di-
mensions and each output being a scalar value, the dimensions of the design matrix used in
applying linear regression to this data is
(a) 500 × 6
(b) 500 × 7
(c) 500 × 6²
(d) None of the above
Soln. B - One column per input dimension plus a column of ones for the intercept: 500 × (6 + 1) = 500 × 7.
4. (1 Mark) Assertion A: Binary encoding is usually preferred over One-hot encoding to represent categorical data (e.g., colors, gender, etc.)
Reason R: Binary encoding is more memory efficient when compared to One-hot encoding
(d) A is false but R is true
Soln. D
5. (1 Mark) Select the TRUE statement.
(a) Subset selection methods are more likely to improve test error by only focussing on the
most important features and by reducing variance in the fit.
(b) Subset selection methods are more likely to improve train error by only focussing on the
most important features and by reducing variance in the fit.
(c) Subset selection methods are more likely to improve both test and train error by focussing
on the most important features and by reducing variance in the fit.
(d) Subset selection methods don’t help in performance gain in any way.
Sol. a
6. (1 Mark) Rank the 3 subset selection methods in terms of computational efficiency:
(a) Forward stepwise selection, best subset selection, and forward stagewise regression.
(b) Forward stepwise selection, forward stagewise regression, and best subset selection.
(c) Best subset selection, forward stagewise regression, and forward stepwise selection.
(d) Best subset selection, forward stepwise selection, and forward stagewise regression.
Sol. b
7. (1 Mark) Choose the TRUE statements from the following: (Multiple correct choice)
(a) Ridge regression, since it reduces the coefficients of all variables, makes the final fit a lot more interpretable.
(b) Lasso regression, since it doesn't deal with a squared power, is easier to optimize than ridge regression.
(c) Ridge regression has a more stable optimization than lasso regression.
(d) Lasso regression is better suited for interpretability than ridge regression.
Sol. c, d
8. (2 Marks) Which of the following statements are TRUE? Let xi be the i-th datapoint in a
dataset of N points. Let v represent the first principal component of the dataset. (Multiple
answer questions)
(a) v = arg maxᵥ Σᵢ₌₁ᴺ (vᵀ xᵢ)² s.t. ‖v‖ = 1
(b) v = arg minᵥ Σᵢ₌₁ᴺ (vᵀ xᵢ)² s.t. ‖v‖ = 1
(c) Scaling at the start of performing PCA is done just for better numerical stability and
computational benefits but plays no role in determining the final principal components
of a dataset.
(d) The resultant vectors obtained when performing PCA on a dataset can vary based on the
scale of the dataset.
Soln. A and D
Introduction to Machine Learning
Week 3
Prof. B. Ravindran, IIT Madras
1. (1 Mark) For a two-class problem using discriminant functions (δk - discriminant function for
class k), where is the separating hyperplane located?
Soln. C
2. (1 Mark) Given the following dataset consisting of two classes, A and B, calculate the prior
probability of each class.
Feature 1 Class
2.3 A
1.8 A
3.2 A
2.7 B
3.0 A
2.1 A
1.9 B
2.4 B
Soln. B - Priors from the class counts: P(A) = 5/8, P(B) = 3/8.
3. (1 Mark) In a 3-class classification problem using linear regression, the output vectors for three
data points are [0.8, 0.3, -0.1], [0.2, 0.6, 0.2], and [-0.1, 0.4, 0.7]. To which classes would these
points be assigned?
(a) 1, 2, 1
(b) 1, 2, 2
(c) 1, 3, 2
(d) 1, 2, 3
Soln. D
4. (1 Mark) If you have a 5-class classification problem and want to avoid masking using polyno-
mial regression, what is the minimum degree of the polynomial you should use?
(a) 3
(b) 4
(c) 5
(d) 6
Soln. B
5. (1 Mark) Consider a logistic regression model where the predicted probability for a given data
point is 0.4. If the actual label for this data point is 1, what is the contribution of this data
point to the log-likelihood?
(a) -1.3219
(b) -0.9163
(c) +1.3219
(d) +0.9163
Soln. B - The contribution of a point with true label 1 is ln(p̂) = ln(0.4) ≈ −0.9163.
6. (1 Mark) What additional assumption does LDA make about the covariance matrix in com-
parison to the basic assumption of Gaussian class conditional density?
(a) The covariance matrix is diagonal
(b) The covariance matrix is identity
(c) The covariance matrix is the same for all classes
(d) The covariance matrix is different for each class
Soln. C
7. (1 Mark) What is the shape of the decision boundary in LDA?
(a) Quadratic
(b) Linear
(c) Circular
(d) Can not be determined
Soln. B
8. (1 Mark) For two classes C1 and C2 with within-class variances σ_w1² = 1 and σ_w2² = 4 respectively, if the projected means are µ′₁ = 1 and µ′₂ = 3, what is the Fisher criterion J(w)?
(a) 0.5
(b) 0.8
(c) 1.25
(d) 1.5
Soln. B
S_w = σ_w1² + σ_w2² = 1 + 4 = 5; S_b = (µ′₂ − µ′₁)² = (3 − 1)² = 4; J(w) = S_b/S_w = 4/5 = 0.8
9. (2 Marks) Given two classes C1 and C2 with means µ₁ = (2, 3)ᵀ and µ₂ = (5, 7)ᵀ respectively, what is the direction vector w for LDA when the within-class covariance matrix S_w is the identity matrix I?
(a) (4, 3)ᵀ
(b) (5, 7)ᵀ
(c) (0.7, 0.7)ᵀ
(d) (0.6, 0.8)ᵀ
Soln. D
w ∝ S_w⁻¹(µ₂ − µ₁) = µ₂ − µ₁ = (3, 4)ᵀ; normalizing to unit length gives (0.6, 0.8)ᵀ.
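A quick NumPy check of this computation (a sketch; np.linalg.solve applies S_w⁻¹, which is trivial here since S_w = I):

import numpy as np

# LDA direction: w ∝ Sw^{-1} (mu2 - mu1); here Sw = I
mu1 = np.array([2.0, 3.0])
mu2 = np.array([5.0, 7.0])
Sw = np.eye(2)

w = np.linalg.solve(Sw, mu2 - mu1)  # (3, 4)
w_unit = w / np.linalg.norm(w)      # (0.6, 0.8)
print(w_unit)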
Introduction to Machine Learning
Week 4
Prof. B. Ravindran, IIT Madras
1. (1 Mark) In the context of the perceptron learning algorithm, what does the expression f(x)/‖f′(x)‖ represent?
Soln. B
2. (1 Mark) Why do we normalize by ∥β∥ (the magnitude of the weight vector) in the SVM
objective function?
(a) To ensure the margin is independent of the scale of β
(b) To minimize the computational complexity of the algorithm
(c) To prevent overfitting
(d) To ensure the bias term is always positive
Soln. A
3. (1 Mark) Which of the following is NOT one of the KKT conditions for optimization problems
with inequality constraints?
(a) Stationarity: ∇f(x∗) + Σᵢ₌₁ᵐ λᵢ ∇gᵢ(x∗) + Σⱼ₌₁ᵖ νⱼ ∇hⱼ(x∗) = 0
(b) Primal feasibility: gi (x∗ ) ≤ 0 for all i, and hj (x∗ ) = 0 for all j
(c) Dual feasibility: λi ≥ 0 for all i
(d) Convexity: The objective function f (x) must be convex
Soln. D
4. (1 Mark) Consider the following dataset:
x y
-1 1
0 -1
2 1
State true or false: The dataset becomes linearly separable after using basis expansion with the following basis function ϕ(x) = (1, x³)ᵀ
(a) True
(b) False
Soln. B
After applying basis expansion, x′₁ = (1, −1)ᵀ, x′₂ = (1, 0)ᵀ, and x′₃ = (1, 8)ᵀ. Despite the basis expansion, the dataset remains linearly inseparable.
(a) p × d
(b) (p + 1) × d
(c) C(p + d, d)
(d) pᵈ
Soln. C
6. (1 Mark) State True or False: For any given linearly separable data, for any initialization,
both SVM and Perceptron will converge to the same solution.
(a) True
(b) False
Soln. B
For Q7,8: Kindly download the modified version of Iris dataset from this link.
Available at: (https://goo.gl/vchhsd)
The dataset contains 150 points, and each input point has 4 features and belongs to one among
three classes. Use the first 100 points as the training data and the remaining 50 as test data.
In the following questions, to report accuracy, use the test dataset. You can round off the
accuracy value to the nearest 2-decimal point number. (Note: Do not change the order of data
points.)
7. (2 marks) Train a Linear perceptron classifier on the modified iris dataset. We recommend
using sklearn. Use only the first two features for your model and report the best classification
accuracy for l1 and l2 penalty terms.
Sol. (d)
The following code gives the desired result:
>>> from sklearn.linear_model import Perceptron
>>> clf = Perceptron(penalty="l1").fit(X[0:100, 0:2], Y[0:100])
>>> clf.score(X[100:, 0:2], Y[100:])
>>> clf = Perceptron(penalty="l2").fit(X[0:100, 0:2], Y[0:100])
>>> clf.score(X[100:, 0:2], Y[100:])
8. (2 marks) Train a SVM classifier on the modified iris dataset. We recommend using sklearn.
Use only the first three features. We encourage you to explore the impact of varying different
hyperparameters of the model. Specifically try different kernels and the associated hyperpa-
rameters. As part of the assignment train models with the following set of hyperparameters
RBF-kernel, gamma = 0.5, one-vs-rest classifier, no-feature-normalization.
Try C = 0.01, 1, 10. For the above set of hyperparameters, report the best classification accu-
racy.
(a) 0.98
(b) 0.88
(c) 0.99
(d) 0.92
Sol. (a)
The following code gives the desired result:
>>> from sklearn import svm
>>> clf = svm.SVC(C=1.0, kernel='rbf', decision_function_shape='ovr', gamma=0.5).fit(X[0:100, 0:3], Y[0:100])
>>> clf.score(X[100:, 0:3], Y[100:])
Introduction to Machine Learning
Week 5
Prof. B. Ravindran, IIT Madras
1. (1 Mark) Given a 3-layer neural network which takes 10 inputs, has 5 hidden units, and produces 10 outputs, how many parameters are present in this network?
(a) 115
(b) 500
(c) 25
(d) 100
Soln. A - Counting biases, (10 × 5 + 5) + (5 × 10 + 10) = 55 + 60 = 115 parameters.
2. (1 Mark) Recall the XOR (tabulated below) example from class where we did a transformation of features to make it linearly separable. Which of the following transformations can also work?
X1 X2 Y
-1 -1 -1
1 -1 1
-1 1 1
1 1 -1
3. We use several techniques to ensure the weights of the neural network are small (such as random initialization around 0 or regularisation). What conclusions can we draw if the weights of our ANN are high?
(a) Model has overfitted.
(b) It was initialized incorrectly.
(c) At least one of (a) or (b).
(d) None of the above.
Sol. (d)
High weights may accompany overfitting, but the two are not always associated.
4. (1 Mark) In a basic neural network, which of the following is generally considered a good
initialization strategy for the weights?
(a) Initialize all weights to zero
(b) Initialize all weights to a constant non-zero value (e.g., 0.5)
(c) Initialize weights randomly with small values close to zero
(d) Initialize weights with large random values (e.g., between -10 and 10)
Soln. C
5. (1 Mark) Which of the following is the primary reason for rescaling input features before
passing them to a neural network?
(a) To increase the complexity of the model
(b) To ensure all input features contribute equally to the initial learning process
(c) To reduce the number of parameters in the network
(d) To eliminate the need for activation functions
Soln. B
6. (1 Mark) In the Bayesian approach to machine learning, we often use the formula: P(θ|D) = P(D|θ)P(θ)/P(D), where θ represents the model parameters and D represents the observed data.
Which of the following correctly identifies each term in this formula?
(a) P (θ|D) is the likelihood, P (D|θ) is the posterior, P (θ) is the prior, P (D) is the evidence
(b) P (θ|D) is the posterior, P (D|θ) is the likelihood, P (θ) is the prior, P (D) is the evidence
(c) P (θ|D) is the evidence, P (D|θ) is the likelihood, P (θ) is the posterior, P (D) is the prior
(d) P (θ|D) is the prior, P (D|θ) is the evidence, P (θ) is the likelihood, P (D) is the posterior
Soln. B
7. (1 Mark) Why do we often use log-likelihood maximization instead of directly maximizing the
likelihood in statistical learning?
(a) Log-likelihood provides a different optimal solution than likelihood maximization
(b) Log-likelihood is always faster to compute than likelihood
(c) Log-likelihood turns products into sums, making computations easier and more numeri-
cally stable
(d) Log-likelihood allows us to avoid using probability altogether
Soln. C
8. (1 Mark) In machine learning, if you have an infinite amount of data, but your prior distribution
is incorrect, will you still converge to the right solution?
(a) Yes, with infinite data, the influence of the prior becomes negligible, and you will converge
to the true underlying solution.
(b) No, the incorrect prior will always affect the convergence, and you may not reach the true
solution even with infinite data.
(c) It depends on the type of model used; some models may still converge to the right solution,
while others might not.
(d) The convergence to the right solution is not influenced by the prior, as infinite data will
always lead to the correct solution regardless of the prior.
Soln. A
9. Statement: A threshold function cannot be used as an activation function for hidden layers.
Reason: Threshold functions do not introduce non-linearity.
Introduction to Machine Learning
Week 6
Prof. B. Ravindran, IIT Madras
2. (2 Marks) Consider a dataset with only one attribute (categorical). Suppose there are 8 unordered values in this attribute. How many possible combinations are needed to find the best split-point for building the decision tree classifier?
(a) 511
(b) 1023
(c) 512
(d) 127
Sol. (d)
Suppose we have q unordered values; the total number of possible splits is 2^(q−1) − 1. Thus, in our case, it is 2⁷ − 1 = 127.
3. (2 Marks) Having built a decision tree, we are using reduced error pruning to reduce the size of the tree. We select a node to collapse. For this particular node, on the left branch, there are three training data points with the following outputs: 5, 7, 9.6, and on the right branch, there are four training data points with the following outputs: 8.7, 9.8, 10.5, 11. The average value of the outputs of the data points denotes the response of a branch. The original responses for data points along the two branches (left & right respectively) were response_left and response_right, and the new response after collapsing the node is response_new. What are the values of response_left, response_right and response_new (numbers in the options are given in the same order)?
(a) 9.6, 11, 10.4
(b) 7.2, 10, 8.8
(c) 5, 10.5, 15
(d) Depends on the tree height.
Sol. (b) - response_left = (5 + 7 + 9.6)/3 = 7.2; response_right = (8.7 + 9.8 + 10.5 + 11)/4 = 10; response_new = (5 + 7 + 9.6 + 8.7 + 9.8 + 10.5 + 11)/7 = 8.8
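The three responses can be verified with a short NumPy sketch:

import numpy as np

left = np.array([5, 7, 9.6])            # outputs on the left branch
right = np.array([8.7, 9.8, 10.5, 11])  # outputs on the right branch

response_left = left.mean()                          # 7.2
response_right = right.mean()                        # 10.0
response_new = np.concatenate([left, right]).mean()  # 8.8
print(response_left, response_right, response_new)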
4. (1 Mark) Which of the following is a good strategy for reducing the variance in a decision tree?
(a) If improvement of taking any split is very small, don’t make a split. (Early Stopping)
(b) Stop splitting a leaf when the number of points is less than a set threshold K.
(c) Stop splitting all leaves in the decision tree when any one leaf has less than a set threshold
K points.
(d) None of the Above.
Sol. (b)
5. (1 Mark) Which of the following statements about multiway splits in decision trees with cate-
gorical features is correct?
(a) They always result in deeper trees compared to binary splits
(b) They always provide better interpretability than binary splits
(c) They can lead to overfitting when dealing with high-cardinality categorical features
(d) They are computationally less expensive than binary splits for all categorical features
Sol. (c)
6. (1 Mark) Which of the following statements about imputation in data preprocessing is most
accurate?
(a) Mean imputation is always the best method for handling missing numerical data
(b) Imputation should always be performed after splitting the data into training and test sets
(c) Missing data is best handled by simply removing all rows with any missing values
(d) Multiple imputation typically produces less biased estimates than single imputation meth-
ods
Sol. (d)
Which among the following split-points for feature2 would give the best split according to the
misclassification error?
(a) 186.5
(b) 188.6
(c) 189.2
(d) 198.1
Sol. (c)
Introduction to Machine Learning
Week 7
Prof. B. Ravindran, IIT Madras
(a) Precision: 0.81, Recall: 0.85, Accuracy: 0.83
(b) Precision: 0.85, Recall: 0.81, Accuracy: 0.85
(c) Precision: 0.80, Recall: 0.85, Accuracy: 0.82
(d) Precision: 0.85, Recall: 0.85, Accuracy: 0.80
Sol. (a) -
Precision = TP/(TP + FP) = 85/(85 + 20) = 85/105 ≈ 0.81
Recall = TP/(TP + FN) = 85/(85 + 15) = 85/100 = 0.85
Accuracy = (TP + TN)/(TP + TN + FP + FN) = (85 + 80)/200 = 165/200 = 0.825 ≈ 0.83
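A minimal sketch reproducing these numbers; the TP/FP/FN/TN counts are inferred from the solution above, since the confusion matrix itself is missing from the extracted question:

# Assumed confusion-matrix counts implied by the solution
TP, FP, FN, TN = 85, 20, 15, 80

precision = TP / (TP + FP)                  # 85/105 ≈ 0.81
recall = TP / (TP + FN)                     # 85/100 = 0.85
accuracy = (TP + TN) / (TP + TN + FP + FN)  # 165/200 ≈ 0.83
print(round(precision, 2), round(recall, 2), round(accuracy, 2))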
5. (1 Mark) AUC for your newly trained model is 0.5. Is your model prediction completely
random?
(a) Yes
(b) No
(c) ROC curve is needed to derive this conclusion
(d) Cannot be determined even with ROC
Sol. (c) - An AUC of 0.5 suggests that the model is either making random predictions,
predicting all instances as a single class, or making systematically incorrect predictions that
could be corrected by inverting its outputs.
6. (1 Mark) You are building a model to detect cancer. Which metric will you prefer for evaluating
your model?
(a) Accuracy
(b) Sensitivity
(c) Specificity
(d) MSE
Sol. (b) - In medical applications, false negatives are the most costly, which is what sensitivity captures.
7. (1 Mark) You have 2 binary classifiers A and B. A has accuracy=0% and B has accuracy=50%.
Which classifier is more useful?
(a) A
(b) B
(c) Both are good
(d) Cannot say
Sol. (a) - Flip the labels and get 100% accuracy!
8. (1 Mark) You have a special case where your data has 10 classes and is sorted according to
target labels. You attempt 5-fold cross validation by selecting the folds sequentially. What
can you say about your resulting model?
(a) It will have 100% accuracy.
(b) It will have 0% accuracy.
(c) It will have close to perfect accuracy.
(d) Accuracy will depend on the compute power available for training.
Sol. (b) - The training and test sets are partitioned in a way that some classes are only present in the test set. This means the classifier will never learn about these classes and therefore cannot predict them.
Introduction to Machine Learning
Week 8
Prof. B. Ravindran, IIT Madras
3. (1 Mark) In a random forest, if T (number of features considered at each split) is set equal to
P (total number of features), how does this compare to standard bagging with decision trees?
Soln. A - Random Forests differ from standard bagging by sampling a subset of features at every node.
4. (1 Mark) Multiple Correct: Consider the following graphical model, which of the following
are true about the model? (multiple options may be correct)
(a) d is independent of b when c is known
(b) a is independent of c when e is known
(c) a is independent of b when e is known
(d) a is independent of b when c is known
6. (2 Marks) A single box is randomly selected from a set of three. Two pens are then drawn
from this container. These pens happen to be blue and green colored. What is the probability
that the chosen box was Box A?
(a) 37/18
(b) 15/56
(c) 18/37
(d) 56/15
Soln. C -
Probability of choosing each box: P(A) = P(B) = P(C) = 1/3. Here the event (E) is drawing the blue and green pens from the chosen box. Therefore,
P(E | A) = (3C1 × 2C1)/6C2 = 6/15 = 2/5
P(E | B) = (2C1 × 1C1)/5C2 = 2/10 = 1/5
P(E | C) = (4C1 × 2C1)/9C2 = 8/36 = 2/9
P(A | E) = P(E | A)/[P(E | A) + P(E | B) + P(E | C)] = (2/5)/[(2/5) + (1/5) + (2/9)] = 18/37
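A quick check of the posterior with exact fractions (a sketch; the per-box likelihoods are taken from the solution above, since the figure with the box contents is missing):

from fractions import Fraction

# Likelihoods of drawing one blue and one green pen from each box
pE_A = Fraction(2, 5)
pE_B = Fraction(1, 5)
pE_C = Fraction(2, 9)

# Equal priors (1/3 each) cancel, so the posterior for Box A is:
posterior_A = pE_A / (pE_A + pE_B + pE_C)
print(posterior_A)  # 18/37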
7. (1 Mark) State True or False: The primary advantage of the tournament approach in
multiclass classification is its effectiveness even when using weak classifiers.
(a) True
(b) False
8. (1 Mark) A data scientist is using a Naive Bayes classifier to categorize emails as either “spam”
or “not spam”. The features used for classification include:
• Number of recipients (To, Cc, Bcc)
• Presence of "spam" keywords (e.g., "URGENT", "offer", "free")
• Time of day the email was sent
• Length of the email in words
Which of the following scenarios, if true, is most likely to violate the key assumptions of Naive
Bayes and potentially impact its performance?
Soln. D - This scenario violates the Naive Bayes assumption of feature independence, as the features are dependent on each other.
9. Consider the two statements:
Statement 1: Bayesian Networks are inherently structured as Directed Acyclic Graphs (DAGs).
Statement 2: Each node in a Bayesian network represents a random variable, and each edge represents conditional dependence.
Which of these are true?
(a) Both the statements are True.
(b) Statement 1 is true, and statement 2 is false.
(c) Statement 1 is false, and statement 2 is true.
(d) Both the statements are false.
Soln. A - Bayesian Networks are structured as DAGs and each node represents a random
variable, with edges indicating conditional dependencies.
Introduction to Machine Learning
Week 9
Prof. B. Ravindran, IIT Madras
1. (2 marks) In the undirected graph given below, how many terms will be there in its potential
function factorization?
(a) 7
(b) 3
(c) 5
(d) 9
(e) None of the above
[Figure: undirected graph over nodes A, B, C, D, E]
Soln. A - Conditioning on B blocks the only path A-B-C between A and C, making A and C
conditionally independent given B.
[Figure: undirected graph over nodes A, B, C, D, E, F]
3. In the undirected graph given above, which nodes are conditionally independent of each other given C? Select all that apply.
(a) A, E
(b) B, F
(c) A, D
(d) B, D
(e) None of the above
Soln. E - None of the paths between the given pairs are blocked when C is conditioned on.
4. (1 Mark) Consider the following statements about Hidden Markov Models (HMMs):
I. The "Hidden" in HMM refers to the fact that the state transition probabilities are unknown.
II. The "Markov" property means that the current state depends only on the previous state.
III. The "Hidden" aspect relates to the underlying state sequence that is not directly observable.
IV. The "Markov" in HMM indicates that the model uses matrix operations for calculations.
Which of the statements correctly describe the "Hidden" and "Markov" aspects of Hidden Markov Models?
(a) I and II
(b) I and IV
(c) II and III
(d) III and IV
5. (2 marks) For the given graphical model, what is the optimal variable elimination order when
trying to calculate P(E=e)?
A → B → C → D → E
(a) A, B, C, D
(b) D, C, B, A
(c) A, D, B, C
(d) D, A, C, B
Soln. A
P(E = e) = Σ_d Σ_c Σ_b Σ_a P(a, b, c, d, e)
P(E = e) = Σ_d Σ_c Σ_b Σ_a P(a)P(b|a)P(c|b)P(d|c)P(e|d)
P(E = e) = Σ_d P(e|d) Σ_c P(d|c) Σ_b P(c|b) Σ_a P(a)P(b|a)
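A small NumPy sketch of this elimination order on the chain, with random placeholder CPTs (for illustration only; any binary CPTs would do):

import numpy as np

rng = np.random.default_rng(0)
pA = rng.dirichlet(np.ones(2))            # P(a)
pBgA = rng.dirichlet(np.ones(2), size=2)  # P(b|a), row indexed by a
pCgB = rng.dirichlet(np.ones(2), size=2)  # P(c|b)
pDgC = rng.dirichlet(np.ones(2), size=2)  # P(d|c)
pEgD = rng.dirichlet(np.ones(2), size=2)  # P(e|d)

# Eliminate innermost-first: sum out a, then b, then c, then d
mB = pA @ pBgA  # sum_a P(a)P(b|a)  -> factor over b
mC = mB @ pCgB  # sum_b ...         -> factor over c
mD = mC @ pDgC  # sum_c ...         -> factor over d
pE = mD @ pEgD  # sum_d ...         -> P(E = e) for e = 0, 1
print(pE, pE.sum())  # a valid distribution over E (sums to 1)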
(a) I and II
(b) II and III
(c) I, II, and IV
(d) I, III, and IV
(e) II, III, and IV
7. (1 Mark) HMMs are used for finding these. Select all that apply.
(a) Probability of a given observation sequence
(b) All possible hidden state sequences given an observation sequence
(c) Most probable observation sequence given the hidden states
(d) Most probable hidden states given the observation sequence
Soln. A, D - Refer to the lectures.
Introduction to Machine Learning
Week 10
Prof. B. Ravindran, IIT Madras
(1, 1), (1, 2), (2, 1), (2, 2), (3, 3),
(8, 8), (8, 9), (9, 8), (9, 9), (10, 10)
Using DBSCAN with ϵ = 1.5 and MinPts = 3, how many core points are there in this dataset?
(a) 4
(b) 5
(c) 8
(d) 10
Soln. C - To be a core point, it needs at least 3 points (including itself) within ϵ = 1.5
distance. There are 8 core points: (1,1), (1,2), (2,1), (2,2) from first group and (8,8), (8,9),
(9,8), (9,9) from second group.
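The count can be verified with scikit-learn's DBSCAN (a sketch; note that min_samples counts the point itself, matching the question's convention):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(1, 1), (1, 2), (2, 1), (2, 2), (3, 3),
              (8, 8), (8, 9), (9, 8), (9, 9), (10, 10)])

db = DBSCAN(eps=1.5, min_samples=3).fit(X)
print(len(db.core_sample_indices_))  # 8
print(X[db.core_sample_indices_])    # the 2x2 blocks of each group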
3. (1 Mark) In BIRCH, using the number of points N, sum of points SUM, and sum of squared points SS, we can determine the centroid and radius of the combination of any two clusters A and B. How do you determine the radius of the combined cluster? (In terms of N, SUM and SS of the two clusters A and B)
Radius of a cluster is given by:
Radius = √(SS/N − (SUM/N)²)
Note: We use the following definition of radius from the BIRCH paper: "Radius is the average distance from the member points to the centroid."
(a) Radius = √(SS_A/N_A − (SUM_A/N_A)² + SS_B/N_B − (SUM_B/N_B)²)
(b) Radius = √(SS_A/N_A − (SUM_A/N_A)²) + √(SS_B/N_B − (SUM_B/N_B)²)
(c) Radius = √((SS_A + SS_B)/(N_A + N_B) − ((SUM_A + SUM_B)/(N_A + N_B))²)
(d) Radius = √(SS_A/N_A + SS_B/N_B − ((SUM_A + SUM_B)/(N_A + N_B))²)
Soln. C
CF_{A+B} = CF_A + CF_B
Therefore,
N_{A+B} = N_A + N_B
SUM_{A+B} = SUM_A + SUM_B
SS_{A+B} = SS_A + SS_B
Substitute these into the radius formula given in the question to get the formula for the combined cluster.
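A minimal 1-D sketch of option (c), checking that the merged-CF radius matches the radius computed directly from the pooled points (the data values here are made up purely for the check):

import numpy as np

def radius(N, SUM, SS):
    # BIRCH radius of a cluster from its clustering feature (N, SUM, SS)
    return np.sqrt(SS / N - (SUM / N) ** 2)

a = np.array([1.0, 3.0])  # cluster A
b = np.array([5.0, 7.0])  # cluster B

# Option (c): CF vectors add, so use N_A + N_B, SUM_A + SUM_B, SS_A + SS_B
merged = radius(len(a) + len(b), a.sum() + b.sum(),
                (a ** 2).sum() + (b ** 2).sum())

# Direct computation on the pooled points gives the same value
pooled = np.concatenate([a, b])
direct = radius(len(pooled), pooled.sum(), (pooled ** 2).sum())
print(merged, direct)  # both ≈ 2.236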
P1 P2 P3 P4 P5 P6
P1 0 3 8 9 5 4
P2 3 0 9 8 10 9
P3 8 9 0 1 6 7
P4 9 8 1 0 7 8
P5 5 10 6 7 0 2
P6 4 9 7 8 2 0
Soln. B
Step 1: Connect closest pair of points. Closest pairs are:
d(P3, P4) = 1
d(P5, P6) = 2
d(P1, P2) = 3
Step 2: Connect clusters with single link. The cluster pair to combine is bolded:
d({P1, P2}, {P3, P4}) = 8
d({P1, P2}, {P5, P6}) = 4
d({P5, P6}, {P3, P4}) = 6
6. (1 Mark) For the pairwise distance matrix given in the previous question, which of the following
shows the hierarchy of clusters created by the complete link clustering algorithm.
Soln. B
Step 1: Same as previous question
Step 2: Connect clusters with complete link. The cluster pair to combine is bolded:
d({P1, P2}, {P3, P4}) = 9
d({P1, P2}, {P5, P6}) = 10
d({P5, P6}, {P3, P4}) = 8
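Both hierarchies can be reproduced with SciPy (a sketch; linkage expects the condensed form of the distance matrix):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Pairwise distance matrix for P1..P6 from the question
D = np.array([[0, 3, 8, 9, 5, 4],
              [3, 0, 9, 8, 10, 9],
              [8, 9, 0, 1, 6, 7],
              [9, 8, 1, 0, 7, 8],
              [5, 10, 6, 7, 0, 2],
              [4, 9, 7, 8, 2, 0]], dtype=float)

d = squareform(D)  # condensed form required by linkage
print(linkage(d, method='single'))    # single-link merge order
print(linkage(d, method='complete'))  # complete-link merge order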
Introduction to Machine Learning
Week 11
Prof. B. Ravindran, IIT Madras
1. (1 Mark) What constraint must be satisfied by the mixing coefficients (πk ) in a GMM?
(a) πk > 0 ∀ k
(b) Σₖ πk = 1
(c) πk < 1 ∀ k
(d) Σₖ πk = 0
3. (1 Mark) Why might the EM algorithm for GMMs converge to a local maximum rather than
the global maximum of the likelihood function?
(a) The algorithm is not guaranteed to increase the likelihood at each iteration
(b) The likelihood function is non-convex
(c) The responsibilities are incorrectly calculated
(d) The number of components K is too small
Soln. B - Refer to the lectures
4. (1 Mark) What does soft clustering mean in GMMs?
(a) There may be samples that are outside of any cluster boundary.
(b) The updates during maximum likelihood are taken in small steps, to guarantee conver-
gence.
(c) It restricts the underlying distribution to be gaussian.
(d) Samples are assigned probabilities of belonging to a cluster.
(d) πk = 1/k
(a) EM will always return the same solution which may not be optimal
(b) EM will always return the same solution which must be optimal
(c) The solution depends on the initialization
Soln. C - Refer to the lectures
7. (1 Mark) True or False: Iterating between the E-step and M-step of EM algorithms always
converges to a local optimum of the likelihood.
(a) True
(b) False
Introduction to Machine Learning
Week 12
Prof. B. Ravindran, IIT Madras
5. (1 Mark) Imagine you’re designing a robot that needs to navigate through a maze to reach a
target. Which reward scheme would be most effective in teaching the robot to find the shortest
path?
(a) +5 for reaching the target, -1 for hitting a wall
(b) +5 for reaching the target, -0.1 for every second that passes before the robot reaches the
target.
(c) +5 for reaching the target, -0.1 for every second that passes before the robot reaches the
target, +1 for hitting a wall.
(d) -5 for reaching the target, +0.1 for every second that passes before the robot reaches the
target.
Soln. B - The +5 reward for reaching the target encourages goal achievement, while the -0.1 penalty for every second promotes finding the shortest path. Rewards for hitting walls are omitted, as the question says nothing in that regard.
For the rest of the questions, we will follow a simplistic game and see how a Reinforcement
Learning agent can learn to behave optimally in it.
This is our game:
At the start of the game, the agent is on the Start state and can choose to move left or right
at each turn. If it reaches the right end(RE), it wins and if it reaches the left end(LE), it loses.
Because we love maths so much, instead of saying the agent wins or loses, we will say that the agent gets a reward of +1 at RE and a reward of -1 at LE. Then the objective of the agent is simply to maximize the reward it obtains!
6. (1 Mark) For each state, we define a variable that will store its value. The value of the state
will help the agent determine how to behave later. First we will learn this value.
(a) 1
(b) 0.9
(c) 0.81
(d) 0
Soln. B -
V(X4) = 0.9 × max(V(X3), V(RE)) = 0.9 × max(0, +1) = 0.9
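A value-iteration sketch of this game, assuming a corridor LE–X1–X2–X3–X4–RE with discount factor 0.9 (the exact state layout comes from a figure missing in the extracted text):

import numpy as np

gamma = 0.9
V = np.zeros(6)          # states: LE, X1, X2, X3, X4, RE
V[0], V[5] = -1.0, 1.0   # terminal values stay fixed

for _ in range(100):     # iterate the update until convergence
    for s in range(1, 5):
        # moving left or right yields the discounted value of the neighbour
        V[s] = gamma * max(V[s - 1], V[s + 1])

print(V)  # V(X4) = 0.9, V(X3) = 0.81, ... decaying by 0.9 per step from RE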