
Introduction to Machine Learning

Week 1
Prof. B. Ravindran, IIT Madras

1. (2 Marks) Which of the following are supervised learning problems (Multiple Correct)?
(a) Clustering Spotify users based on their listening history
(b) Weather forecast using data collected by a satellite
(c) Predicting tuberculosis using patient’s chest X-Ray
(d) Training a humanoid to walk using a reward system
Sol. b and c
2. (2 Marks) Which of the following are regression tasks (Multiple Correct)?
(a) Predicting the outcome of an election
(b) Predicting the weight of a giraffe based on its height
(c) Predicting the emotion conveyed by a sentence
(d) Identifying abnormal data points
Sol. b
3. (2 Marks) Which of the following are classification tasks (Multiple Correct)?
(a) Predicting the outcome of an election
(b) Predicting the weight of a giraffe based on its height
(c) Predicting the emotion conveyed by a sentence
(d) Identifying abnormal data points
Sol. a and c

(Common Data for Questions 4 and 5)


Here is a 2-dimensional plot showing two functions that classify data points into two classes.
The red points belong to one class, and the green points belong to another. The dotted blue
line (F1) and dashed pink line (F2) represent the two trained functions.

4. (1 Mark) Which of the two functions overfit the training data?

(a) Both functions F1 & F2


(b) Function F1
(c) Function F2
(d) None of them

Sol. c
5. (1 Mark) Which of the two functions will yield a higher training error?
(a) Function F1
(b) Function F2
(c) Both functions F1 & F2 will have the same training error
(d) Can not be determined
Sol. a
6. (1 Mark) What does the term ‘policy’ refer to in reinforcement learning?

(a) A set of rules governing the environment


(b) The reward function
(c) The initial state of the environment
(d) The strategy the agent follows to choose actions

Sol. d
7. (1 Mark) Given the following dataset, for k = 3, use KNN regression to find the prediction for
a new data-point (2,3) (Use Euclidean distance measure for finding closest points)
X1 X2 Y
2 5 3.4
5 5 5
3 3 3
6 3 4.5
2 2 2
4 1 2.8

(a) 2.0
(b) 2.6
(c) 2.8
(d) 3.2

Sol. c: The closest k points are (2,5), (3,3) and (2,2). The average of their corresponding labels is
(3.4 + 3 + 2) / 3 = 2.8
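As a quick cross-check (a sketch, not part of the original solution; scikit-learn is assumed and the variable names are illustrative), the same prediction can be reproduced with KNeighborsRegressor:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Dataset from the question: columns X1, X2 and target Y
X = np.array([[2, 5], [5, 5], [3, 3], [6, 3], [2, 2], [4, 1]])
y = np.array([3.4, 5, 3, 4.5, 2, 2.8])

# k = 3 with Euclidean distance (the default metric)
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 3]]))  # -> [2.8]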

8. (1 Mark) For any given dataset, comment on the bias of K-nearest classifiers upon increasing
the value of K.

(a) The bias of the classifier decreases
(b) The bias of the classifier does not change
(c) The bias of the classifier increases
(d) Can not be determined

Sol. c: Refer to lecture


9. (1 Mark) Bias and variance are given by:

(a) E[f̂(x)] − f(x),  E[(E[f̂(x)] − f̂(x))²]

(b) E[f̂(x)] − f(x),  (E[E[f̂(x)] − f̂(x)])²

(c) (E[f̂(x)] − f(x))²,  E[(E[f̂(x)] − f̂(x))²]

(d) (E[f̂(x)] − f(x))²,  (E[E[f̂(x)] − f̂(x)])²


Sol. a
10. (1 Mark) Which of the following statements are FALSE regarding bias and variance?

(a) Models which overfit have a high bias


(b) Models which overfit have a low bias
(c) Models which underfit have a high variance
(d) Models which underfit have a low variance

Sol. a and c

Introduction to Machine Learning
Week 2
Prof. B. Ravindran, IIT Madras

1. (1 Mark) State True or False: Typically, linear regression tends to underperform compared
to k-nearest neighbor algorithms when dealing with high-dimensional input spaces.
(a) True
(b) False
Sol. b
2. (2 Marks) Given the following dataset, find the uni-variate regression function that best fits
the dataset.

X Y
2 5.5
3 6.5
4 9
10 18.5

(a) f (x) = 1 × x + 4
(b) f (x) = 1 × x + 5
(c) f (x) = 1.5 × x + 3
(d) f (x) = 2 × x + 1

Soln. C
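As a quick check (a sketch; numpy assumed), comparing the sum of squared errors of each candidate line shows that option (c) fits the data best:

import numpy as np

X = np.array([2, 3, 4, 10])
Y = np.array([5.5, 6.5, 9, 18.5])

candidates = {"(a) x + 4": (1, 4), "(b) x + 5": (1, 5),
              "(c) 1.5x + 3": (1.5, 3), "(d) 2x + 1": (2, 1)}

for name, (m, c) in candidates.items():
    sse = np.sum((Y - (m * X + c)) ** 2)
    print(name, sse)  # option (c) gives the smallest squared error (1.5)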
3. (1 Mark) Given a training data set of 500 instances, with each input instance having 6 di-
mensions and each output being a scalar value, the dimensions of the design matrix used in
applying linear regression to this data are

(a) 500 × 6
(b) 500 × 7
(c) 500 × 62
(d) None of the above

Soln. B
4. (1 Mark) Assertion A: Binary encoding is usually preferred over One-hot encoding to repre-
sent categorical data (e.g., colors, gender, etc.)
Reason R: Binary encoding is more memory efficient when compared to One-hot encoding

(a) Both A and R are true and R is the correct explanation of A


(b) Both A and R are true but R is not the correct explanation of A
(c) A is true but R is false

(d) A is false but R is true

Soln. D
5. (1 Mark) Select the TRUE statement.
(a) Subset selection methods are more likely to improve test error by only focussing on the
most important features and by reducing variance in the fit.
(b) Subset selection methods are more likely to improve train error by only focussing on the
most important features and by reducing variance in the fit.
(c) Subset selection methods are more likely to improve both test and train error by focussing
on the most important features and by reducing variance in the fit.
(d) Subset selection methods don’t help in performance gain in any way.
Sol. a
6. (1 Mark) Rank the 3 subset selection methods in terms of computational efficiency:
(a) Forward stepwise selection, best subset selection, and forward stagewise regression.
(b) forward stepwise selection, forward stagewise regression and best subset selection.
(c) Best subset selection, forward stagewise regression and forward stepwise selection.
(d) Best subset selection, forward stepwise selection and forward stagewise regression.
Sol. b
7. (1 Mark) Choose the TRUE statements from the following: (Multiple correct choice)
(a) Ridge regression, since it reduces the coefficients of all variables, makes the final fit a lot
more interpretable.
(b) Lasso regression, since it doesn't deal with a squared power, is easier to optimize than
ridge regression.
(c) Ridge regression has a more stable optimization than lasso regression.
(d) Lasso regression is better suited for interpretability than ridge regression.
Sol. c, d
8. (2 Marks) Which of the following statements are TRUE? Let xi be the i-th datapoint in a
dataset of N points. Let v represent the first principal component of the dataset. (Multiple
answer questions)
(a) v = arg max Σ_{i=1}^{N} (vᵀxᵢ)²  s.t. |v| = 1
(b) v = arg min Σ_{i=1}^{N} (vᵀxᵢ)²  s.t. |v| = 1
(c) Scaling at the start of performing PCA is done just for better numerical stability and
computational benefits but plays no role in determining the final principal components
of a dataset.
(d) The resultant vectors obtained when performing PCA on a dataset can vary based on the
scale of the dataset.
Soln. A and D

Introduction to Machine Learning
Week 3
Prof. B. Ravindran, IIT Madras

1. (1 Mark) For a two-class problem using discriminant functions (δk - discriminant function for
class k), where is the separating hyperplane located?

(a) Where δ1 > δ2


(b) Where δ1 < δ2
(c) Where δ1 = δ2
(d) Where δ1 + δ2 = 1

Soln. C

2. (1 Mark) Given the following dataset consisting of two classes, A and B, calculate the prior
probability of each class.

Feature 1 Class
2.3 A
1.8 A
3.2 A
2.7 B
3.0 A
2.1 A
1.9 B
2.4 B

What are the prior probabilities of class A and class B?

(a) P (A) = 0.5, P (B) = 0.5


(b) P (A) = 0.625, P (B) = 0.375
(c) P (A) = 0.375, P (B) = 0.625
(d) P (A) = 0.6, P (B) = 0.4

Soln. B - P(A) = 5/8 = 0.625 and P(B) = 3/8 = 0.375 (5 of the 8 rows belong to class A)
3. (1 Mark) In a 3-class classification problem using linear regression, the output vectors for three
data points are [0.8, 0.3, -0.1], [0.2, 0.6, 0.2], and [-0.1, 0.4, 0.7]. To which classes would these
points be assigned?

(a) 1, 2, 1
(b) 1, 2, 2
(c) 1, 3, 2
(d) 1, 2, 3

Soln. D

4. (1 Mark) If you have a 5-class classification problem and want to avoid masking using polyno-
mial regression, what is the minimum degree of the polynomial you should use?
(a) 3
(b) 4
(c) 5
(d) 6
Soln. B
5. (1 Mark) Consider a logistic regression model where the predicted probability for a given data
point is 0.4. If the actual label for this data point is 1, what is the contribution of this data
point to the log-likelihood?
(a) -1.3219
(b) -0.9163
(c) +1.3219
(d) +0.9163
Soln. B - Since the true label is 1, the contribution is log(0.4) ≈ −0.9163
6. (1 Mark) What additional assumption does LDA make about the covariance matrix in com-
parison to the basic assumption of Gaussian class conditional density?
(a) The covariance matrix is diagonal
(b) The covariance matrix is identity
(c) The covariance matrix is the same for all classes
(d) The covariance matrix is different for each class
Soln. C
7. (1 Mark) What is the shape of the decision boundary in LDA?
(a) Quadratic
(b) Linear
(c) Circular
(d) Can not be determined
Soln. B
8. (1 Mark) For two classes C1 and C2 with within-class variances σw1² = 1 and σw2² = 4 respectively,
if the projected means are µ′1 = 1 and µ′2 = 3, what is the Fisher criterion J(w)?

(a) 0.5
(b) 0.8
(c) 1.25

(d) 1.5

Soln. B
Sw = σw1² + σw2² = 1 + 4 = 5, Sb = (µ′2 − µ′1)² = (3 − 1)² = 4, J(w) = Sb/Sw = 4/5 = 0.8
   
9. (2 Marks) Given two classes C1 and C2 with means µ1 = (2, 3)ᵀ and µ2 = (5, 7)ᵀ respectively, what
is the direction vector w for LDA when the within-class covariance matrix Sw is the identity
matrix I?

(a) (4, 3)ᵀ
(b) (5, 7)ᵀ
(c) (0.7, 0.7)ᵀ
(d) (0.6, 0.8)ᵀ
Soln. D
w ∝ Sw⁻¹(µ2 − µ1) = µ2 − µ1 = (3, 4)ᵀ, which normalizes to (0.6, 0.8)ᵀ

Introduction to Machine Learning
Week 4
Prof. B. Ravindran, IIT Madras

1. (1 Mark) In the context of the perceptron learning algorithm, what does the expression f(x)/∥f′(x)∥
represent?

(a) The gradient of the hyperplane


(b) The signed distance to the hyperplane
(c) The normal vector to the hyperplane
(d) The misclassification error

Soln. B

2. (1 Mark) Why do we normalize by ∥β∥ (the magnitude of the weight vector) in the SVM
objective function?
(a) To ensure the margin is independent of the scale of β
(b) To minimize the computational complexity of the algorithm
(c) To prevent overfitting
(d) To ensure the bias term is always positive
Soln. A

3. (1 Mark) Which of the following is NOT one of the KKT conditions for optimization problems
with inequality constraints?
(a) Stationarity: ∇f(x∗) + Σ_{i=1}^{m} λ_i ∇g_i(x∗) + Σ_{j=1}^{p} ν_j ∇h_j(x∗) = 0
(b) Primal feasibility: gi (x∗ ) ≤ 0 for all i, and hj (x∗ ) = 0 for all j
(c) Dual feasibility: λi ≥ 0 for all i
(d) Convexity: The objective function f (x) must be convex
Soln. D

4. (1 Mark) Consider the 1 dimensional dataset:

x y
-1 1
0 -1
2 1

(Note: x is the feature and y is the output)

State true or false: The dataset becomes linearly separable after using basis expansion with
the following basis function ϕ(x) = (1, x³)ᵀ

(a) True
(b) False

Soln. B
     
After applying basis expansion, x′1 = (1, −1)ᵀ, x′2 = (1, 0)ᵀ, and x′3 = (1, 8)ᵀ. Despite the basis
expansion, it still remains linearly inseparable.

5. (1 Mark) Consider a polynomial kernel of degree d operating on p-dimensional input vectors.


What is the dimension of the feature space induced by this kernel?

(a) p × d
(b) (p + 1) × d
(c) \binom{p+d}{d}
(d) p^d

Soln. C

6. (1 Mark) State True or False: For any given linearly separable data, for any initialization,
both SVM and Perceptron will converge to the same solution.

(a) True
(b) False

Soln. B
For Q7,8: Kindly download the modified version of Iris dataset from this link.
Available at: https://goo.gl/vchhsd
The dataset contains 150 points, and each input point has 4 features and belongs to one among
three classes. Use the first 100 points as the training data and the remaining 50 as test data.
In the following questions, to report accuracy, use the test dataset. You can round off the
accuracy value to the nearest 2-decimal point number. (Note: Do not change the order of data
points.)
7. (2 marks) Train a Linear perceptron classifier on the modified iris dataset. We recommend
using sklearn. Use only the first two features for your model and report the best classification
accuracy for l1 and l2 penalty terms.

(a) 0.91, 0.64


(b) 0.88, 0.71
(c) 0.71, 0.65
(d) 0.78, 0.64

Sol. (d)
The following code will give the desired result:
>>from sklearn.linear_model import Perceptron
>>clf = Perceptron(penalty="l1").fit(X[0:100, 0:2], Y[0:100])
>>clf.score(X[100:, 0:2], Y[100:])
>>clf = Perceptron(penalty="l2").fit(X[0:100, 0:2], Y[0:100])
>>clf.score(X[100:, 0:2], Y[100:])

8. (2 marks) Train a SVM classifier on the modified iris dataset. We recommend using sklearn.
Use only the first three features. We encourage you to explore the impact of varying different
hyperparameters of the model. Specifically try different kernels and the associated hyperpa-
rameters. As part of the assignment train models with the following set of hyperparameters
RBF-kernel, gamma = 0.5, one-vs-rest classifier, no-feature-normalization.
Try C = 0.01, 1, 10. For the above set of hyperparameters, report the best classification accu-
racy.
(a) 0.98
(b) 0.88
(c) 0.99
(d) 0.92
Sol. (a)
The following code will give the desired result:
>>from sklearn import svm
>>clf = svm.SVC(C=1.0, kernel='rbf', decision_function_shape='ovr', gamma=0.5).fit(X[0:100, 0:3], Y[0:100])
>>clf.score(X[100:, 0:3], Y[100:])

Introduction to Machine Learning
Week 5
Prof. B. Ravindran, IIT Madras

1. (1 Mark) Given a 3 layer neural network which takes in 10 inputs, has 5 hidden units and
outputs 10 outputs, how many parameters are present in this network?

(a) 115
(b) 500
(c) 25
(d) 100

Soln. A - (10 × 5 + 5) + (5 × 10 + 10) = 55 + 60 = 115 parameters (weights plus biases)

2. (1 Mark) Recall the XOR(tabulated below) example from class where we did a transformation
of features to make it linearly separable. Which of the following transformations can also
work?

X1 X2 Y
-1 -1 -1
1 -1 1
-1 1 1
1 1 -1

(a) Rotating x1 and x2 by a fixed angle.


(b) Adding a third dimension z = x1 · x2
(c) Adding a third dimension z = x1² + x2²
(d) None of the above
Sol. (b)

3. We use several techniques to ensure the weights of the neural network are small (such as
random initialization around 0 or regularisation). What conclusions can we draw if weights of
our ANN are high?
(a) Model has overfitted.
(b) It was initialized incorrectly.
(c) At least one of (a) or (b).
(d) None of the above.
Sol. (d)
Overfitting may be because of high weights but the two are not always associated.

4. (1 Mark) In a basic neural network, which of the following is generally considered a good
initialization strategy for the weights?
(a) Initialize all weights to zero
(b) Initialize all weights to a constant non-zero value (e.g., 0.5)
(c) Initialize weights randomly with small values close to zero
(d) Initialize weights with large random values (e.g., between -10 and 10)
Soln. C
5. (1 Mark) Which of the following is the primary reason for rescaling input features before
passing them to a neural network?
(a) To increase the complexity of the model
(b) To ensure all input features contribute equally to the initial learning process
(c) To reduce the number of parameters in the network
(d) To eliminate the need for activation functions
Soln. B
6. (1 Mark) In the Bayesian approach to machine learning, we often use the formula:
P(θ|D) = P(D|θ)P(θ) / P(D), where θ represents the model parameters and D represents the observed data.
Which of the following correctly identifies each term in this formula?
(a) P (θ|D) is the likelihood, P (D|θ) is the posterior, P (θ) is the prior, P (D) is the evidence
(b) P (θ|D) is the posterior, P (D|θ) is the likelihood, P (θ) is the prior, P (D) is the evidence
(c) P (θ|D) is the evidence, P (D|θ) is the likelihood, P (θ) is the posterior, P (D) is the prior
(d) P (θ|D) is the prior, P (D|θ) is the evidence, P (θ) is the likelihood, P (D) is the posterior
Soln. B
7. (1 Mark) Why do we often use log-likelihood maximization instead of directly maximizing the
likelihood in statistical learning?
(a) Log-likelihood provides a different optimal solution than likelihood maximization
(b) Log-likelihood is always faster to compute than likelihood
(c) Log-likelihood turns products into sums, making computations easier and more numeri-
cally stable
(d) Log-likelihood allows us to avoid using probability altogether
Soln. C
8. (1 Mark) In machine learning, if you have an infinite amount of data, but your prior distribution
is incorrect, will you still converge to the right solution?

(a) Yes, with infinite data, the influence of the prior becomes negligible, and you will converge
to the true underlying solution.
(b) No, the incorrect prior will always affect the convergence, and you may not reach the true
solution even with infinite data.

(c) It depends on the type of model used; some models may still converge to the right solution,
while others might not.
(d) The convergence to the right solution is not influenced by the prior, as infinite data will
always lead to the correct solution regardless of the prior.

Soln. A
9. Statement: Threshold function cannot be used as activation function for hidden layers.
Reason: Threshold functions do not introduce non-linearity.

(a) Statement is true and reason is false.


(b) Statement is false and reason is true.
(c) Both are true and the reason explains the statement.
(d) Both are true and the reason does not explain the statement.
Sol. (a)
The reason is that threshold function is non-differentiable so we will not be able to calculate
gradient for backpropagation.

10. Choose the correct statement (multiple may be correct):

(a) MLE is a special case of MAP when prior is a uniform distribution.


(b) MLE acts as regularisation for MAP.
(c) MLE is a special case of MAP when the prior is a beta distribution.
(d) MAP acts as regularisation for MLE.

Sol. (a), (d)


Ref. lecture

Introduction to Machine Learning
Week 6
Prof. B. Ravindran, IIT Madras

1. (1 Mark) Entropy for a 90-10 split between two classes is:


(a) 0.469
(b) 0.195
(c) 0.204
(d) None of the above
Sol. (a)
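A quick numerical check of this entropy (a sketch using Python's math module):

from math import log2

p = 0.9
entropy = -p * log2(p) - (1 - p) * log2(1 - p)
print(round(entropy, 3))  # 0.469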

2. (2 Mark) Consider a dataset with only one attribute(categorical). Suppose, there are 8 un-
ordered values in this attribute, how many possible combinations are needed to find the best
split-point for building the decision tree classifier?
(a) 511
(b) 1023
(c) 512
(d) 127
Sol. (d)
Suppose we have q unordered values; the total possible splits would be 2^(q−1) − 1. Thus, in our
case, it will be 2^7 − 1 = 127.
3. (2 mark) Having built a decision tree, we are using reduced error pruning to reduce the size of
the tree. We select a node to collapse. For this particular node, on the left branch, there are
three training data points with the following outputs: 5, 7, 9.6, and for the right branch, there
are four training data points with the following outputs: 8.7, 9.8, 10.5, 11. The average value
of the outputs of data points denotes the response of a branch. The original responses for data
points along the two branches (left & right respectively) were response_left and response_right,
and the new response after collapsing the node is response_new. What are the values for
response_left, response_right and response_new (numbers in the options are given in the same
order)?
(a) 9.6, 11, 10.4
(b) 7.2, 10, 8.8
(c) 5, 10.5, 15
(d) Depends on the tree height.
Sol. (b)

4. (1 Mark) Which of the following is a good strategy for reducing the variance in a decision tree?

(a) If improvement of taking any split is very small, don’t make a split. (Early Stopping)
(b) Stop splitting a leaf when the number of points is less than a set threshold K.
(c) Stop splitting all leaves in the decision tree when any one leaf has less than a set threshold
K points.
(d) None of the Above.

Sol. (b)

5. (1 Mark) Which of the following statements about multiway splits in decision trees with cate-
gorical features is correct?
(a) They always result in deeper trees compared to binary splits
(b) They always provide better interpretability than binary splits
(c) They can lead to overfitting when dealing with high-cardinality categorical features
(d) They are computationally less expensive than binary splits for all categorical features
Sol. (c)

6. (1 Mark) Which of the following statements about imputation in data preprocessing is most
accurate?
(a) Mean imputation is always the best method for handling missing numerical data
(b) Imputation should always be performed after splitting the data into training and test sets
(c) Missing data is best handled by simply removing all rows with any missing values
(d) Multiple imputation typically produces less biased estimates than single imputation meth-
ods
Sol. (d)

7. (2 Marks) Consider the following dataset:

feature1 feature2 output


18.3 187.6 a
14.7 184.9 a
19.4 193.3 a
17.9 180.5 a
19.1 189.1 a
17.6 191.9 b
19.9 190.2 b
17.3 198.6 b
18.7 182.6 b
15.2 187.3 b

Which among the following split-points for feature2 would give the best split according to the
misclassification error?
(a) 186.5

(b) 188.6
(c) 189.2
(d) 198.1
Sol. (c)
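The candidate split points can be compared directly with a short sketch (numpy assumed; the left branch is taken as feature2 <= split):

import numpy as np

feature2 = np.array([187.6, 184.9, 193.3, 180.5, 189.1,
                     191.9, 190.2, 198.6, 182.6, 187.3])
label = np.array(list("aaaaabbbbb"))

def misclassification_error(split):
    errors = 0
    for side in (feature2 <= split, feature2 > split):
        _, counts = np.unique(label[side], return_counts=True)
        errors += counts.sum() - counts.max()  # points not in the branch's majority class
    return errors / len(label)

for s in (186.5, 188.6, 189.2, 198.1):
    print(s, misclassification_error(s))  # 189.2 gives the lowest error (0.3)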

Introduction to Machine Learning
Week 7
Prof. B. Ravindran, IIT Madras

1. (1 Mark) Define active learning:


(a) A learning approach where the algorithm passively receives all training data at once
(b) A technique where the model learns from its own predictions without human intervention
(c) An iterative learning process where the model selects the most informative data points
for labeling
(d) A method where the model randomly selects data points for training to reduce bias
Sol. (c) - Refer to the lectures
2. (2 Mark) Given 100 distinct data points, if you sample 100 times with replacement, what is
the expected number of distinct points you will obtain?
(a) Approximately 50
(b) Approximately 63
(c) Exactly 100
(d) Approximately 37
Sol. (b) -
Probability of not selecting a given point in 100 tries = (99/100)^100
Probability of selecting a given point at least once in 100 tries = 1 − (99/100)^100
Expectation = Σ x × P(x) = 100 × (1 − (99/100)^100) ≈ 63
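The closed form can also be checked with a quick Monte Carlo sketch (numpy assumed; the seed and number of trials are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 10_000

# For each trial, draw n indices with replacement and count how many are distinct
distinct = [len(np.unique(rng.integers(0, n, size=n))) for _ in range(trials)]
print(np.mean(distinct))           # about 63.4
print(n * (1 - (1 - 1 / n) ** n))  # closed form, about 63.4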
3. (1 Mark) What is the key difference between bootstrapping and cross-validation?
(a) Bootstrapping uses the entire dataset for training, while cross-validation splits the data
into subsets
(b) Cross-validation allows replacement, while bootstrapping does not
(c) Bootstrapping creates multiple samples with replacement, while cross-validation creates
subsets without replacement
(d) Cross-validation is used for model selection, while bootstrapping is only used for uncer-
tainty estimation
Sol. (c) - Refer to the lectures
4. (2 Marks) Consider the following confusion matrix for a binary classification problem:

Predicted Positive Predicted Negative


Actual Positive 85 15
Actual Negative 20 80

What are the precision, recall, and accuracy of this classifier?

(a) Precision: 0.81, Recall: 0.85, Accuracy: 0.83
(b) Precision: 0.85, Recall: 0.81, Accuracy: 0.85
(c) Precision: 0.80, Recall: 0.85, Accuracy: 0.82
(d) Precision: 0.85, Recall: 0.85, Accuracy: 0.80

Sol. (a) -

Precision = TP / (TP + FP) = 85 / (85 + 20) = 85/105 ≈ 0.81
Recall = TP / (TP + FN) = 85 / (85 + 15) = 85/100 = 0.85
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (85 + 80) / 200 = 165/200 = 0.825 ≈ 0.83
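The same numbers fall out of a few lines of Python (a sketch of the arithmetic above):

tp, fn, fp, tn = 85, 15, 20, 80

precision = tp / (tp + fp)                  # 85/105, about 0.81
recall = tp / (tp + fn)                     # 85/100 = 0.85
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 165/200 = 0.825
print(precision, recall, accuracy)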

5. (1 Mark) AUC for your newly trained model is 0.5. Is your model prediction completely
random?
(a) Yes
(b) No
(c) ROC curve is needed to derive this conclusion
(d) Cannot be determined even with ROC
Sol. (c) - An AUC of 0.5 suggests that the model is either making random predictions,
predicting all instances as a single class, or making systematically incorrect predictions that
could be corrected by inverting its outputs.
6. (1 Mark) You are building a model to detect cancer. Which metric will you prefer for evaluating
your model?
(a) Accuracy
(b) Sensitivity
(c) Specificity
(d) MSE
Sol. (b) - In medical applications, missing a true case (a false negative) is the most costly error, and sensitivity (recall) measures how well false negatives are avoided

7. (1 Mark) You have 2 binary classifiers A and B. A has accuracy=0% and B has accuracy=50%.
Which classifier is more useful?
(a) A
(b) B
(c) Both are good
(d) Cannot say
Sol. (a) - Flip the labels and get 100% accuracy!

8. (1 Mark) You have a special case where your data has 10 classes and is sorted according to
target labels. You attempt 5-fold cross validation by selecting the folds sequentially. What
can you say about your resulting model?

(a) It will have 100% accuracy.
(b) It will have 0% accuracy.
(c) It will have close to perfect accuracy.
(d) Accuracy will depend on the compute power available for training.
Sol. (b) - The training and test sets are partitioned in a way that some classes are only
present in the test set. This means the classifier will never learn about these classes and
therefore cannot predict them

Introduction to Machine Learning
Week 8
Prof. B. Ravindran, IIT Madras

1. (1 Mark) In Bagging technique, the reduction of variance is maximum if:


(a) The correlation between the classifiers is minimum
(b) Does not depend on the correlation between the classifiers
(c) Similar features are used in all classifiers
(d) The number of classifiers in the ensemble is minimized
Soln. A - This ensures diverse predictions that effectively average out errors
2. (1 Mark) If using squared error loss in gradient boosting for a regression problem, what does
the gradient correspond to?
(a) The absolute error
(b) The log-likelihood
(c) The residual error
(d) The exponential loss
Soln. C - ∇(y − f(x; w))² = −2(y − f(x; w)) ∇f(x; w), i.e., the gradient is proportional to the residual y − f(x; w)

3. (1 Mark) In a random forest, if T (number of features considered at each split) is set equal to
P (total number of features), how does this compare to standard bagging with decision trees?

(a) It’s exactly the same as standard bagging


(b) It will always perform better than standard bagging
(c) It will always perform worse than standard bagging
(d) Can not be determined

Soln. A - Random Forests differ from standard bagging only by sampling a subset of features at
every node; when T = P that subset is the full feature set, so the two procedures coincide

4. (1 Mark) Multiple Correct: Consider the following graphical model, which of the following
are true about the model? (multiple options may be correct)

(a) d is independent of b when c is known
(b) a is independent of c when e is known
(c) a is independent of b when e is known
(d) a is independent of b when c is known

Soln. A, D - Refer to the lectures


5. (1 Mark) Consider the Bayesian network given in the previous question. Let “a”, “b”, “c”,
“d” and “e” denote the random variables shown in the network. Which of the following can
be inferred from the network structure?
(a) “a” causes “d”
(b) “e” causes “d”
(c) Both (a) and (b) are correct
(d) None of the above
Soln. D - Node “d” is dependent on both “a” and “c”, and “e” cannot cause “d”

6. (2 Marks) A single box is randomly selected from a set of three. Two pens are then drawn
from this container. These pens happen to be blue and green colored. What is the probability
that the chosen box was Box A?

Box Green Blue Yellow


A 3 2 1
B 2 1 2
C 4 2 3

(a) 37/18
(b) 15/56

(c) 18/37
(d) 56/15

Soln. C -
Probability of choosing each box: P(A) = P(B) = P(C) = 1/3. Here the event (E) is drawing one
green and one blue pen from the chosen box. Therefore,
P(E | A) = (C(3,1) × C(2,1)) / C(6,2) = 6/15 = 2/5
P(E | B) = (C(2,1) × C(1,1)) / C(5,2) = 2/10 = 1/5
P(E | C) = (C(4,1) × C(2,1)) / C(9,2) = 8/36 = 2/9
Since the priors are equal,
P(A | E) = P(E | A) / [P(E | A) + P(E | B) + P(E | C)] = (2/5) / [(2/5) + (1/5) + (2/9)] = 18/37
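The posterior can be verified with exact rational arithmetic (a sketch; Python's fractions module and math.comb are assumed):

from fractions import Fraction
from math import comb

boxes = {"A": (3, 2, 1), "B": (2, 1, 2), "C": (4, 2, 3)}  # (green, blue, yellow)
prior = Fraction(1, 3)

def p_green_and_blue(green, blue, yellow):
    total = green + blue + yellow
    return Fraction(green * blue, comb(total, 2))  # one green and one blue in two draws

likelihood = {box: p_green_and_blue(*pens) for box, pens in boxes.items()}
evidence = sum(prior * l for l in likelihood.values())
print(prior * likelihood["A"] / evidence)  # 18/37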

7. (1 Mark) State True or False: The primary advantage of the tournament approach in
multiclass classification is its effectiveness even when using weak classifiers.

(a) True
(b) False

Soln. B - Refer to the lectures

8. (1 Mark) A data scientist is using a Naive Bayes classifier to categorize emails as either “spam”
or “not spam”. The features used for classification include:
• Number of recipients (To, Cc, Bcc)
• Presence of “spam” keywords (e.g., ”URGENT”, ”offer”, ”free”)
• Time of day the email was sent
• Length of the email in words
Which of the following scenarios, if true, is most likely to violate the key assumptions of Naive
Bayes and potentially impact its performance?

(a) The length of the email follows a non-Gaussian distribution


(b) The time of day is discretized into categories (morning, afternoon, evening, night)
(c) The proportion of spam emails in the training data is lower than in real-world email traffic
(d) There’s a strong correlation between the presence of the word ”free” and the length of
the email

Soln. D - This scenario violates the Naive Bayes assumption of feature independence, as the
features are dependent on each other.

9. Consider the two statements:
Statement 1: Bayesian Networks are inherently structured as Directed Acyclic Graphs (DAGs).
Statement 2: Each node in a bayesian network represents a random variable, and each edge
represents conditional dependence.
Which of these are true?
(a) Both the statements are True.
(b) Statement 1 is true, and statement 2 is false.
(c) Statement 1 is false, and statement 2 is true.
(d) Both the statements are false.
Soln. A - Bayesian Networks are structured as DAGs and each node represents a random
variable, with edges indicating conditional dependencies.

Introduction to Machine Learning
Week 9
Prof. B. Ravindran, IIT Madras

1. (2 marks) In the undirected graph given below, how many terms will be there in its potential
function factorization?

(a) 7
(b) 3
(c) 5
(d) 9
(e) None of the above

Soln. B - Three Cliques {A, B, C, D}, {A, D, E}, {B, C, F}


2. (2 Mark) Consider the following directed graph:

[Figure: directed graph over the nodes A, B, C, D and E]

Based on the d-separation rules, which of the following statements is true?


(a) A and C are conditionally independent given B
(b) A and E are conditionally independent given D
(c) B and E are conditionally dependent given C
(d) A and C are conditionally dependent given D and E

Soln. A - Conditioning on B blocks the only path A-B-C between A and C, making A and C
conditionally independent given B.

3. (1 Mark) Consider the following undirected graph:

[Figure: undirected graph]

In the undirected graph given above, which nodes are conditionally independent of each other
given C? Select all that apply.

(a) A, E
(b) B, F
(c) A, D
(d) B, D
(e) None of the above

Soln. E - None of the paths between the given pairs are blocked when C is conditioned on.

4. (1 Marks) Consider the following statements about Hidden Markov Models (HMMs):

I. The ”Hidden” in HMM refers to the fact that the state transition probabilities are un-
known.
II. The ”Markov” property means that the current state depends only on the previous state.
III. The ”Hidden” aspect relates to the underlying state sequence that is not directly observ-
able.
IV. The ”Markov” in HMM indicates that the model uses matrix operations for calculations.

Which of the statements correctly describe the ”Hidden” and ”Markov” aspects of Hidden
Markov Models?

(a) I and II
(b) I and IV
(c) II and III
(d) III and IV

Soln. C - Refer to the lectures

5. (2 marks) For the given graphical model, what is the optimal variable elimination order when
trying to calculate P(E=e)?

[Figure: the chain A → B → C → D → E]

(a) A, B, C, D
(b) D, C, B, A
(c) A, D, B, C
(d) D, A, C, A

Soln. A

P(E = e) = Σ_d Σ_c Σ_b Σ_a P(a, b, c, d, e)
P(E = e) = Σ_d Σ_c Σ_b Σ_a P(a) P(b|a) P(c|b) P(d|c) P(e|d)
P(E = e) = Σ_d P(e|d) Σ_c P(d|c) Σ_b P(c|b) Σ_a P(a) P(b|a)
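The elimination order can be illustrated with a small numpy sketch on a chain with randomly generated (purely illustrative) CPTs; each matrix-vector product plays the role of one summed-out factor:

import numpy as np

rng = np.random.default_rng(0)
k = 3  # number of states per variable

def random_cpt(shape):
    cpt = rng.random(shape)
    return cpt / cpt.sum(axis=-1, keepdims=True)  # normalize over the child variable

p_a = random_cpt((k,))      # P(a)
p_b_a = random_cpt((k, k))  # P(b|a), rows indexed by a
p_c_b = random_cpt((k, k))  # P(c|b)
p_d_c = random_cpt((k, k))  # P(d|c)
p_e_d = random_cpt((k, k))  # P(e|d)

# Eliminate A, then B, then C, then D (innermost sums first)
m_b = p_a @ p_b_a   # sum_a P(a) P(b|a)
m_c = m_b @ p_c_b   # sum_b m_b(b) P(c|b)
m_d = m_c @ p_d_c   # sum_c m_c(c) P(d|c)
p_e = m_d @ p_e_d   # sum_d m_d(d) P(e|d) = P(E)
print(p_e, p_e.sum())  # a valid distribution over E (sums to 1)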

6. (1 Marks) Consider the following statements regarding belief propagation:

I. Belief propagation is used to compute marginal probabilities in graphical models.


II. Belief propagation can be applied to both directed and undirected graphical models.
III. Belief propagation guarantees an exact solution when applied to loopy graphs.
IV. Belief propagation works by passing messages between nodes in a graph.

Which of the statements correctly describe the use of belief propagation?

(a) I and II
(b) II and III
(c) I, II, and IV
(d) I, III, and IV
(e) II, III, and IV

Soln. C - Refer to the lectures.

7. (1 Mark) HMMs are used for finding these. Select all that apply.
(a) Probability of a given observation sequence
(b) All possible hidden state sequences given an observation sequence
(c) Most probable observation sequence given the hidden states
(d) Most probable hidden states given the observation sequence
Soln. A, D - Refer to the lectures.

Introduction to Machine Learning
Week 10
Prof. B. Ravindran, IIT Madras

1. (1 Mark) In a clustering evaluation, a cluster C contains 50 data points. Of these, 30 belong


to class A, 15 to class B, and 5 to class C. What is the purity of this cluster?
(a) 0.5
(b) 0.6
(c) 0.7
(d) 0.8
Soln. B - Purity = (Number of data points in the most frequent class) / (Total number of
data points) = 30/50 = 0.6
2. (1 Mark) Consider the following 2D dataset with 10 points:

(1, 1), (1, 2), (2, 1), (2, 2), (3, 3),
(8, 8), (8, 9), (9, 8), (9, 9), (10, 10)

Using DBSCAN with ϵ = 1.5 and MinPts = 3, how many core points are there in this dataset?
(a) 4
(b) 5
(c) 8
(d) 10
Soln. C - To be a core point, it needs at least 3 points (including itself) within ϵ = 1.5
distance. There are 8 core points: (1,1), (1,2), (2,1), (2,2) from first group and (8,8), (8,9),
(9,8), (9,9) from second group.
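The count can be confirmed with scikit-learn's DBSCAN (a sketch; sklearn assumed):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [3, 3],
              [8, 8], [8, 9], [9, 8], [9, 9], [10, 10]])

db = DBSCAN(eps=1.5, min_samples=3).fit(X)
print(len(db.core_sample_indices_))  # 8 core points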
3. (1 Mark) In BIRCH, using number of points N, sum of points SUM and sum of squared points
SS, we can determine the centroid and radius of the combination of any two clusters A and
B. How do you determine the radius of the combined cluster? (In terms of N,SUM and SS
of the two clusters A and B.)
Radius of a cluster is given by:

Radius = sqrt( SS/N − (SUM/N)² )

Note: We use the following definition of radius from the BIRCH paper: ”Radius is the average
distance from the member points to the centroid.”

(a) Radius = sqrt( SS_A/N_A − (SUM_A/N_A)² + SS_B/N_B − (SUM_B/N_B)² )
(b) Radius = sqrt( SS_A/N_A − (SUM_A/N_A)² ) + sqrt( SS_B/N_B − (SUM_B/N_B)² )
(c) Radius = sqrt( (SS_A + SS_B)/(N_A + N_B) − ((SUM_A + SUM_B)/(N_A + N_B))² )
(d) Radius = sqrt( SS_A/N_A + SS_B/N_B − ((SUM_A + SUM_B)/(N_A + N_B))² )

Soln. C
CF_{A+B} = CF_A + CF_B

Therefore,
N_{A+B} = N_A + N_B
SUM_{A+B} = SUM_A + SUM_B
SS_{A+B} = SS_A + SS_B

Replace the above in the formula of radius given in the question to get the formula for the
combined cluster.
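A small numerical sketch (numpy assumed, with two made-up 2-D clusters) of why option (c) works: the radius computed from the summed CF entries (N, SUM, SS) matches the radius computed from the pooled raw points.

import numpy as np

def radius(n, linear_sum, squared_sum):
    centroid = linear_sum / n
    return np.sqrt(squared_sum / n - centroid.dot(centroid))

rng = np.random.default_rng(1)
A = rng.random((5, 2))        # illustrative cluster A
B = rng.random((7, 2)) + 1.0  # illustrative cluster B

cf = lambda pts: (len(pts), pts.sum(axis=0), (pts ** 2).sum())  # CF = (N, SUM, SS)
(na, suma, ssa), (nb, sumb, ssb) = cf(A), cf(B)

merged = radius(na + nb, suma + sumb, ssa + ssb)  # option (c): add the CF entries
pooled = radius(*cf(np.vstack([A, B])))           # recompute from all raw points
print(np.isclose(merged, pooled))  # True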

4. (1 Mark) Which of the following properties are TRUE?


(a) Using the CURE algorithm can lead to non-convex clusters.
(b) K-means scales better than CURE for large datasets.
(c) CURE is a simplification of K-means and hence scales better than k-means for large
datasets.
(d) K-means being more expensive to run on large datasets, can give non-convex clusters too.
Soln. A- Refer to the lecture videos
5. (1 Mark) The pairwise distance between 6 points is given below. Which of the options shows
the hierarchy of clusters created by the single link clustering algorithm?

P1 P2 P3 P4 P5 P6
P1 0 3 8 9 5 4
P2 3 0 9 8 10 9
P3 8 9 0 1 6 7
P4 9 8 1 0 7 8
P5 5 10 6 7 0 2
P6 4 9 7 8 2 0

Soln. B
Step 1: Connect closest pair of points. Closest pairs are:
d(P3, P4) = 1
d(P5, P6) = 2
d(P1, P2) = 3

Step 2: Connect clusters with single link. The cluster pair to combine is bolded:
d({P1, P2}, {P3, P4}) = 8
d({P1, P2}, {P5, P6}) = 4
d({P5, P6}, {P3, P4}) = 6

Step 3: Connect the final 2 clusters

6. (1 Mark) For the pairwise distance matrix given in the previous question, which of the following
shows the hierarchy of clusters created by the complete link clustering algorithm.

Soln. B
Step 1: Same as previous question

Step 2: Connect clusters with complete link. The cluster pair to combine is bolded:
d({P1, P2}, {P3, P4}) = 9
d({P1, P2}, {P5, P6}) = 10
d({P5, P6}, {P3, P4}) = 8

Step 3: Connect the final 2 clusters
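Both hierarchies can be reproduced with SciPy's agglomerative clustering (a sketch; scipy assumed), feeding in the distance matrix from question 5:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 3, 8, 9, 5, 4],
              [3, 0, 9, 8, 10, 9],
              [8, 9, 0, 1, 6, 7],
              [9, 8, 1, 0, 7, 8],
              [5, 10, 6, 7, 0, 2],
              [4, 9, 7, 8, 2, 0]], dtype=float)

condensed = squareform(D)  # condensed form expected by linkage
print(linkage(condensed, method="single"))    # merges (P3,P4), (P5,P6), (P1,P2), then {P1,P2}+{P5,P6} at 4
print(linkage(condensed, method="complete"))  # same first three merges, then {P3,P4}+{P5,P6} at 8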

Introduction to Machine Learning
Week 11
Prof. B. Ravindran, IIT Madras

1. (1 Mark) What constraint must be satisfied by the mixing coefficients (πk ) in a GMM?
(a) πk > 0 ∀ k
(b) Σ_k πk = 1
(c) πk < 1 ∀ k
(d) Σ_k πk = 0

Soln. B - Refer to the lectures


2. (1 Mark) The EM algorithm is guaranteed to decrease the value of its objective function on
any iteration.
(a) True
(b) False
Soln. B - Refer to the lectures

3. (1 Mark) Why might the EM algorithm for GMMs converge to a local maximum rather than
the global maximum of the likelihood function?
(a) The algorithm is not guaranteed to increase the likelihood at each iteration
(b) The likelihood function is non-convex
(c) The responsibilities are incorrectly calculated
(d) The number of components K is too small
Soln. B - Refer to the lectures
4. (1 Mark) What does soft clustering mean in GMMs?

(a) There may be samples that are outside of any cluster boundary.
(b) The updates during maximum likelihood are taken in small steps, to guarantee conver-
gence.
(c) It restricts the underlying distribution to be gaussian.
(d) Samples are assigned probabilities of belonging to a cluster.

Soln. D - Refer to the lectures


5. (2 Marks) KNN is a special case of GMM with the following properties: (Multiple Correct)
(a) γi = (1 / (2πϵ)^{1/2}) e^{−i/(2ϵ)}
(b) Covariance = ϵI
(c) µi = µj ∀i, j

(d) πk = 1/k

Soln. B, D - Refer to the lectures


6. (1 Mark) We apply the Expectation Maximization algorithm to f (D, Z, θ) where D denotes
the data, Z denotes the hidden variables and θ the variables we seek to optimize. Which of
the following are correct?

(a) EM will always return the same solution which may not be optimal
(b) EM will always return the same solution which must be optimal
(c) The solution depends on the initialization
Soln. C - Refer to the lectures

7. (1 Mark) True or False: Iterating between the E-step and M-step of EM algorithms always
converges to a local optimum of the likelihood.
(a) True
(b) False

Soln. A - Refer to the lectures


8. (2 Marks) The number of parameters needed to specify a Gaussian Mixture Model with 4
clusters, data of dimension 5, and diagonal covariances is:
(a) Lesser than 21
(b) Between 21 and 30
(c) Between 31 and 40
(d) Between 41 and 50
Soln. D - For a GMM with 4 clusters in 5D with diagonal covariances, we need: means
(4×5=20), diagonal covariances (4×5=20), and mixing coefficients (4-1=3). Total parameters
= 20 + 20 + 3 = 43 parameters.
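The same count as a short sketch:

k, d = 4, 5                # clusters, data dimension
means = k * d              # 20
diag_covariances = k * d   # 20
mixing_weights = k - 1     # 3, since the weights must sum to 1
print(means + diag_covariances + mixing_weights)  # 43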

Introduction to Machine Learning
Week 12
Prof. B. Ravindran, IIT Madras

1. (1 Mark) What is the VC dimension of the class of linear classifiers in 2D space?


(a) 2
(b) 3
(c) 4
(d) None of the above
Soln. B - Any 3 points can be classified using a linear decision boundary
2. (1 Mark) Which of the following learning algorithms does NOT typically perform empirical
risk minimization?
(a) Linear regression
(b) Logistic regression
(c) Decision trees
(d) Support Vector Machines
Soln. D - Refer to the lectures
3. (2 Marks) Statement 1: As the size of the hypothesis class increases, the sample complexity
for PAC learning always increases.
Statement 2: A larger hypothesis class has a higher VC dimension.
Choose the correct option:
(a) Statement 1 is true. Statement 2 is true. Statement 2 is the correct reason for statement
1
(b) Statement 1 is true. Statement 2 is true. Statement 2 is not the correct reason for
statement 1
(c) Statement 1 is true. Statement 2 is false
(d) Both statements are false
Soln. B - Refer to the lectures
4. (1 Mark) When a model’s hypothesis class is too small, how does this affect the model’s
performance in terms of bias and variance?
(a) High bias, low variance
(b) Low bias, high variance
(c) High bias, high variance
(d) Low bias, low variance
Soln. A - Refer to the lectures

5. (1 Mark) Imagine you’re designing a robot that needs to navigate through a maze to reach a
target. Which reward scheme would be most effective in teaching the robot to find the shortest
path?
(a) +5 for reaching the target, -1 for hitting a wall
(b) +5 for reaching the target, -0.1 for every second that passes before the robot reaches the
target.
(c) +5 for reaching the target, -0.1 for every second that passes before the robot reaches the
target, +1 for hitting a wall.
(d) -5 for reaching the target, +0.1 for every second that passes before the robot reaches the
target.
Soln. B - The +5 reward for reaching the target encourages goal achievement, while the -0.1
penalty for each second promotes finding the shortest path. Rewards for hitting walls are
omitted since the question says nothing in that regard.

For the rest of the questions, we will follow a simplistic game and see how a Reinforcement
Learning agent can learn to behave optimally in it.
This is our game: the states form a chain, LE - X1 - X2 - Start - X3 - X4 - RE.

At the start of the game, the agent is on the Start state and can choose to move left or right
at each turn. If it reaches the right end(RE), it wins and if it reaches the left end(LE), it loses.

Because we love maths so much, instead of saying the agent wins or loses, we will say that the
agent gets a reward of +1 at RE and a reward of -1 at LE. Then the objective of the agent is
simply to maximize the reward it obtains!
6. (1 Mark) For each state, we define a variable that will store its value. The value of the state
will help the agent determine how to behave later. First we will learn this value.

Let V be the mapping from state to its value.


Initially,
V(LE) = -1
V(X1) = V(X2) = V(X3) = V(X4) = V(Start) = 0
V(RE) = +1
For each state S ∈ {X1, X2, X3, X4, Start}, with SL being the state to its immediate left and
SR being the state to its immediate right, repeat:

V (S) = 0.9 × max(V (SL ), V (SR ))


Till V converges (does not change for any state).

What is V(X4) after one application of the given formula?

(a) 1
(b) 0.9
(c) 0.81
(d) 0
Soln. B -
V (X4) = 0.9 × max(V (X3), V (RE))
V (X4) = 0.9 × max(0, +1) = 0.9

7. (1 Mark) What is V(X1) after one application of given formula?


(a) -1
(b) -0.9
(c) -0.81
(d) 0
Soln. D -
V (X1) = 0.9 × max(V (LE), V (X2))
V (X1) = 0.9 × max(−1, 0) = 0

8. (2 Marks) What is V(X1) after V converges?


(a) 0.59
(b) -0.9
(c) 0.63
(d) 0
Sol. A - This is the sequence of changes in V:
V (X4) = 0.9 → V (X3) = 0.81 → V (Start) = 0.729 → V (X2) = 0.656 → V (X1) = 0.59
Final value for X1 is 0.59.
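The whole iteration can be reproduced with a few lines of Python (a sketch; it assumes the chain layout LE - X1 - X2 - Start - X3 - X4 - RE inferred from the solutions above):

states = ["LE", "X1", "X2", "Start", "X3", "X4", "RE"]
V = {s: 0.0 for s in states}
V["LE"], V["RE"] = -1.0, 1.0

# Repeatedly apply V(S) = 0.9 * max(V(left neighbour), V(right neighbour)) until convergence
changed = True
while changed:
    changed = False
    for i in range(1, len(states) - 1):
        new_v = 0.9 * max(V[states[i - 1]], V[states[i + 1]])
        if abs(new_v - V[states[i]]) > 1e-9:
            V[states[i]], changed = new_v, True

print(round(V["X1"], 2))  # 0.59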
