COL774 Practice Problems
Fall 2024-2025
1 Introduction
1. You may refer to [Petersen and Pedersen, 2021] for matrix derivatives for the upcoming problems.
7. When the bias term is taken as a feature, we absorb b in w, and the loss functions in item 6 can
be denoted as L_SSE(w), L_Logistic(w), L_2(w), L_1(w), respectively.
2 Regression
2.1 GFLOPS Calculation
For the matrix multiplication Y = XW, where X ∈ R^{m×n} and W ∈ R^{n×1}:
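As a rough companion sketch (not part of the problem statement; the sizes below are made up), the achieved GFLOPS of this product can be estimated by timing it and counting roughly 2·m·n floating-point operations (m·n multiplications and about as many additions):

import numpy as np
import time

m, n = 2048, 2048                      # hypothetical sizes, not specified in the problem
X = np.random.randn(m, n)
W = np.random.randn(n, 1)

start = time.perf_counter()
Y = X @ W                              # Y has shape (m, 1)
elapsed = time.perf_counter() - start

flops = 2.0 * m * n                    # ~ m*n multiplications + ~ m*n additions
print(f"achieved GFLOPS ~ {flops / elapsed / 1e9:.2f}")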
2.5 Convex Functions
A function f : Rn → R is called convex if for all x1 , x2 ∈ Rn and θ ∈ [0, 1], the following inequality
holds:
f (θx1 + (1 − θ)x2 ) ≤ θf (x1 ) + (1 − θ)f (x2 )
This property ensures that the function lies below the straight line connecting any two points on its
graph. Convex functions play a crucial role in optimization problems, as they ensure that any local
minimum is also a global minimum.
(This is convex, but you can't do it by computing the Hessian, as the function is not differentiable
at x = 0. You'll need to prove it using the definition in Convex Functions and the Cauchy–Schwarz
Inequality.)
3. f(x) = x^T Ax, where A ∈ R^{n×n}, x ∈ R^n. A is Diagonal.
(f(x, y) is convex ⟺ A = 0)
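A quick numerical sanity check for quadratic forms like item 3 (a sketch only; the diagonal A below is an arbitrary example): f(x) = x^T Ax is twice differentiable with Hessian A + A^T, so convexity can be probed by checking that the Hessian's eigenvalues are non-negative.

import numpy as np

A = np.diag([2.0, 0.5, 1.0])          # example diagonal matrix
H = A + A.T                           # Hessian of f(x) = x^T A x
eigvals = np.linalg.eigvalsh(H)
print("Hessian eigenvalues:", eigvals)
print("f is convex (numerically):", bool(np.all(eigvals >= -1e-12)))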
2.7 Probabilistic Interpretation of Regularization
In class we looked at the Maximum Likelihood Estimate of w for the Linear Regression problem. We
considered w as a parameter, and the optimum w_MLE (for a single example) was defined as

w_MLE = arg max_w P(y^(i) | x^(i); w)
Now, similar to the MLE case, we assume that, over all examples, y^(i) − ŷ^(i) | x^(i), w are i.i.d. and distributed as

y^(i) − ŷ^(i) | x^(i), w ∼ N(0, σ^2)
a) Simplify Equation 1 for the entire dataset combined, and show that solving for w_MAP is similar to
Linear Regression with L2 Regularization on w.
b) Obtain explicitly the prior on w that will lead to L1 Regularization of w (keep a different
penalty coefficient for each of the n components of w). Does this distribution have a name?
Consider Logistic Regression on Perfectly Linearly Separable data, as shown in Figure 1. You
may not assume that there are just 2 features as shown in the figure.
a) What is the condition (possibly implicit) for data to be linearly separable, when the features
x^(i) ∈ R^n?
b) Consider solving for the decision boundary (w, b) for such data. Show that ∀ϵ > 0, ∃(w_ϵ, b_ϵ) such
that L_Logistic(w_ϵ, b_ϵ) < ϵ.
c) Consider running gradient descent to obtain (w, b). Assuming the learning rate is small, will the
algorithm ever converge in (w, b)? (Answer: w and b diverge. You may skip it as it's difficult to
prove.)
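Part c) can also be explored empirically. The sketch below (a toy 1-D separable dataset, not the data of Figure 1) runs gradient descent on the average logistic loss; the loss keeps shrinking towards 0 while |w| and b keep growing.

import numpy as np

# toy perfectly separable data: x < 0 -> y = 0, x > 0 -> y = 1
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

w, b, lr = 0.0, 0.0, 0.5
for step in range(1, 20001):
    z = w * x + b
    p = 1.0 / (1.0 + np.exp(-z))              # sigmoid
    w -= lr * np.mean((p - y) * x)            # gradient of the average logistic loss
    b -= lr * np.mean(p - y)
    if step % 5000 == 0:
        loss = np.mean(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))
        print(f"step {step}: |w| = {abs(w):.2f}, b = {b:.2f}, loss = {loss:.6f}")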
z_j^(i) = w_j^T x^(i),   for j = 1, 2, ..., r    (2)

ŷ_j^(i) = e^{z_j^(i)} / Σ_{k=1}^r e^{z_k^(i)},   for j = 1, 2, ..., r    (3)

Equation 3 is the Softmax function. (Why Softmax?) The loss function, for a single example (i),
y^(i) ∈ {1, 2, ..., r}, is

L^(i)(w_1, w_2, ..., w_r) = − Σ_{j=1}^r 1{y^(i) = j} log(ŷ_j^(i))    (4)
a) Is L(w_1, w_2, ..., w_r) convex? (Answer: Yes! But you may skip it as the proof gets very involved.)
b) Obtain explicitly ∇_{w_1}L, ∇_{w_2}L, ..., ∇_{w_r}L. Hint: Start by writing L in terms of (w_1, w_2, ..., w_r).
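Whatever expression you derive in part b), a finite-difference check is a useful validation. The sketch below (random W, x and a made-up label) compares one candidate expression, ∇_{w_j}L = (ŷ_j − 1{y = j}) x, against numerical gradients of Equation 4.

import numpy as np

def softmax_ce_loss(W, x, y):
    # W stacks w_1 ... w_r as rows (shape (r, n)); x has shape (n,); y is the class index
    z = W @ x
    z = z - z.max()                              # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[y]), p

rng = np.random.default_rng(0)
r, n, y = 4, 5, 2
W = rng.normal(size=(r, n))
x = rng.normal(size=n)

loss, p = softmax_ce_loss(W, x, y)
analytic = np.outer(p - np.eye(r)[y], x)         # candidate gradient, row j = (p_j - 1{y=j}) x

numeric = np.zeros_like(W)
eps = 1e-6
for i in range(r):
    for j in range(n):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        numeric[i, j] = (softmax_ce_loss(Wp, x, y)[0] - softmax_ce_loss(Wm, x, y)[0]) / (2 * eps)

print("max abs difference:", np.abs(analytic - numeric).max())   # ~1e-9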
We'll see in this problem how. Let a > 0, λ > 0, b, c ∈ R, and

w* = arg min_{x ∈ R^n}  x^T Ax + b^T x + c

Show that

w(α)_i = sign(w_i) max(0, |w_i| − α/a_i)

What happens to components of w(α) corresponding to smaller diagonal entries of A? How can
this be used for feature selection?
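For intuition, a sketch of the thresholding map above (often called soft-thresholding) applied component-wise; the vector w and the diagonal entries a_i below are made-up examples.

import numpy as np

def soft_threshold(w, alpha, a):
    # w(alpha)_i = sign(w_i) * max(0, |w_i| - alpha / a_i)
    return np.sign(w) * np.maximum(0.0, np.abs(w) - alpha / a)

w = np.array([3.0, -0.4, 0.05, -2.0])          # example unregularized components
a = np.array([4.0,  1.0, 0.50,  2.0])          # example diagonal entries of A
for alpha in [0.0, 0.5, 2.0]:
    print(alpha, soft_threshold(w, alpha, a))  # small a_i -> large threshold alpha/a_i -> zeroed first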
α_{σ,i}(x) = e^{−(x − x_i)^2 / (2σ^2)} / Σ_{j=1}^m e^{−(x − x_j)^2 / (2σ^2)}

ŷ_i = w x_i + b

L_σ(x, w, b) = (1/m) Σ_{i=1}^m α_{σ,i}(x) (y_i − ŷ_i)^2 + λw^2

Denote
2.12 Unique Minima
In class we discussed that the problem of Linear Regression has a unique optimum. We’ll try to prove
that formally and see what happens to stationary points when convexity is not guaranteed. Consider
the function
f(x) = (1/2) x^T Ax − b^T x + c    (5)

Where c ∈ R, x ∈ R^n, b ∈ R^n, A ∈ R^{n×n} and is Symmetric. All eigenvalues of A are positive. We
define x* ∈ R^n such that ∇_x f |_{x = x*} = 0.
b) We'll now prove Equation 6. Set x = x* + ϵz, z^T z = 1, ϵ > 0 in the equation, and show that

f(x* + ϵz) − f(x*) = (ϵ^2 / 2) z^T Az    (7)

Assume the fact, "Every n × n Real Symmetric Matrix is Diagonalizable and has a set of n
Orthonormal Eigenvectors". Express z as

z = Σ_{i=1}^n a_i v_i
c) Now, let all eigenvalues of A be non-zero and not all of the same sign. Obtain explicitly at least
one z ∈ R^n such that ∥z∥_2 = 1 and f(x* + ϵz) − f(x*) < 0 for some ϵ > 0.
d) Show that the Linear Regression objective can be expressed as Equation 5. Does the condition, "A
is Symmetric and all eigenvalues of A are positive", assumed here always hold in Linear Regression?
Clearly describe the condition on the data when this isn't true. We'll derive these conditions in
Multiple Minimas.
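A numerical illustration related to part c) (with a made-up symmetric A whose eigenvalues have mixed signs): taking z to be a unit eigenvector for a negative eigenvalue λ gives f(x* + ϵz) − f(x*) = (ϵ^2/2)λ < 0, consistent with Equation 7.

import numpy as np

A = np.array([[2.0,  0.0, 0.0],
              [0.0, -1.0, 0.0],
              [0.0,  0.0, 3.0]])              # symmetric, eigenvalues of mixed sign
b = np.array([1.0, 2.0, 3.0])

f = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)                # stationary point: A x* = b

lam, V = np.linalg.eigh(A)
z = V[:, np.argmin(lam)]                      # unit eigenvector of the most negative eigenvalue
eps = 0.1
print(f(x_star + eps * z) - f(x_star))        # negative
print(0.5 * eps**2 * lam.min())               # matches (eps^2 / 2) * lambda_min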
Absorb the bias term in Linear Regression into the x^(i)'s. We derived the condition for the optimal w*
as (derive it yourself if you don't remember this),

X^T Xw* = X^T y    (9)

X^T CXv = 0 ⟺ Xv = 0
e) Now let X^T X be non-invertible. Using the above part, what do you conclude about the columns of
X, the data matrix (note that there's also the column of all 1's in X)? Let w_1* be one w minimizing
L_SSE(w); obtain another.
f) Continuing item e), if you have to choose a subset of the initial n features for regression, without
losing any information, how would you do that?
g) Following the above part, can you realize why assuming A in Unique Minima to be positive
definite, instead of positive semi-definite, is reasonable and doesn't lead to loss of generality?
h) Redo all the above parts for Weighted Linear Regression, with example-wise weights u_1, u_2, ..., u_m, all
positive. You'll see that all these results hold true for this case too!
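A small numerical companion to part e) (synthetic data with a deliberately duplicated column; not a solution write-up): when X^T X is singular, adding any null-space direction of X to a minimizer leaves the predictions, and hence L_SSE, unchanged.

import numpy as np

rng = np.random.default_rng(0)
m = 20
x1 = rng.normal(size=m)
X = np.column_stack([np.ones(m), x1, 2 * x1])   # third column is a multiple of the second
y = rng.normal(size=m)

w1, *_ = np.linalg.lstsq(X, y, rcond=None)      # one minimizer of ||Xw - y||^2
v = np.array([0.0, 2.0, -1.0])                  # X v = 0 (null-space direction)
w2 = w1 + 5.0 * v                               # another minimizer

print(np.allclose(X @ w1, X @ w2))              # True: identical predictions
print(np.sum((X @ w1 - y) ** 2), np.sum((X @ w2 - y) ** 2))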
ŷ^(i) = w^T x^(i)

Mw = 0

i.e. M ∈ R^{4×n} and rank(M) = 4. Under the above constraints, obtain the optimum value of w that
minimizes these loss functions.

a) L_SSE(w)

¹ For z ∈ R^n, z^T z = 0 ⟺ z = 0.
b) L_SSE(w) + λw^T Aw

Note that it is not intended that one uses Lagrange multipliers to solve this.
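One way to sanity-check an answer to part a) numerically, in the spirit of the note above (no Lagrange multipliers): parameterize the feasible set as w = Nt, where the columns of N span null(M), and solve the resulting unconstrained least-squares problem in t. This is only a sketch with synthetic X, y and M.

import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 8
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
M = rng.normal(size=(4, n))                      # rank 4 with probability 1

_, _, Vt = np.linalg.svd(M)
N = Vt[4:].T                                     # orthonormal basis of null(M), shape (n, n - 4)

t, *_ = np.linalg.lstsq(X @ N, y, rcond=None)    # unconstrained problem in t
w = N @ t                                        # feasible by construction: M w = 0

print("constraint residual:", np.abs(M @ w).max())
print("SSE:", np.sum((X @ w - y) ** 2))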
3 Neural Networks
3.1 Backpropagation in Neural Networks
Following the terminology used in class, write out Backpropagation for Neural Networks yourself and
hence derive the update rule for each weight and bias in the network. Consider the r-class classification problem
with Softmax in the last layer and the Cross Entropy loss function.
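A compact reference sketch to check your derived update rules against (one hidden sigmoid layer, Softmax + Cross Entropy output; the layer sizes, data and learning rate are made up):

import numpy as np

rng = np.random.default_rng(0)
n, h, r, m = 5, 8, 3, 32                 # input dim, hidden units, classes, batch size
X = rng.normal(size=(m, n))
y = rng.integers(0, r, size=m)
Y = np.eye(r)[y]                         # one-hot labels

W1, b1 = 0.1 * rng.normal(size=(n, h)), np.zeros(h)
W2, b2 = 0.1 * rng.normal(size=(h, r)), np.zeros(r)
lr = 0.5

for step in range(200):
    # forward pass
    A1 = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))            # sigmoid hidden layer
    Z2 = A1 @ W2 + b2
    P = np.exp(Z2 - Z2.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)                    # softmax output
    loss = -np.mean(np.log(P[np.arange(m), y]))

    # backward pass: softmax + cross entropy gives the simple output delta P - Y
    d2 = (P - Y) / m
    dW2, db2 = A1.T @ d2, d2.sum(axis=0)
    d1 = (d2 @ W2.T) * A1 * (1 - A1)                     # back through the sigmoid
    dW1, db1 = X.T @ d1, d1.sum(axis=0)

    # gradient descent updates
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print("final training loss:", loss)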
• Layer 2: Assume ẑ has size h2 × w2 × d2 . A max pooling filter of shape a × a with stride a runs
over ẑ to produce z. Assume a divides h1 and w1 .
– Write down the exact dimensions of z and the expression for each zijk in terms of ẑ.
– Assuming ∂L/∂z_{ijk} is known for all i, j, k within the size of z, obtain ∂L/∂ẑ_{pqk} (see the sketch after this list).
• Realize that "sliding" the Convolution Filter over the input feature map is just for intuition. All
features in the output feature map can be computed in parallel.
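For the max-pooling bullet above, a direct (loop-based, unoptimized) sketch of the forward pass and of routing ∂L/∂z back to ∂L/∂ẑ: only the entries attaining the maximum of each a × a window receive gradient. It assumes a divides the spatial dimensions of ẑ.

import numpy as np

def maxpool_forward(z_hat, a):
    h, w, d = z_hat.shape
    z = np.zeros((h // a, w // a, d))
    for i in range(h // a):
        for j in range(w // a):
            for k in range(d):
                z[i, j, k] = z_hat[i*a:(i+1)*a, j*a:(j+1)*a, k].max()
    return z

def maxpool_backward(z_hat, dL_dz, a):
    dL_dz_hat = np.zeros_like(z_hat)
    for i in range(dL_dz.shape[0]):
        for j in range(dL_dz.shape[1]):
            for k in range(dL_dz.shape[2]):
                window = z_hat[i*a:(i+1)*a, j*a:(j+1)*a, k]
                mask = (window == window.max())       # argmax position(s) of the window
                dL_dz_hat[i*a:(i+1)*a, j*a:(j+1)*a, k] = mask * dL_dz[i, j, k]
    return dL_dz_hat

z_hat = np.random.default_rng(0).normal(size=(6, 6, 2))
z = maxpool_forward(z_hat, a=2)                       # shape (3, 3, 2)
g = maxpool_backward(z_hat, np.ones_like(z), a=2)     # same shape as z_hat
print(z.shape, g.shape)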
a) Using a single neuron, implement the OR and AND gates, i.e. when x ∈ {0, 1}^n, f_n(x) should
output the corresponding boolean logic.
b) Realize that using a single neuron f_n, you can implement every possible linear decision boundary in
R^n. Note that for a single geometric hyperplane in R^n, there are two possible decision boundaries
(see Figure 2 and Figure 3).
c) If you want to partition R^2 into different regions using straight lines, each classified as one of
{0, 1}, you can draw these lines and use boolean logic to model on which side a point in R^2 lies
w.r.t. each line (one of {0, 1}). Taking ideas from the above parts, draw a 2-layer network with
at most 3 neurons that can draw every decision boundary shaped like Figure 5 and Figure 4. Can
the same network be used to draw the decision boundaries in Figure 2 and Figure 3? Can this network
draw Figure 6?
d) Show why you can't implement the XOR gate using a single neuron. Now, use two layers of neurons
(at most 3 neurons in total) to implement XOR, in the case x ∈ {0, 1}^2.
e) Using at most 3 layers and 5 neurons, implement all decision boundaries like Figure 6. Will this
also represent all decision boundaries in Figure 2 through Figure 6?
f) Think about the minimum number of layers and neurons needed to implement any boolean function
of n variables. Using the ideas above, can you design a Neural Network that can perfectly classify
any m points in R^2, without knowing their positions and labels beforehand?
Figure 2: Boundary 1
Figure 3: Boundary 2
Figure 6: Boundary 5
"when applied to linear regression, dropout is equivalent to L2 weight decay, with a different weight
decay coefficient for each input feature"
a) While training, notice that the output M(x) is a random variable. Show that the model output
can be expressed precisely as
M(x) = wT (x ⊙ d) + b
Where ⊙ represents the element-wise product and d ∈ {0, 1}n is a random vector.
b) Obtain the probability distribution of d. Hence compute E[d] and E[ddT ].
c) Prove these trivial properties (x is a constant).
e) Express L(w, b) as L̃(w̃, b) where w̃ = pw (because finally we'll use w̃ for evaluation). Assume all
n features have been shifted to zero data mean. Show that

E[L̃(w̃, b)] = (1/m) Σ_{i=1}^m (w̃^T x^(i) + b − y^(i))^2 + ((1 − p)/p) Σ_{j=1}^n σ_j^2 w̃_j^2    (12)
f) Observe that w̃ = pw ⟹ L(w, b) = L̃(w̃, b); hence minimizing L(w, b) over (w, b) is equivalent to
minimizing L̃(w̃, b) over (w̃, b). Clearly, Equation 12 shows that using dropout in Linear Regression
has the effect of L2 regularization on w̃. What can you say about the relative penalty corresponding to
two different features?
g) Think about the strength of regularization in the limits p → 0+ and p → 1−. Was this expected?
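A Monte-Carlo sanity check of Equation 12 (synthetic zero-mean features; p, w, b and the data are all made up): averaging the dropout loss over many sampled masks should approach the closed form on the right-hand side.

import numpy as np

rng = np.random.default_rng(0)
m, n, p = 200, 4, 0.7
X = rng.normal(size=(m, n)); X -= X.mean(axis=0)     # zero-mean features
y = rng.normal(size=m)
w, b = rng.normal(size=n), 0.3
w_tilde = p * w

# Monte-Carlo estimate of E[L(w, b)] with masks d ~ Bernoulli(p) per coordinate
trials, total = 4000, 0.0
for _ in range(trials):
    D = (rng.random(size=(m, n)) < p).astype(float)
    total += np.mean(((X * D) @ w + b - y) ** 2)
mc = total / trials

sigma2 = (X ** 2).mean(axis=0)                       # per-feature variance (features are zero-mean)
closed = np.mean((X @ w_tilde + b - y) ** 2) + (1 - p) / p * np.sum(sigma2 * w_tilde ** 2)
print(mc, closed)                                    # the two should be close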
4 Gaussian Discriminant Analysis and Naive Bayes
4.1 Distribution of Multivariate Gaussian
Let the random vector X = [X_1 X_2 . . . X_n]^T. Let Z ∈ R^ℓ be the standard normal random
vector (i.e. Z_i ∼ N(0, 1) for i = 1, . . . , ℓ are i.i.d.). Then X_1, . . . , X_n are jointly Gaussian if there
exist µ ∈ R^n, A ∈ R^{n×ℓ} such that X = AZ + µ.
We will see how this leads to the famous multivariate Gaussian distribution when the covariance
matrix Σ is invertible. The joint moment generating function of random variables X_1, X_2, ..., X_n is defined
as

M(t_1, t_2, . . . , t_n) = E[e^{t_1 X_1 + ··· + t_n X_n}]
[Ross, 2014], in "Chapter 7: Properties of Expectation", mentions:
It can be proven (although the proof is too advanced for this text) that the joint moment generating
function M(t_1, . . . , t_n) uniquely determines the joint distribution of X_1, . . . , X_n.
Denote t = [t_1 t_2 . . . t_n]^T. Given the facts

Y ∼ N(µ, σ^2) ⟹ E[exp(Y)] = exp(µ + σ^2/2)

∫_{x ∈ R^n} f(x) d^n x = ∫_{α ∈ R^n} f(Qα + b) |Q| d^n α
a) Prove that

E[exp(t^T X)] = exp(t^T µ + (t^T AA^T t)/2)

b) Prove that if the PDF of X is

p(x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)^T Σ^{-1} (x − µ))

then

E[exp(t^T X)] = exp(t^T µ + (t^T Σt)/2)

If you need a hint, see this.²
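A quick empirical companion to part a) (arbitrary example A and µ): drawing X = AZ + µ many times, the sample mean and sample covariance approach µ and AA^T.

import numpy as np

rng = np.random.default_rng(0)
n, l, N = 3, 2, 200000
A = rng.normal(size=(n, l))
mu = np.array([1.0, -2.0, 0.5])

Z = rng.normal(size=(l, N))                  # columns are i.i.d. standard normal vectors
X = A @ Z + mu[:, None]                      # X = A Z + mu, one sample per column

print("sample mean:", X.mean(axis=1))        # ~ mu
print("sample covariance:\n", np.cov(X))     # ~ A A^T
print("A A^T:\n", A @ A.T)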
y ∼ Bernoulli(ϕ)
x|y = 0 ∼ N (µ0 , Σ0 )
x|y = 1 ∼ N (µ1 , Σ1 )
² Σ is symmetric positive definite, so it has a unique symmetric positive definite square root. Write the integral and
substitute x = Σ^{1/2} α + (µ + Σt).
4.3 Decision Boundary in GDA
Show that the decision boundary of GDA has equation
(1/2) x^T (Σ_1^{-1} − Σ_0^{-1}) x − (µ_1^T Σ_1^{-1} − µ_0^T Σ_0^{-1}) x + (1/2)(µ_1^T Σ_1^{-1} µ_1 − µ_0^T Σ_0^{-1} µ_0) + ln( ((1 − ϕ)/ϕ) √(|Σ_1|/|Σ_0|) ) = 0
Give an example of values of ϕ, µ0 , µ1 , Σ0 , Σ1 so that the decision boundary has equation
wT x + b = 0
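A numerical sketch of the boundary expression above (the parameters below are one arbitrary example): when Σ_0 = Σ_1 the quadratic term vanishes and the boundary reduces to a linear equation of the form w^T x + b = 0.

import numpy as np

def gda_boundary(x, phi, mu0, mu1, S0, S1):
    S0i, S1i = np.linalg.inv(S0), np.linalg.inv(S1)
    quad = 0.5 * x @ (S1i - S0i) @ x
    lin = -(mu1 @ S1i - mu0 @ S0i) @ x
    const = 0.5 * (mu1 @ S1i @ mu1 - mu0 @ S0i @ mu0) \
            + np.log((1 - phi) / phi * np.sqrt(np.linalg.det(S1) / np.linalg.det(S0)))
    return quad + lin + const

phi = 0.5
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
S = np.eye(2)                                          # shared covariance -> linear boundary
for x in [np.array([1.0, 0.0]), np.array([1.0, 3.0]), np.array([1.0, -3.0])]:
    print(x, gda_boundary(x, phi, mu0, mu1, S, S))     # all ~0: the boundary is x1 = 1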
Where a_0, a_1 > 0 are fixed. Φ_0 and Φ_1 follow the distributions Dirichlet(a_0 + 1, . . . , a_0 + 1) and
Dirichlet(a_1 + 1, . . . , a_1 + 1), which reduce to the Beta distribution when |V| = 2. Assume that
• All training examples are independent.
• There is at least one example of each class.
a) Show that

p(ϕ, Φ_0, Φ_1 | {(x^(i), y^(i))}_{i=1}^m) = p(ϕ) p(Φ_0) p(Φ_1) ∏_{i=1}^m p(x^(i), y^(i) | ϕ, Φ_0, Φ_1) / p({(x^(i), y^(i))}_{i=1}^m)
b) The naive Bayes assumption is that the words (random variables) in different positions X_1, X_2, . . . , X_{d_i},
given y, are independent. Show that

p(x^(i), y^(i) | ϕ, Φ_0, Φ_1) = ϕ^{y^(i)} (1 − ϕ)^{1 − y^(i)} ∏_{j=1}^{d_i} ∏_{k=1}^{|V|} ϕ_{k0}^{1{x_j^(i) = k ∧ y^(i) = 0}} · ϕ_{k1}^{1{x_j^(i) = k ∧ y^(i) = 1}}
ϕ_{k0} = ( a_0 + Σ_{i=1}^m Σ_{j=1}^{d_i} 1{x_j^(i) = k ∧ y^(i) = 0} ) / ( a_0 |V| + Σ_{i=1}^m 1{y^(i) = 0} d_i )

ϕ_{k1} = ( a_1 + Σ_{i=1}^m Σ_{j=1}^{d_i} 1{x_j^(i) = k ∧ y^(i) = 1} ) / ( a_1 |V| + Σ_{i=1}^m 1{y^(i) = 1} d_i )

ϕ = ( Σ_{i=1}^m 1{y^(i) = 1} ) / m

Keep in mind that the parameters are constrained: Σ_{k=1}^{|V|} ϕ_{k0} = Σ_{k=1}^{|V|} ϕ_{k1} = 1.
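A sketch of the smoothed estimates above on a tiny made-up corpus (word indices and documents are arbitrary); note that ϕ_{k0} and ϕ_{k1} each sum to 1, as required by the constraint.

import numpy as np

V = 4                                    # vocabulary size |V|
a0 = a1 = 1.0                            # Dirichlet hyperparameters (Laplace smoothing for a = 1)
# each document is a list of word indices in {0, ..., V-1}; y gives its class
docs = [[0, 1, 1, 2], [0, 0, 3], [2, 3, 3], [1, 2], [3, 3, 0]]
y = np.array([0, 0, 1, 1, 1])

counts = np.zeros((V, 2))                # counts[k, c] = occurrences of word k in class-c documents
lengths = np.zeros(2)                    # total number of word positions per class
for doc, label in zip(docs, y):
    for k in doc:
        counts[k, label] += 1
    lengths[label] += len(doc)

phi_k0 = (a0 + counts[:, 0]) / (a0 * V + lengths[0])
phi_k1 = (a1 + counts[:, 1]) / (a1 * V + lengths[1])
phi = np.mean(y == 1)

print(phi_k0, phi_k0.sum())              # sums to 1
print(phi_k1, phi_k1.sum())              # sums to 1
print(phi)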
5 Principal Component Analysis
5.1 Variance and covariance of projected features
Consider the data D = {x^(i)}_{i=1}^m. Let u, v ∈ R^n be such that ∥u∥_2 = ∥v∥_2 = 1. Σ ∈ R^{n×n} is the
covariance of D. We now look at features formed by projecting D along the directions u and v.
a) Show that
Var(uT x(i) ) = uT Σu
Cov(uT x(i) , vT x(i) ) = uT Σv
Denote x̂ = (1/m) Σ_{i=1}^m x̂^(i). We have the two optimization problems (under the constraints given before):
1. Maximizing projected variance

max_{u_1, u_2, ..., u_k}  (1/(m·n)) Σ_{i=1}^m ∥x̂^(i) − x̂∥_2^2    (16)
a) Assume all x^(i)'s are exactly the same (i.e. the total variance is 0) but all y^(i)'s are distinct. Can
you learn anything from this data? Do you realize why variance in the data is related to the
information it gives?
b) Show that both the problems above are equivalent. That is, any set {u_1, u_2, . . . , u_k} maximizing
Equation 16 minimizes Equation 17 and vice-versa.
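A numerical illustration of part b) (random correlated data, k = 2): the top-k eigenvectors of the covariance simultaneously give the largest projected variance and the smallest projection loss, while a random orthonormal U does worse on both; their sum is constant, which is the heart of the equivalence.

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 500, 5, 2
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))   # correlated data
X -= X.mean(axis=0)
Sigma = X.T @ X / m

def scores(U):
    Xp = X @ U @ U.T                                     # reconstruction from the k-dim projection
    var = np.mean(np.sum(Xp ** 2, axis=1))               # projected variance (data is centered)
    loss = np.mean(np.sum((X - Xp) ** 2, axis=1))        # projection loss
    return var, loss

_, vecs = np.linalg.eigh(Sigma)
U_pca = vecs[:, -k:]                                     # top-k eigenvectors
U_rand, _ = np.linalg.qr(rng.normal(size=(n, k)))        # random orthonormal columns

print("PCA    (variance, loss):", scores(U_pca))
print("random (variance, loss):", scores(U_rand))
print("variance + loss is constant:", [sum(scores(U)) for U in (U_pca, U_rand)])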
5.3 Is the solution unique?
Look at the optimization problem described in Projection Loss and Maximizing Variance for minimizing projection loss, Equation 17. {u_1, u_2, . . . , u_k} is a set of orthonormal vectors. Denote U ∈ R^{n×k}
as the matrix with columns u_i's in that order. Let A ∈ R^{k×k} be orthogonal, and U′ = UA. Let
{u′_1, u′_2, . . . , u′_k} be the columns of U′.
a) Show that {u′1 , u′2 , . . . , u′k } are orthonormal.
b) Show that both {u1 , u2 , . . . , uk } and {u′1 , u′2 , . . . , u′k } give the same objective value in Equation 17.
You already know that taking the top k eigenvectors of Σ, the data covariance, as {u_1, u_2, . . . , u_k} minimizes
the projection loss. Can you now comment on whether this is the only solution to the problem of minimizing
projection loss? We asked you the case k = 1 of this problem in Quiz 10!
With some more work, you can show that span({u1 , u2 , . . . , uk }) = span({u′1 , u′2 , . . . , u′k }). Using this,
grasp the fact that while minimizing projection loss, what we’re actually looking for is a subspace to
project the data onto. The Principal Components are actually a more informative representation of this
subspace, because for every k ≤ n, the span of the top k eigenvectors is the optimal subspace.
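A quick check of parts a) and b) (random data; A is a random orthogonal matrix): U and U′ = UA give identical projection loss because UU^T = U′U′^T, i.e. they describe the same subspace.

import numpy as np

rng = np.random.default_rng(1)
m, n, k = 300, 6, 3
X = rng.normal(size=(m, n)); X -= X.mean(axis=0)

U, _ = np.linalg.qr(rng.normal(size=(n, k)))          # orthonormal columns
A, _ = np.linalg.qr(rng.normal(size=(k, k)))          # k x k orthogonal matrix
U_prime = U @ A

def projection_loss(B):
    R = X - X @ B @ B.T
    return np.mean(np.sum(R ** 2, axis=1))

print(np.allclose(U_prime.T @ U_prime, np.eye(k)))    # columns of U' are orthonormal
print(projection_loss(U), projection_loss(U_prime))   # identical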
6 Decision Trees
6.1 Fast Greedy Pruning
Consider data of the form {(x^(i), y^(i))}_{i=1}^m where x^(i) is an n-dimensional feature vector and y^(i) ∈
{0, 1}. Denote the validation set by V. We wish to implement the Greedy Pruning algorithm to prune
a trained decision tree T using the validation set V.
In each iteration, we consider all non-leaf nodes in T, and remove the entire subtree of the node (hence
making it a leaf) that leads to the maximum increase in accuracy on V. The process ends when pruning any
remaining node in the tree would lead to a decrease in accuracy on V.
Naively implementing this algorithm takes O(n_iters · |V| · |T|^2) time: in each iteration we temporarily
prune each candidate node in turn and run V through the tree to get the accuracy on V. Finally we actually
prune the node chosen by the rule described above.
Think of an algorithm that can do this in O(|V| · depth(T) + |T| + n_iters · |T| · depth(T)) time. Realize
why this is a significant speed-up.
You can find a well-tested Python implementation of this algorithm here [Harshit0143, 2023]. Hints
for the algorithm:
1. During the pruning process, the split rule at every node remains the same. The set of examples
from V that reach any node of T during pruning remains exactly the same.
2. At each node, store n_0, n_1, n_true: the number of examples from class 0 and class 1 in V that reach this node,
and the number of these that are correctly classified, respectively. This can be done in
O(|V| · depth(T) + |T|) time.
3. If you have to prune a non-leaf node ver, then n_0, n_1 remain unchanged throughout the tree. For
which nodes does n_true change? You can update this value for all such nodes in total O(depth(T))
time.
7 Support Vector Machines
7.1 SVM Kernels: But why bother duals?
min_{w,b}  (1/2) ∥w∥_2^2
s.t.  y^(i)(w^T x^(i) + b) ≥ 1,  i ∈ {1, 2, . . . , m}

max_α  W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩
s.t.  α_i ≥ 0,  i ∈ {1, 2, . . . , m}
      Σ_{i=1}^m α_i y^(i) = 0.
Look at the data in Figure 7. The feature vector is expressed as x = [x_1 x_2]^T. The data is generated
by sampling x from N(0, I) and keeping only points with |x_1| ≤ 5 and |x_2| ≤ 5. Points within the circle
with center (0, 0) and radius 3 are labeled red and those outside the circle of radius 4 are labeled blue. Other
points are ignored.
a) Can simple SVMs (without using kernels) be used to classify the data in Figure 7 perfectly?
b) Suggest a transformation ϕ : R^2 → R^k such that, when applied to the feature vectors in the data
first, the data in Figure 7 can be perfectly classified using SVM.
c) When the number of features is 3, to be able to learn any quadratic decision boundary (using SVM
in Figure 8), you can use the transform
Think of a ϕ that can be used when x ∈ R^n and you have to be able to learn all boundaries of degree
≤ d. What is the size of the output vector?³
d) Continuing the previous part, you have a Black Box that can solve Figure 8. What is the size of
this problem, i.e. the number of coefficients in the objective and constraints? How much time does
it take to compute all of them?
e) In the dual problem, notice that x appears only in the inner product. You have been given a ϕ
that maps x ∈ R^n to a space whose features are all products of x_1, x_2, . . . , x_n of degree ≤ d, and for which

⟨ϕ(x), ϕ(z)⟩ = (x^T z + c)^d    (18)

c is a hyperparameter that decides the weight given to terms of lower degree. You have a Black Box that
can solve Figure 9. Do you realize that, given Equation 18, you do not need to worry about the
exact form of ϕ? What is the size of the problem now? How much time does it take to compute
all the coefficients, without (a rough lower bound) and with the kernel trick in Equation 18?
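A small demonstration of this trade-off, assuming Equation 18 is the polynomial kernel: the explicit map ϕ below (one coordinate per ordered tuple of indices into [x, √c], so dimension (n+1)^d) reproduces ⟨ϕ(x), ϕ(z)⟩ exactly, but computing it is far more expensive than evaluating (x^T z + c)^d directly.

import numpy as np
from itertools import product
import time

def phi(x, c, d):
    # explicit feature map: one coordinate per ordered tuple (i_1, ..., i_d),
    # each coordinate being a product of entries of [x, sqrt(c)]; dimension (n + 1)^d
    x_ext = np.append(x, np.sqrt(c))
    return np.array([np.prod(x_ext[list(idx)]) for idx in product(range(len(x_ext)), repeat=d)])

rng = np.random.default_rng(0)
n, c, d = 8, 1.0, 4
x, z = rng.normal(size=n), rng.normal(size=n)

t0 = time.perf_counter()
explicit = phi(x, c, d) @ phi(z, c, d)
t1 = time.perf_counter()
kernel = (x @ z + c) ** d
t2 = time.perf_counter()

print("feature dimension:", (n + 1) ** d)              # 6561 here; grows like n^d
print("values agree:", np.isclose(explicit, kernel))
print(f"explicit: {t1 - t0:.6f} s, kernel trick: {t2 - t1:.6f} s")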
References
[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.
[Harshit0143, 2023] Harshit0143 (2023). COL774 Fall 2023-2024 Assignments. https://github.com/Harshit0143/Machine-Learning-Assignments.
[Ng and Ma, 2023] Ng, A. and Ma, T. (June 11, 2023). CS229 lecture notes. https://cs229.stanford.edu/main_notes.pdf.
[Petersen and Pedersen, 2021] Petersen, K. B. and Pedersen, M. S. (November 2021). The Matrix Cookbook. https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf.
[Ross, 2014] Ross, S. M. (2014). A First Course in Probability. Pearson, Boston, 9th edition.