COL774 Practice Problems


COL774: Machine Learning

Fall 2024-2025
Practice Problems

1 Introduction
1. You may refer to [Petersen and Pedersen, 2021] for matrix derivatives for the upcoming problems.

2. The dataset is denoted as D = {(x^(i), y^(i))}_{i=1}^m where x^(i) ∈ R^n and y^(i) ∈ R for Linear Regression. Further, y^(i) ∈ {0, 1} for Logistic Regression.
3. w ∈ R^n and w = [w_1 w_2 w_3 ... w_n]^T.

4. For Linear Regression, ŷ^(i) = w^T x^(i) + b.


5. For Logistic Regression, ŷ^(i) = g(w^T x^(i) + b), where g : R → R is an activation function.
6.

L_SSE(w, b) = (1/m) Σ_{i=1}^m (y^(i) − ŷ^(i))^2

L_Logistic(w, b) = −(1/m) Σ_{i=1}^m [ y^(i) log(ŷ^(i)) + (1 − y^(i)) log(1 − ŷ^(i)) ]

L_2(w, b) = w^T w

L_1(w, b) = Σ_{i=1}^n |w_i|

7. When the bias term is taken as a feature, we absorb b into w and the loss functions in item 6 can be denoted as L_SSE(w), L_Logistic(w), L_2(w), L_1(w), respectively.

2 Regression
2.1 GFLOPS Calculation
For the matrix product Y = XW, where X ∈ R^{m×n} and W ∈ R^{n×1}:

1. Using a Python implementation with nested for loops:


• Calculate the GFLOPS (Giga Floating-Point Operations Per Second) for the matrix multiplication operation. Determine the total number of floating-point operations involved, including both multiplication and addition operations, divide this by the time taken to execute the matrix multiplication, and convert the result to GFLOPS by dividing by 10^9.
2. Using Python’s numpy library:
• Similarly, calculate the GFLOPS for the matrix multiplication operation using numpy’s built-in
matrix multiplication function.
3. Comparison and Visualization:
• Plot a graph with the x-axis representing m (considering constant n) and the y-axis repre-
senting GFLOPS to illustrate the performance difference between the two implementations.
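A minimal timing sketch of both measurements (names and sizes here are illustrative; the FLOP count is taken as 2mn, i.e. mn multiplications plus mn additions):

import time
import numpy as np

def gflops_loops(X, W):
    # Matrix-vector product with nested Python loops; returns measured GFLOPS.
    m, n = X.shape
    Y = np.zeros((m, 1))
    start = time.perf_counter()
    for i in range(m):
        acc = 0.0
        for j in range(n):
            acc += X[i, j] * W[j, 0]      # one multiplication + one addition
        Y[i, 0] = acc
    elapsed = time.perf_counter() - start
    return (2 * m * n) / elapsed / 1e9

def gflops_numpy(X, W):
    # Same product using numpy's built-in matrix multiplication.
    start = time.perf_counter()
    Y = X @ W
    elapsed = time.perf_counter() - start
    return (2 * X.shape[0] * X.shape[1]) / elapsed / 1e9

if __name__ == "__main__":
    n = 512
    for m in (256, 1024, 4096):           # vary m with n held constant, as in item 3
        X = np.random.randn(m, n)
        W = np.random.randn(n, 1)
        print(m, gflops_loops(X, W), gflops_numpy(X, W))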

2.2 Binary Classification


For binary classification using the sigmoid function, write the gradient descent update rule for the
following methods:

1. Stochastic Gradient Descent (SGD):


Update the model parameters using one data point at a time.
2. Batch Gradient Descent:
Update the model parameters using the entire dataset.
3. Mini-Batch Gradient Descent:
Update the model parameters using a subset of the dataset (a mini-batch).
Use the sigmoid function σ(z) = 1 / (1 + e^{−z}).
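For reference, a minimal numpy sketch of one parameter update on a batch (names are illustrative; a batch of size 1 recovers SGD and the full dataset recovers Batch Gradient Descent):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X_batch, y_batch, lr):
    # One gradient descent update for logistic regression with the logistic loss.
    m = X_batch.shape[0]
    y_hat = sigmoid(X_batch @ w + b)               # predictions on the batch
    grad_w = X_batch.T @ (y_hat - y_batch) / m     # gradient of the loss w.r.t. w
    grad_b = np.sum(y_hat - y_batch) / m           # gradient of the loss w.r.t. b
    return w - lr * grad_w, b - lr * grad_b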

2.3 Regularized Linear Regression


For Linear Regression, obtain w∗ that minimizes LSSE (w) + λL2 (w). λ > 0 is fixed.
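A sketch of the closed form under the definitions in the Introduction, assuming the bias is absorbed into w and writing X for the m × n data matrix with rows x^(i)T and y for the vector of targets:

\nabla_w\big[L_{SSE}(w) + \lambda L_2(w)\big]
  = \tfrac{2}{m} X^T (Xw - y) + 2\lambda w = 0
\;\Longrightarrow\; w^* = \big(X^T X + m\lambda I\big)^{-1} X^T y,

where X^T X + mλI is invertible for any λ > 0.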

2.4 Regression Line passes through mean


For Linear Regression, show that the line (w*, b*) minimizing L_SSE(w, b) always satisfies

(w*)^T ( (1/m) Σ_{i=1}^m x^(i) ) + b* = (1/m) Σ_{i=1}^m y^(i)

Hint: See the equation obtained on setting

∂L_SSE(w, b)/∂b |_{(w*, b*)} = 0

2.5 Convex Functions
A function f : Rn → R is called convex if for all x1 , x2 ∈ Rn and θ ∈ [0, 1], the following inequality
holds:
f (θx1 + (1 − θ)x2 ) ≤ θf (x1 ) + (1 − θ)f (x2 )
This property ensures that the function lies below the straight line connecting any two points on its graph. Convex functions play a crucial role in optimization problems, as they ensure that any local minimum is also a global minimum.

2.5.1 Discuss convexity


1. f(x) = e^{ax}, f(x) = log(x), f(x) = x^3, in their respective domains.

(Use the second derivative of each function)

For the parts below, determine convexity by computing the Hessian.

2. f(x) = ∥x∥_2 = √(Σ_{i=1}^n x_i^2), where x ∈ R^n.

(This is convex, but you can't show it by computing the Hessian, as the function is not differentiable at x = 0. You'll need to prove it using the definition in Convex Functions and the Cauchy–Schwarz Inequality.)
3. f (x) = xT Ax, where A ∈ Rn×n , x ∈ Rn . A is Diagonal.

(Convex if and only if all entries of A are non-negative)

4. f (x) = xT Ax, where A ∈ Rn×n , x ∈ Rn . A is Symmetric.

(Convex if and only if all eigenvalues of A are non-negative)


5. f (x, y) = xT Ay, where A ∈ Rm×n , x ∈ Rm , and y ∈ Rn .

(f (x, y) is convex ⇐⇒ A = 0)

2.5.2 Convexity of Linear Combination


Using the definition of convexity given above, show that, if f : Rn → R and g : Rn → R are convex,
and a, b > 0, the function af + bg : Rn → R is convex, i.e., for any x, y ∈ Rn and λ ∈ [0, 1],

(af + bg)(λx + (1 − λ)y) ≤ λ(af + bg)(x) + (1 − λ)(af + bg)(y)

2.6 Convexity of Regression Objectives


By computing the Hessian, show that L_SSE(w), L_Logistic(w) and L_2(w) are convex. From the definition of convexity in Convex Functions, show that L_1(w) is convex (as it's not differentiable; Hint: use the triangle inequality).
Using Convexity of Linear Combination, do you see why all four functions below are convex?

{LSSE (w), LLogistic (w)} + λ{L2 (w), L1 (w)}

2.7 Probabilistic Interpretation of Regularization
In class we looked at the Maximum Likelihood Estimate of w for the Linear Regression problem. We considered w as a parameter and the optimum w_MLE (for a single example) was defined as

w_MLE = arg max_w P(y^(i) | x^(i); w)

Now, let's treat w as a Random Variable, distributed as (this is called a prior on w)

w ∼ N(0, Σ)

where Σ = diag{λ_1, λ_2, ..., λ_n} and λ_1, λ_2, ..., λ_n > 0. We now define w_MAP (MAP stands for Maximum a Posteriori) as

w_MAP = arg max_w P(y^(i) | x^(i), w) P(w)    (1)

Now, similar to the MLE case, we assume that, over all examples, y^(i) − ŷ^(i) | x^(i), w are i.i.d., distributed as

y^(i) − ŷ^(i) | x^(i), w ∼ N(0, σ^2)

a) Simplify Equation 1 for the entire dataset combined, and show that solving for w_MAP is similar to Linear Regression with L2 Regularization on w.
b) Obtain explicitly the prior on w that will lead to L1 Regularization of w (keep a different penalty coefficient for each of the n components of w). Does this distribution have a name?

Solution to the above part: it's called the Laplace Distribution.

p(w_1, w_2, ..., w_n) = ∏_{i=1}^n (λ_i/2) e^{−λ_i |w_i|}
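For part (a), a sketch of the key step, assuming (as in the MLE derivation from class) that the examples are independent and, for simplicity, that all λ_i = λ:

-\log\Big[\prod_{i=1}^m P\big(y^{(i)}\mid x^{(i)}, w\big)\,P(w)\Big]
  = \frac{1}{2\sigma^2}\sum_{i=1}^m \big(y^{(i)} - \hat{y}^{(i)}\big)^2
  + \frac{w^T w}{2\lambda} + \text{const},

so maximizing the posterior is the same as minimizing a positive multiple of L_SSE(w, b) + (σ^2/(mλ)) L_2(w).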

2.8 Logistic Regression in Perfectly Linearly Separable case

Figure 1: Data for Logistic Regression, features x1 and x2

Consider Logistic Regression on Perfectly Linearly Separable data, as shown in Figure 1. You
may not assume that there are just 2 features as shown in the figure.

a) What is the condition (possibly implicit) for data to be linearly separable, when the features x^(i) ∈ R^n?
b) Consider solving for the decision boundary (w, b) for such data. Show that ∀ϵ > 0, ∃(w_ϵ, b_ϵ) such that L_Logistic(w_ϵ, b_ϵ) < ϵ.

c) Consider running gradient descent, to obtain (w, b). Assuming the learning rate is small, will the
algorithm ever converge in (w, b)? (Answer: w and b diverge. You may skip it as it’s difficult to
prove.)

2.9 Logistic Regression for multiple classes


Consider an r-class classification problem. We build a model that predicts the probability of an input belonging to each of the r classes (similar to Logistic Regression).
The model is parametrized by r weight vectors, w1 , w2 ...wr . For an input x(i) , the model is evaluated
as,

z_j^(i) = w_j^T x^(i)    for j = 1, 2, ..., r    (2)

ŷ_j^(i) = e^{z_j^(i)} / Σ_{k=1}^r e^{z_k^(i)}    for j = 1, 2, ..., r    (3)

Equation 3 is the Softmax function. (Why Softmax?) The loss function, for a single example (i),
y (i) ∈ {1, 2....r} is
L^(i)(w_1, w_2, ..., w_r) = − Σ_{j=1}^r 1{y^(i) = j} log(ŷ_j^(i))    (4)

Where 1(.) is the indicator function.


1(x) = 1 if x is True, and 1(x) = 0 if x is False.

Equation 4 is known as the Cross Entropy Loss function. Now, define


L(w_1, w_2, ..., w_r) = Σ_{i=1}^m L^(i)(w_1, w_2, ..., w_r)

a) Is L(w1 , w2 ...wr ) convex? (Answer: Yes! But you may skip it as the proof gets very involved.)
b) Obtain explicitly ∇_{w_1}L, ∇_{w_2}L, ..., ∇_{w_r}L. Hint: Start by writing L in terms of (w_1, w_2, ..., w_r).
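For part (b), the gradients come out to ∇_{w_j} L = Σ_{i=1}^m (ŷ_j^(i) − 1{y^(i) = j}) x^(i); a minimal numpy sketch of that expression (labels 0-indexed here, names illustrative):

import numpy as np

def softmax_grads(W, X, y, r):
    # W: (r, n) with rows w_1..w_r, X: (m, n), y: (m,) integer labels in {0, ..., r-1}.
    # Returns G of shape (r, n) where G[j] is the gradient of L w.r.t. w_j.
    Z = X @ W.T                                                 # z_j^(i) = w_j^T x^(i)
    Z -= Z.max(axis=1, keepdims=True)                           # stabilize the exponentials
    Y_hat = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)    # softmax, Equation 3
    Y_onehot = np.eye(r)[y]                                     # 1{y^(i) = j}
    return (Y_hat - Y_onehot).T @ X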

2.10 LASSO for feature selection


[Goodfellow et al., 2016] have mentioned in the chapter Regularization for Deep Learning,
”the well known LASSO (Tibshirani, 1995) (least absolute shrinkage and selection operator) model
integrates an L1 penalty with a linear model and a least-squares cost function. The L1 penalty causes a
subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.”

We'll see in this problem how. Let a > 0, α > 0, and b, c ∈ R, with

w* = arg min_x  a x^2 + b x + c

w(α) = arg min_x  a x^2 + b x + c + α|x|

a) Compute w(α) in terms of α, a, w∗ . What do you conclude?


b) Now let’s consider multiple features. Let A = diag(a1 , a2 . . . , an ), ai > 0 ∀i and b ∈ Rn .

w* = arg min_{x ∈ R^n}  x^T A x + b^T x + c

w(α) = arg min_{x ∈ R^n}  x^T A x + b^T x + c + α ∥x∥_1

Show that

w(α)_i = sign(w*_i) max(0, |w*_i| − α/a_i)
What happens to components of w(α) corresponding to smaller diagonal entries of A? How can
this be used for feature selection?
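A small numeric illustration of the shrinkage map in the form stated above (names illustrative): components with small a_i are zeroed out first, which is the feature-selection effect.

import numpy as np

def soft_threshold(w_star, alpha, a_diag):
    # Coordinate-wise shrinkage: sign(w*_i) * max(0, |w*_i| - alpha / a_i).
    return np.sign(w_star) * np.maximum(0.0, np.abs(w_star) - alpha / a_diag)

w_star = np.array([2.0, -0.5, 0.3])
a_diag = np.array([4.0, 1.0, 0.2])
print(soft_threshold(w_star, alpha=0.4, a_diag=a_diag))   # the last component collapses to 0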

2.11 Limit of Locally Weighted Linear Regression


Recall Locally Weighted Linear Regression done in class. We'll understand what happens in the limits of the hyper-parameter, σ → ∞ and σ → 0+. Consider points x_1 < x_2 < x_3 < ... < x_m ∈ R and let λ > 0 be fixed. We need to regress over the points (x_1, y_1), (x_2, y_2), ..., (x_m, y_m). The loss function, for a point x ∈ R, is

α_{σ,i}(x) = e^{−(x−x_i)^2/(2σ^2)} / Σ_{j=1}^m e^{−(x−x_j)^2/(2σ^2)}

ŷ_i = w x_i + b

L_σ(x, w, b) = (1/m) Σ_{i=1}^m α_{σ,i}(x) (y_i − ŷ_i)^2 + λ w^2

Denote

(w_σ(x), b_σ(x)) = arg min_{w,b} L_σ(x, w, b)

f_σ(x) = w_σ(x) x + b_σ(x)

a) Obtain f∞ (x) = limσ→∞ fσ (x).


b) Obtain f0 (x) = limσ→0+ fσ (x).

2.12 Unique Minima
In class we discussed that the problem of Linear Regression has a unique optimum. We’ll try to prove
that formally and see what happens to stationary points when convexity is not guaranteed. Consider
the function
f(x) = (1/2) x^T A x − b^T x + c    (5)
Where c ∈ R, x ∈ Rn , b ∈ Rn , A ∈ Rn×n and is Symmetric. All eigenvalues of A are positive. We
define x∗ ∈ Rn such that ∇x f |x=x∗ = 0.

a) Show that A is invertible and hence x∗ exists uniquely.

f (x) − f (x∗ ) > 0 ∀x ∈ Rn − {x∗ } (6)

b) We'll now prove Equation 6. Set x = x* + ϵz, z^T z = 1, ϵ > 0 in the equation, and show that

f(x* + ϵz) − f(x*) = (ϵ^2/2) z^T A z    (7)

Assume the fact, ”Every n × n Real Symmetric Matrix is Diagonalizable and has a set of n
Orthonormal Eigenvectors”. Express z as
z = Σ_{i=1}^n a_i v_i

where v_i^T v_j = δ_ij and λ_i is the Eigenvalue of A corresponding to v_i. Show that

f(x* + ϵz) − f(x*) = (ϵ^2/2) Σ_{i=1}^n a_i^2 λ_i    (8)

c) Now, let all eigenvalues of A be non-zero but not all of the same sign. Obtain explicitly at least one z ∈ R^n such that ∥z∥_2 = 1 and f(x* + ϵz) − f(x*) < 0 for some ϵ > 0.
d) Show that the Linear Regression objective can be expressed as in Equation 5. Does the condition "A is Symmetric. All eigenvalues of A are positive", assumed here, always hold in Linear Regression? Clearly describe the condition on the data when this isn't true. We'll derive these conditions in Multiple Minima.

2.13 Multiple Minima


In this problem we'll see what happens when A in Unique Minima has some zero eigenvalues. Consider the data matrices X ∈ R^{m×n} and Y ∈ R^m, where X has rows x^(1)T, x^(2)T, ..., x^(m)T and Y has entries y^(1), y^(2), ..., y^(m).

Absorb the bias term in Linear Regression into x(i) ’s. We derived the condition for the optimal w∗
as (derive it yourself if you don’t remember this),

XT Xw∗ = XT y (9)

When XT X is invertible, we get


w∗ = (XT X)−1 XT y (10)
We'll now examine precisely when X^T X is not invertible.

a) Show that Equation 9 has at least one solution.


b) Show that all w∗ satisfying Equation 9 lead to the same LSSE (w∗ ) value.
c) Show that a square matrix A ∈ Rn×n is not invertible if and only if Av = 0 for some non-zero
v ∈ Rn .
d) For any v ∈ R^n, show that

X^T X v = 0 ⇐⇒ X v = 0    (11)

If you need a hint: for z ∈ R^n, z^T z = 0 ⇐⇒ z = 0. If you're up for a challenge (this will not be used in later parts), let C ∈ R^{m×m} be a fixed, Symmetric Positive Definite matrix, and show that

X^T C X v = 0 ⇐⇒ X v = 0

e) Now let X^T X be non-invertible. Using the above part, what do you conclude about the columns of X, the data matrix (note that there's also the column of all 1's in X)? Let w_1* be one w minimizing L_SSE(w); obtain another.
f) Continuing item e), if you have to choose a subset of the initial n features, for regression, without
losing any information, how would you do that?
g) Following the above part, can you see why assuming A in Unique Minima to be positive definite, instead of positive semi-definite, is reasonable and doesn't lead to loss of generality?
h) Redo all above parts for Weighted Linear Regression, with example-wise weights, u1 , u2 ...um , all
positive. You’ll see all these results hold true for this case too!

2.14 Is this Linear Regression?


Consider this regression problem. Here n > 4 and the prediction

ŷ (i) = wT x(i)

We have 4 Linearly Independent Constraints on w represented as

Mw = 0

i.e. M ∈ R4×n and rank(M) = 4. Under the above constraints obtain the optimum value of w that
minimizes these loss functions.

a) L_SSE(w)

b) L_SSE(w) + λ w^T A w

where λ ∈ R+ and A, a Symmetric Positive Definite matrix, are constants.

Note that it is not intended that one use Lagrange Multipliers to solve this.

3 Neural Networks
3.1 Backpropagation in Neural Networks
Following the terminology used in class, write out Backpropagation for Neural Networks yourself and hence derive the update rule for each weight and bias in the network. Consider the r-class classification problem with Softmax in the last layer and the Cross Entropy loss function.

3.2 Backpropagation in CNN’s


Can you point out what makes backpropagation in CNNs more difficult/worrisome than in vanilla neural networks? Parameter Sharing! The weight (w) and bias (b) matrices are shared by every "patch" in a feature map. Consider the input feature map x of size h1 × w1 × d1. Denote the loss function on the final output as L. We have the following layers.
• Layer 1: The convolution filter of height and width a each with stride s runs on x. Assume s
divides (h1 − a) and (w1 − a). The output feature map, called ẑ has depth d2 .
– Write down the exact dimensions of w, b and ẑ.
– Write the expression for ẑijk for each feature in ẑ.
– Assuming ∂L/∂ẑ_ijk is known for all i, j, k in the size of ẑ, obtain ∂L/∂w_pqrk, ∂L/∂b_rk and ∂L/∂x_abr.

• Layer 2: Assume ẑ has size h2 × w2 × d2. A max pooling filter of shape a × a with stride a runs over ẑ to produce z. Assume a divides h2 and w2.
– Write down the exact dimensions of z and the expression for each z_ijk in terms of ẑ.
– Assuming ∂L/∂z_ijk is known for all i, j, k in the size of z, obtain ∂L/∂ẑ_pqk.

• Layer 3: Assume z has size h3 × w3 × d3 . The Sigmoid activation is applied to z pointwise, to


obtain y.
– Write down the exact dimensions of y and the expression for each yijk in terms of z.
– Assuming ∂L/∂y_ijk is known for all i, j, k in the size of y, obtain ∂L/∂z_pqk.

• Realize that ”sliding” the Convolution Filter over the input feature map is just for intuition. All
features in the output feature map can be computed in parallel.
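To make the Layer 1 bookkeeping concrete, here is a naive, loop-based sketch of the forward pass (0-indexed; the shapes used are one common convention consistent with the description above, with w of size a × a × d1 × d2 and b of size d2 assumed here):

import numpy as np

def conv_forward(x, w, b, s):
    # x: (h1, w1, d1), w: (a, a, d1, d2), b: (d2,), stride s.
    # Returns z_hat of shape ((h1-a)//s + 1, (w1-a)//s + 1, d2).
    h1, w1, d1 = x.shape
    a, _, _, d2 = w.shape
    h2, w2 = (h1 - a) // s + 1, (w1 - a) // s + 1
    z_hat = np.zeros((h2, w2, d2))
    for i in range(h2):
        for j in range(w2):
            patch = x[i*s:i*s+a, j*s:j*s+a, :]              # the (i, j)-th patch
            for k in range(d2):
                # the same w[..., k] and b[k] are reused for every patch: parameter sharing
                z_hat[i, j, k] = np.sum(patch * w[:, :, :, k]) + b[k]
    return z_hat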

3.3 Building complex Decision Boundaries


We’ll denote a neuron with n inputs as fn i.e. fn : Rn → {0, 1} and fn is designed using weight
w ∈ Rn and bias b. For x ∈ Rn ,
fn (x) = 1{wT x+b>0}
For the below problems we’ll ignore classifying on exactly the Decision Boundary.

a) Using a single neuron, implement the OR and AND gates, i.e. when x ∈ {0, 1}^n, f_n(x) should output the corresponding boolean logic.

b) Realize that using a single neuron, fn you can implement every possible linear decision boundary in
Rn . Note that for a single geometric hyperplane in Rn , there are two possible decision boundaries
(See Figure 2 and Figure 3).

c) If you want to partition R^2 into different regions using straight lines, each region classified as one of {0, 1}, you can draw these lines and use boolean logic to model on which side a point in R^2 lies w.r.t. each line (one of {0, 1}). Taking ideas from the above parts, draw a 2-layer network with at most 3 neurons that can draw every decision boundary shaped like Figure 5 and Figure 4. Can the same network be used to draw decision boundaries like Figure 2 and Figure 3? Can this network draw Figure 6?
d) Show why you can't implement the XOR gate using a single neuron. Now, use two layers of neurons (at most 3 neurons in total) to implement XOR, for the case x ∈ {0, 1}^2.
e) Using at most 3 layers and 5 neurons, implement all decision boundaries like Figure 6. Will this also represent all decision boundaries in Figure 2 through Figure 6?

f) Think about the minimum number of layers and neurons needed to implement any boolean function of n variables. Using the ideas above, can you design a Neural Network that can perfectly classify any m points in R^2, without knowing their positions and labels beforehand?
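A small sketch for parts (a) and (d), with one possible (non-unique) choice of weights and biases:

import numpy as np

def neuron(w, b):
    # A single threshold neuron f_n(x) = 1{w^T x + b > 0}.
    return lambda x: int(np.dot(w, x) + b > 0)

# Part (a): one neuron each for AND and OR on {0, 1}^2.
AND = neuron(np.array([1.0, 1.0]), -1.5)
OR  = neuron(np.array([1.0, 1.0]), -0.5)

# Part (d): XOR with two layers, 3 neurons in total: XOR(x) = OR(x) AND NOT AND(x).
def XOR(x):
    h = np.array([OR(x), AND(x)])               # hidden layer: 2 neurons
    out = neuron(np.array([1.0, -1.0]), -0.5)   # output neuron: h1 AND (NOT h2)
    return out(h)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, AND(np.array(x)), OR(np.array(x)), XOR(np.array(x)))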

Figure 2: Boundary 1 Figure 3: Boundary 2

Figure 4: Boundary 3 Figure 5: Boundary 4

Figure 6: Boundary 5

3.4 Dropouts in Linear Regression


[Goodfellow et al., 2016] have mentioned in the chapter Regularization for Deep Learning,

”when applied to linear regression, dropout is equivalent to L2 weight decay, with a different weight
decay coefficient for each input feature”

We'll prove this here. Consider data {(x^(i), y^(i))}_{i=1}^m, and w, b with dimensions as described in the Introduction. We'll apply dropout on the input features x_j of the model. We'll call our model M. In training mode, for each forward pass, each of the n features is independently kept with a fixed probability p ∈ (0, 1) and dropped otherwise. During evaluation, all the features are used but the corresponding weights are scaled as w_i → p w_i, as done in dropout to prevent inflating the activation (here the Identity).

a) While training, notice that the output M(x) is a random variable. Show that the model output
can be expressed precisely as
M(x) = wT (x ⊙ d) + b
Where ⊙ represents the element-wise product and d ∈ {0, 1}n is a random vector.
b) Obtain the probability distribution of d. Hence compute E[d] and E[ddT ].

c) Prove these basic properties (x is a constant):

(x ⊙ d)(x ⊙ d)^T = (x x^T) ⊙ (d d^T)

E[x ⊙ d] = x ⊙ E[d]

Σ_x (d ⊙ x) = d ⊙ (Σ_x x)

d) Now, while training, we'll use the SSE loss, i.e.

L^(i)(w, b) = (M(x^(i)) − y^(i))^2

L(w, b) = (1/m) Σ_{i=1}^m L^(i)(w, b)

Notice that L(w, b) is a random variable. Show that

E[L(w, b)] = (1/m) Σ_{i=1}^m (p w^T x^(i) + b − y^(i))^2 + p(1 − p) Σ_{j=1}^n α_j w_j^2

where α_j = (1/m) Σ_{i=1}^m (x_j^(i))^2

e) Express L(w, b) as L̃(w̃, b) where w̃ = pw (because we'll finally use w̃ for evaluation). Assume all n features have been shifted to zero data mean. Show that

E[L̃(w̃, b)] = (1/m) Σ_{i=1}^m (w̃^T x^(i) + b − y^(i))^2 + ((1 − p)/p) Σ_{j=1}^n σ_j^2 w̃_j^2    (12)

where σ_j is the data standard deviation of the j-th feature.

f) Observe that w̃ = pw =⇒ L(w, b) = L̃(w̃, b), hence minimizing L(w, b) over (w, b) is equivalent to
minimizing L̃(w̃, b) over (w̃, b). Clearly Equation 12 shows that using dropouts in Linear Regression
has the effect of L2 regularization on w̃. What can you say about the penalty corresponding to
two different features relatively?
g) Think about the strength of regularization in the limits, p → 0+ and p → 1− . Was this expected?
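A quick Monte Carlo sanity check of the identity in part (d), as a sketch (synthetic data; the mask is drawn independently per example and per pass):

import numpy as np

rng = np.random.default_rng(0)
m, n, p = 200, 5, 0.7
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
w = rng.normal(size=n)
b = 0.3

def dropout_loss(w, b, X, y, p, rng):
    # One realization of L(w, b): each feature kept independently with probability p.
    d = (rng.random(X.shape) < p).astype(float)
    return np.mean(((X * d) @ w + b - y) ** 2)

mc = np.mean([dropout_loss(w, b, X, y, p, rng) for _ in range(10000)])

# Closed form from part (d): (1/m) sum_i (p w^T x^(i) + b - y^(i))^2 + p(1-p) sum_j alpha_j w_j^2
alpha = np.mean(X ** 2, axis=0)
closed = np.mean((p * (X @ w) + b - y) ** 2) + p * (1 - p) * np.sum(alpha * w ** 2)

print(mc, closed)   # the two values should agree closely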

4 Gaussian Discriminant Analysis and Naive Bayes
4.1 Distribution of Multivariate Gaussian
Let the random vector X = [X_1 X_2 ... X_n]^T. Let Z ∈ R^l be the standard normal random vector (i.e. Z_i ∼ N(0, 1) for i = 1, ..., l are i.i.d.). Then X_1, ..., X_n are jointly Gaussian if there exist µ ∈ R^n, A ∈ R^{n×l} such that X = AZ + µ.

We will see how this leads to the famous multivariate Gaussian distribution when the covariance matrix Σ is invertible. The joint moment generating function of random variables X_1, X_2, ..., X_n is defined as

M(t_1, t_2, ..., t_n) = E[e^{t_1 X_1 + t_2 X_2 + ··· + t_n X_n}]
[Ross, 2014] in ”Chapter 7 Properties of Expectation” has mentioned

It can be proven (although the proof is too advanced for this text) that the joint moment generating
function M (t1 , . . . , tn ) uniquely determines the joint distribution of X1 , . . . , Xn .
Denote t = [t_1 t_2 ... t_n]^T. Given the facts

Y ∼ N(µ, σ^2)  ⟹  E[exp(Y)] = exp(µ + σ^2/2)

∫_{x ∈ R^n} f(x) d^n x = ∫_{α ∈ R^n} f(Qα + b) |Q| d^n α

where b ∈ R^n and Q ∈ R^{n×n} is invertible.

a) Prove that

E[exp(t^T X)] = exp(t^T µ + (t^T A A^T t)/2)
b) Prove that if the PDF of X is

f_X(x) = (1 / (|Σ|^{1/2} (2π)^{n/2})) exp(−(x − µ)^T Σ^{−1} (x − µ) / 2)

then

E[exp(t^T X)] = exp(t^T µ + (t^T Σ t)/2)

If you need a hint: Σ is symmetric positive definite, so it has a unique symmetric positive definite square root; write the integral and substitute x = Σ^{1/2} α + (µ + Σt).
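For part (a), a sketch of the argument: write t^T X = t^T µ + (A^T t)^T Z and note that (A^T t)^T Z is a univariate normal with mean 0 and variance ∥A^T t∥_2^2 = t^T A A^T t, so the first fact above gives

E\big[\exp(t^T X)\big] = \exp(t^T\mu)\,E\big[\exp\big((A^T t)^T Z\big)\big]
  = \exp\Big(t^T\mu + \frac{t^T A A^T t}{2}\Big).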

4.2 GDA: Introduction


The GDA model learns the parameters ϕ ∈ (0, 1), µ0 , µ1 ∈ Rn , and Σ0 , Σ1 ∈ Rn×n . Σ0 , Σ1 should
be symmetric positive definite. Here x ∈ Rn .

y ∼ Bernoulli(ϕ)
x|y = 0 ∼ N (µ0 , Σ0 )
x|y = 1 ∼ N (µ1 , Σ1 )

4.3 Decision Boundary in GDA
Show that the decision boundary of GDA has equation

(1/2) x^T (Σ_1^{−1} − Σ_0^{−1}) x − (µ_1^T Σ_1^{−1} − µ_0^T Σ_0^{−1}) x + (1/2) (µ_1^T Σ_1^{−1} µ_1 − µ_0^T Σ_0^{−1} µ_0) + ln( ((1 − ϕ)/ϕ) √(|Σ_1|/|Σ_0|) ) = 0
Give an example of values of ϕ, µ0 , µ1 , Σ0 , Σ1 so that the decision boundary has equation
wT x + b = 0

4.4 Logistic Regression is robust!


[Ng and Ma, 2023] mentioned in Generative learning algorithms
”There are many different sets of assumptions that would lead to p(y|x) taking the form of a logistic
function.”
a) Consider the modeling assumptions of GDA: Introduction. Find the expression for p(y|X = x).
Constrain Σ0 = Σ1 = Σ. Show that it takes the logistic form
p(y = 1 | x; ϕ, µ_0, µ_1, Σ) = 1 / (1 + e^{−(w^T x + b)})
b) Let λ0 , λ1 > 0, 0 < ϕ < 1 and
y ∼ Bernoulli(ϕ)
X|y = 0 ∼ Poisson(λ0 )
X|y = 1 ∼ Poisson(λ1 )
Show that p(y|X = x) takes the logistic form
p(y = 1 | X = x; ϕ, λ_0, λ_1) = 1 / (1 + e^{−(wx + b)})
c) Despite the result in (a), training Logistic Regression and GDA on the same data does not, in general, learn the same boundary. Why? Hint: Write down the respective loss functions.
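For part (a), a sketch of the standard manipulation: by Bayes' rule, and since the quadratic terms in x cancel when Σ_0 = Σ_1 = Σ,

p(y=1\mid x) = \frac{p(x\mid y=1)\,\phi}{p(x\mid y=1)\,\phi + p(x\mid y=0)\,(1-\phi)}
  = \frac{1}{1+\exp\big(-(w^T x + b)\big)},

with w = Σ^{−1}(µ_1 − µ_0) and b = (1/2)(µ_0^T Σ^{−1} µ_0 − µ_1^T Σ^{−1} µ_1) + ln(ϕ/(1 − ϕ)).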

4.5 Probabilistic Interpretation of Laplace Smoothing


See 4.2.2 Event models for text classification in [Ng and Ma, 2023]. We have a training set {(x^(i), y^(i)); i = 1, ..., m} where x^(i) = (x_1^(i), ..., x_{d_i}^(i)). Here, d_i is the number of words in the i-th training example. The parameters ϕ, ϕ_{k0}, ϕ_{k1}, k ∈ {1, 2, ..., |V|} have the meaning discussed in class. Denote Φ_0 = (ϕ_{10}, ..., ϕ_{|V|0}) and Φ_1 = (ϕ_{11}, ..., ϕ_{|V|1}). We will treat them as random variables with the prior distributions below and obtain the MAP estimates.
ϕ ∼ U(0, 1)

p(Φ_0) ∝ ∏_{k=1}^{|V|} (ϕ_{k0})^{a_0}    for 0 < ϕ_{k0} < 1,  Σ_k ϕ_{k0} = 1

p(Φ_1) ∝ ∏_{k=1}^{|V|} (ϕ_{k1})^{a_1}    for 0 < ϕ_{k1} < 1,  Σ_k ϕ_{k1} = 1

where a_0, a_1 > 0 are fixed. Φ_0 and Φ_1 follow the distributions Dirichlet(a_0 + 1, ..., a_0 + 1) and Dirichlet(a_1 + 1, ..., a_1 + 1) respectively (more popularly, the Beta distribution when |V| = 2). Assume that

• All training examples are independent.

• There is at least one example of each class.

• Each word in vocabulary V is present in at least one example of each class.


• ϕ, Φ0 and Φ1 are independent.

a) Show that

p(ϕ, Φ_0, Φ_1 | {(x^(i), y^(i))}_{i=1}^m) = [ p(ϕ) p(Φ_0) p(Φ_1) ∏_{i=1}^m p(x^(i), y^(i) | ϕ, Φ_0, Φ_1) ] / p({(x^(i), y^(i))}_{i=1}^m)

b) The naive Bayes assumption is that the words (random variables) in different positions X_1, X_2, ..., X_{d_i}, given y, are independent. Show that

p(x^(i), y^(i) | ϕ, Φ_0, Φ_1) = ϕ^{y^(i)} (1 − ϕ)^{(1 − y^(i))} ∏_{j=1}^{d_i} ∏_{k=1}^{|V|} (ϕ_{k0})^{1{x_j^(i) = k ∧ y^(i) = 0}} (ϕ_{k1})^{1{x_j^(i) = k ∧ y^(i) = 1}}

c) Show that arg max_{ϕ, Φ_0, Φ_1} p(ϕ, Φ_0, Φ_1 | {(x^(i), y^(i))}_{i=1}^m) gives

ϕ_{k0} = ( a_0 + Σ_{i=1}^m Σ_{j=1}^{d_i} 1{x_j^(i) = k ∧ y^(i) = 0} ) / ( a_0 |V| + Σ_{i=1}^m 1{y^(i) = 0} d_i )

ϕ_{k1} = ( a_1 + Σ_{i=1}^m Σ_{j=1}^{d_i} 1{x_j^(i) = k ∧ y^(i) = 1} ) / ( a_1 |V| + Σ_{i=1}^m 1{y^(i) = 1} d_i )

ϕ = ( Σ_{i=1}^m 1{y^(i) = 1} ) / m

Keep in mind that the parameters are constrained: Σ_{k=1}^{|V|} ϕ_{k0} = Σ_{k=1}^{|V|} ϕ_{k1} = 1

5 Principal Component Analysis
5.1 Variance and covariance of projected features
Consider the data D = {x^(i)}_{i=1}^m. Let u, v ∈ R^n be such that ∥u∥_2 = ∥v∥_2 = 1. Σ ∈ R^{n×n} is the covariance of D. We now look at features formed by projecting D along the directions u and v.
a) Show that
Var(uT x(i) ) = uT Σu
Cov(uT x(i) , vT x(i) ) = uT Σv

b) Let v_1, v_2, ..., v_n be eigenvectors of Σ, chosen orthonormal. Show that

Cov(v_i^T x^(i), v_j^T x^(i)) = 0,  i ≠ j

Σ_{j=1}^n Var(v_j^T x^(i)) = Σ_{j=1}^n Var(e_j^T x^(i))

by finding Var(v_i^T x^(i)) explicitly. Here e_1, e_2, ... are respectively [1 0 ... 0]^T, [0 1 ... 0]^T and so on.
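For part (a), a one-line sketch, assuming Σ is the (1/m)-normalized sample covariance and µ the data mean:

\mathrm{Var}(u^T x^{(i)}) = \frac{1}{m}\sum_{i=1}^m \big(u^T(x^{(i)}-\mu)\big)^2
  = u^T\Big(\frac{1}{m}\sum_{i=1}^m (x^{(i)}-\mu)(x^{(i)}-\mu)^T\Big)u = u^T\Sigma u,

and the same manipulation with u^T(x^{(i)} − µ) · v^T(x^{(i)} − µ) gives Cov(u^T x^(i), v^T x^(i)) = u^T Σ v.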

5.2 Projection Loss and maximizing variance


Consider the data D = {(x^(i), y^(i))}_{i=1}^m, with x^(i) ∈ R^n and y^(i) ∈ R. Let µ = (1/m) Σ_{i=1}^m x^(i).
We have k ≤ n and the set of constraints

u_i ∈ R^n    (13)

u_i^T u_i = 1,  i ∈ {1, 2, ..., k}    (14)

u_i^T u_j = 0,  i, j ∈ {1, 2, ..., k}, i ≠ j    (15)

The projection x̂^(i) of x^(i) is given as

x̂^(i) = µ + Σ_{j=1}^k (u_j^T (x^(i) − µ)) u_j

Denote x̂ = (1/m) Σ_{i=1}^m x̂^(i). We have the two optimization problems (under the constraints given before):

1. Maximizing projected variance

   max_{u_1, u_2, ..., u_k}  (1/(m·n)) Σ_{i=1}^m ∥x̂^(i) − x̂∥_2^2    (16)

2. Minimizing projection loss

   min_{u_1, u_2, ..., u_k}  Σ_{i=1}^m ∥x^(i) − x̂^(i)∥_2^2    (17)

a) Assume all x^(i)'s are exactly the same (i.e. the total variance is 0) but all y^(i)'s are distinct. Can you learn anything from this data? Do you realize why variance in the data is related to the information it gives?
b) Show that both the problems above are equivalent. That is, any set {u_1, u_2, ..., u_k} maximizing Equation 16 minimizes Equation 17 and vice-versa.

5.3 Is the solution unique?
Look at the optimization problem described in Projection Loss and maximizing variance for minimizing projection loss, Equation 17. {u_1, u_2, ..., u_k} is a set of orthonormal vectors. Denote U ∈ R^{n×k} as the matrix with columns u_i's in that order. Let A ∈ R^{k×k} be orthonormal, and U′ = UA. Let {u′_1, u′_2, ..., u′_k} be the columns of U′.
a) Show that {u′1 , u′2 , . . . , u′k } are orthonormal.
b) Show that both {u1 , u2 , . . . , uk } and {u′1 , u′2 , . . . , u′k } give the same objective value in Equation 17.
You already know that taking the eigenvectors of Σ, the data covariance, as {u_1, u_2, ..., u_k} minimizes the projection loss. Can you now comment on whether this is the only solution to the problem of minimizing projection loss? We asked you the case k = 1 of this problem in Quiz 10!

With some more work, you can show that span({u1 , u2 , . . . , uk }) = span({u′1 , u′2 , . . . , u′k }). Using this,
grasp the fact that while minimizing projection loss, what we’re actually looking for is a subspace to
project the data onto. The Principal Components are actually a more informative representation of this
subspace, because for every k ≤ n, the span of the top k eigenvectors is the optimal subspace.

6 Decision Trees
6.1 Fast Greedy Pruning
Consider data of the form {(x^(i), y^(i))}_{i=1}^m where x^(i) is an n-dimensional feature vector and y^(i) ∈ {0, 1}. Denote the validation set by V. We wish to implement the Greedy Pruning algorithm to prune a trained decision tree T using the validation set V.

In each iteration, we consider all non-leaf nodes in T, and remove the entire subtree of the node (hence making it a leaf) that leads to the maximum increase in accuracy on V. The process ends when pruning any remaining node in the tree would lead to a decrease in accuracy on V.

Naively implementing this algorithm takes O(n_iters · |V| · |T|^2) time: in each iteration we temporarily prune a node and run V through the tree to get the accuracy on V, and finally we actually prune the node chosen by the rule described above.

Think of an algorithm that can do this in O(|V| · depth(T) + |T| + n_iters · |T| · depth(T)) time. Realize why this is a very significant speed-up.

You can find a well-tested Python implementation of this algorithm in [Harshit0143, 2023]. Hints for the algorithm:
1. During the pruning process, the split rule at every node remains the same. The set of examples
from V that reach any node of T during pruning remains exactly the same.
2. Store n_0, n_1, n_true: the number of examples from class 0 and class 1 in V that reach each node, and the number of these that are correctly classified, respectively. This can be done in O(|V| · depth(T) + |T|) time.
3. If you prune a non-leaf node v, then n_0, n_1 remain unchanged throughout the tree. For which nodes does n_true change? You can update this value for all such nodes in total O(depth(T)) time.
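A sketch of the counting pass in hint 2, under an assumed node interface (node.route(x) returns the child that x is sent to, or None at a leaf); one pass of V costs O(|V| · depth(T)):

from collections import defaultdict

def validation_counts(root, X_val, y_val):
    # Route each validation example down the tree once, accumulating per-node
    # class counts n0, n1 (hint 2). node.route(x) is an assumed interface.
    counts = defaultdict(lambda: [0, 0])      # id(node) -> [n0, n1]
    for x, y in zip(X_val, y_val):
        node = root
        while node is not None:
            counts[id(node)][y] += 1
            node = node.route(x)
    return counts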

7 Support Vector Machines
7.1 SVM Kernels: But why bother duals?

Figure 7: Classification data for SVM

min_{w,b}  (1/2) ∥w∥_2^2
s.t.  y^(i) (w^T x^(i) + b) ≥ 1,  i ∈ {1, 2, ..., m}

Figure 8: SVM: Primal Problem

max_α  W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩
s.t.  α_i ≥ 0,  i ∈ {1, 2, ..., m}
      Σ_{i=1}^m α_i y^(i) = 0.

Figure 9: SVM: Dual Problem

Look at the data in Figure 7. The feature vector is expressed as x = [x_1 x_2]^T. The data is generated by sampling x from N(0, I) and keeping only points with |x_1| ≤ 5 and |x_2| ≤ 5. Points within the circle with center (0, 0) and radius 3 are labeled red and those outside the circle of radius 4 are labeled blue. Other points are ignored.

a) Can simple SVMs (without using kernels) be used to classify the data in Figure 7 perfectly?
b) Suggest a transformation ϕ : R2 → Rk such that when applied to the feature vectors in the data
first, the data in Figure 7 can be perfectly classified using SVM.
c) When the number of features is 3, to be able to learn any quadratic decision boundary (using SVM in Figure 8), you can use the transform

ϕ([x_1 x_2 x_3]^T) = [x_1^2 x_2^2 x_3^2 x_1x_2 x_2x_3 x_3x_1 x_1 x_2 x_3]^T

Think of a ϕ that can be used when x ∈ R^n and you have to be able to learn all boundaries of degree ≤ d. What is the size of the output vector? (Hint: the number of terms in the expansion of (x_1 + x_2 + · · · + x_n + c)^d, where c is a constant.)

d) Continuing the previous part, you have a Black Box that can solve Figure 8. What is the size of this problem, i.e. the number of coefficients in the objective and constraints? How much time does it take to compute all of them?
e) In the dual problem, notice that x appears only in the inner product. You have been given a ϕ that maps x ∈ R^n to a space whose features are all products of x_1, x_2, ..., x_n of degree ≤ d.

⟨ϕ(x), ϕ(z)⟩ = (xT z + c)d (18)

c is a hyperparameter that decides weight for terms of lower degrees. You have a Black Box that
can solve Figure 9. Do you realize that, given Equation 18, you do not need to worry about the
exact form of ϕ? What is the size of the problem now? How much time does it take to compute
all the coefficients, without (a rough lower bound) and with the kernel trick in Equation 18?
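For intuition on parts (a) and (b), a small sketch using a polynomial kernel (scikit-learn assumed available; the sampling scheme below is only a stand-in for the data in Figure 7):

import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for Figure 7 (sampling scheme assumed here): points in
# [-5, 5]^2, red inside radius 3, blue outside radius 4, the rest dropped.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(2000, 2))
r = np.linalg.norm(X, axis=1)
keep = (r <= 3) | (r >= 4)
X, y = X[keep], (r[keep] <= 3).astype(int)

# A degree-2 polynomial kernel <phi(x), phi(z)> = (x^T z + c)^2 handles the
# circular boundary, while a linear SVM on the raw features cannot separate it.
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0).fit(X, y)
print(clf.score(X, y))   # expected to be (near) 1.0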

7.2 Soft Margin SVM


We have data D = {(x^(i), y^(i))}_{i=1}^m where x^(i) ∈ R^n and y^(i) ∈ {−1, 1}. Consider the soft-margin SVM optimization problem

min_{w,b,ϵ}  L_C(w, b, ϵ) = (1/(1+C)) · (w^T w)/2 + (C/(1+C)) Σ_{i=1}^m ϵ_i
Under constraints

y (i) (wT x(i) + b) ≥ 1 − ϵi i ∈ {1, 2, . . . , m}


ϵi ≥ 0 i ∈ {1, 2, . . . , m}

Denote ϵ = [ϵ1 ϵ2 . . . ϵm ]T . C ≥ 0 is a hyperparameter. In each of the below cases, give the


optimum value of the objectives and construct the exact values of w, b, ϵ that lead to that objective.
a) L0 (w, b, ϵ)
b) L∞ (w, b, ϵ) when D is linearly separable.
c) Comment on the value of objective L∞ (w, b, ϵ) when D is not linearly separable.


References
[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT
Press. http://www.deeplearningbook.org.
[Harshit0143, 2023] Harshit0143 (2023). COL774 Fall 2023-2024 Assignments. https://github.com/Harshit0143/Machine-Learning-Assignments.

[Ng and Ma, 2023] Ng, A. and Ma, T. (June 11, 2023). CS229 Lecture Notes. https://cs229.stanford.edu/main_notes.pdf.
[Petersen and Pedersen, 2021] Petersen, K. B. and Pedersen, M. S. (November 2021). The Matrix Cookbook. https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf.
[Ross, 2014] Ross, S. M. (2014). A First Course in Probability. Pearson, Boston, 9th edition.

