Part I
87. Outline the “multiway cut algorithm” that makes use of the maxflow algorithm in each
iteration for stereo reconstruction.
98. Let F (x, θ) be the decision function and y(x) be the true label. Define the hinge loss
of a linear SVM.
16. Describe, by listing its steps, the voxel coloring algorithm proposed by Seitz and Dyer
that computes a photo-consistent reconstruction when viewpoints are constrained.
What is the crucial property of the camera arrangement that ensures the correctness of
the algorithm?
23. What is a Gaussian Pyramid? What are the two main steps to construct a Gaussian
Pyramid?
74. What is a first-order Markov Random Field in the discrete 2D space? Illustrate using
a diagram.
18. Describe, by listing its main steps, the space carving algorithm discussed in class.
What are the two lemmas we covered in class that ensure the convergence of the
algorithm?
84. Consider the first-order MRF. Under what condition can alpha expansion be used in
graph cuts, and why is it a faster algorithm?
15. State no less than 3 advantages and disadvantages (combined) of volume intersection
(voxel coloring given silhouettes).
39. Derive the expression for vanishing points on the 2D image plane.
14. Consider the reconstruction-from-silhouette algorithm that colors voxels by performing
volume intersection in 3D space. What do we get in the limit? What is the time
complexity of this algorithm, given M images and $N^3$ voxels?
68. What is principal component analysis (PCA) for estimation in $N$ dimensions? Define
it using an equation. Explain in one sentence how PCA is used to reduce the dimension
of the parameter space, which can be prohibitive.
78. Stereo reconstruction can be formulated as a graph cuts problem. (a) How do you
map the problem to the vertices and links of the underlying graph? (b) What is the label set?
34. Outline histogram-based segmentation. In your opinion, what are the advantages and
disadvantages?
79. Write down the MRF energy minimization function $E(f)$ for estimating disparity,
where $f_p$ is the configuration at a pixel $p$, $O$ is the observed data, and $V$ is the
compatibility function.
91. In the general structure and motion problem, given n matching image points and m
views, (a) for each camera, how many parameters are to be estimated (if translation
magnitude can also be estimated)? (b) for each 3D point, how many parameters are
to be estimated?
20. Explain scaled orthographic projection, and state no less than two conditions under
which scaled orthographic projection can be used to approximate perspective projection.
81. Draw the linear clique potential model on the whiteboard and explain the intuition
behind its design.
75. What will be the effect of the window size in non-parametric texture synthesis?
89. There are two steps in synthesizing one pixel in non-parametric texture synthesis:
(a) finding a matching neighborhood, and (b) synthesizing the pixel. How would you
modify the above into a block-based texture synthesis? What additional consideration
is needed in order to avoid noticeable artifacts such as seams?
5. What is the barber pole illusion? Explain its cause in the context of motion estimation.
56. What is an affine camera? Answer this question by writing down an affine camera
using inhomogeneous coordinates described in class.
50. Give three points that define an epipolar plane. Define an epipolar line using the
epipolar plane.
32. Name no less than two Gestalt laws of perceptual organization. Explain them with
illustrative examples.
37. Write out the derivation of homography and show that at least 4 correspondences are
needed to determine the homography matrix.
30. Name no less than two origins of edges we observe in images and explain what they
are.
Part II
The reason for this pattern is that there is only slow motion relative to the camera for
objects that are far away, and faster motion for objects that are close by. Please illustrate
the optical flow patterns for the following situations in a similar way (a small code sketch
of case (a) follows the list):
(a) The camera is moving forward (i.e., you are pointing the camera in your direction of
motion while walking forward).
(b) The camera is rotating clockwise along its vertical axis (e.g., while holding the
camera you are turning your body to the right as if tracking an object that is moving
rightward).
(c) The camera is rotating clockwise along its visual axis (e.g., while holding the camera
you are turning it clockwise so that you are switching from “landscape” to “portrait”
orientation of the video).
(d) Just like (c), but now you are walking forward while rotating the camera.
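To make case (a) concrete, here is a minimal sketch, assuming perspective projection with focal length 1 and the focus of expansion at the image origin (the function name and values are illustrative): forward translation at speed Tz produces flow (x, y) · Tz/Z at image point (x, y), so vectors radiate outward from the focus of expansion and far objects (large Z) move slowly.

```python
import numpy as np

# Hypothetical sketch: ideal flow for a camera translating forward at speed Tz
# toward a scene at depth Z, with focal length 1 and the principal point at
# the image origin. The flow points away from the focus of expansion.
def forward_flow(x, y, Tz=1.0, Z=10.0):
    u = x * Tz / Z
    v = y * Tz / Z
    return u, v

xs, ys = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
u_far, v_far = forward_flow(xs, ys, Z=10.0)   # far objects: small vectors
u_near, v_near = forward_flow(xs, ys, Z=2.0)  # near objects: large vectors
```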
(a) Define a Hough transform based algorithm for detecting the orientation of the plane in
the scene. That is, define the dimensions of your Hough space, a procedure for mapping
the scene points (i.e., the (X, Y, Z) coordinates for each pixel) into this space, and how
the plane’s orientation is determined.
Assuming the plane is not allowed to pass through the camera coordinate frame origin,
we can divide by d, resulting in three parameters, A = a/d, B = b/d, and C = c/d, that
define a plane. Therefore the Hough parameter space is three dimensional, corresponding
to possible values of A, B, and C. Assuming we can bound the range of possible
values of these three parameters, we then take each pixel's (X, Y, Z) coordinates and
increment all points H(p, q, r) in Hough space that satisfy pX + qY + rZ + 1 = 0. The
point (or small region) in H that has the maximum number of votes determines the
desired scene plane.
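A minimal sketch of this voting procedure, under the stated assumptions (bounded parameter ranges; the bounds, bin count, and function name are illustrative). Rather than testing every (p, q, r) cell, it loops over (A, B) and solves the plane equation for the single consistent C, keeping the vote loop two dimensional per point:

```python
import numpy as np

def hough_plane(points, bound=2.0, bins=50):
    """Vote for planes A*X + B*Y + C*Z + 1 = 0 in a discretized (A, B, C) space."""
    H = np.zeros((bins, bins, bins))
    axis = np.linspace(-bound, bound, bins)
    step = axis[1] - axis[0]
    for X, Y, Z in points:
        if abs(Z) < 1e-9:            # cannot solve for C when Z is (near) zero
            continue
        for i, A in enumerate(axis):
            for j, B in enumerate(axis):
                C = -(1.0 + A * X + B * Y) / Z   # C that satisfies the plane equation
                k = int(round((C + bound) / step))
                if 0 <= k < bins:
                    H[i, j, k] += 1              # increment the consistent cell
    # The peak in Hough space determines the dominant scene plane.
    i, j, k = np.unravel_index(np.argmax(H), H.shape)
    return axis[i], axis[j], axis[k]
```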
(b) Describe how the RANSAC algorithm could be used to detect the orientation of the
plane in the scene from the scene points. RANSAC (random sample consensus) is an
iterative algorithm that also uses a voting scheme to find the optimal fitting result.
Given a dataset whose elements contain both inliers (points on the plane, in our case)
and outliers (points off it), the data elements are used to vote for one or more candidate
models (plane orientations).
Given Step 1 in the following:
Step 1: Randomly pick 3 pixels in the image and, using their (X, Y, Z)
coordinates, compute the plane that is defined by these points.
Step 2: For each of the remaining pixels in the image, compute the distance from its (X,
Y, Z) position to the computed plane and, if it is within a threshold distance, increment a
counter of the number of points (the “inliers”) that agree with the hypothesized plane.
Step 3: Repeat Steps 1 and 2 many times, and then select the triple of points that has the
largest count associated with it.
Step 4: Using the triple of points selected in Step 3 plus all of the other inlier points which
contributed to the count, recompute the best planar fit to all of these points.
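A minimal sketch of Steps 1-4, assuming the scene points are stacked in an (N, 3) NumPy array; the iteration count and inlier threshold are illustrative, and the plane is represented by a unit normal plus a point on it:

```python
import numpy as np

def ransac_plane(points, iters=1000, thresh=0.01, seed=None):
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    best_inliers = None
    for _ in range(iters):
        # Step 1: pick 3 points and compute the plane they define.
        p0, p1, p2 = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-12:                 # degenerate (collinear) sample
            continue
        n /= norm
        # Step 2: count points within the distance threshold of the plane.
        dist = np.abs((pts - p0) @ n)
        inliers = dist < thresh
        # Step 3: keep the hypothesis with the largest inlier count.
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Step 4: refit the plane to all inliers by least squares (via SVD).
    P = pts[best_inliers]
    centroid = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - centroid)
    normal = Vt[-1]                      # direction of least variance
    return normal, centroid
```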
(a) Given this camera accuracy, what is the z-range that an object could have (i.e., the
minimum and maximum z-distance possible), if the cameras with baseline b = 10cm
and focal length f = 20cm measure positions xl = 6.1 and xr = 5.1? If you do not
remember the equation, try to derive it; it is not very difficult.
The most extreme cases are xl = 6.2, xr = 5.0 (disparity 1.2 cm) and xl = 6.0, xr = 5.2
(disparity 0.8 cm). In the first case, we derive distance z as:
z = (10 cm × 20 cm)/1.2 cm ≈ 166.67 cm
In the second case, we derive distance z as:
z = (10 cm × 20 cm)/0.8 cm = 250 cm
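A quick check of this arithmetic, assuming the standard rectified-stereo relation z = b·f/(xl − xr):

```python
b, f = 10.0, 20.0                        # baseline and focal length, in cm
for xl, xr in [(6.2, 5.0), (6.0, 5.2)]:  # extreme disparities: 1.2 cm and 0.8 cm
    z = b * f / (xl - xr)                # z = b*f / disparity
    print(f"xl={xl}, xr={xr} -> z = {z:.2f} cm")  # 166.67 cm, then 250.00 cm
```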
(b) What do you think will happen if the object is much further away from the system?
Will the error in z-distance measurement (i.e., the z-range) increase or decrease? Why?
The z-distance error will increase, because at a greater distance the same variation in the
camera image corresponds to a greater variation in the z-distance measurement (this can
also be seen from the equations above).
(c) What do you think will happen if we keep the object in the same place as in (a),
but increase the distance between the cameras, i.e., the baseline b? Will the error in
z-distance measurement (i.e., the z-range) increase or decrease? Why?
Now the z-distance error will decrease, because the same variation in the camera image
corresponds to a smaller variation in the z-distance measurement.
$\tilde{X} = [X_1 - \bar{X} \;\cdots\; X_N - \bar{X}]$ and similarly for $\tilde{Y}$ and $\bar{Y}$. Show that 4 is the minimum number of points
needed to find the transformation.
Hint: Set the derivative of $E$ w.r.t. $T$ to zero and substitute $T^*$ in $E$.
To minimize the cost E we set the first derivative to zero as:
$$\frac{\partial}{\partial T} E(A, T) = -2\sum_{i=1}^{N} (Y_i - A X_i - T) = -2\sum_{i=1}^{N} (Y_i - A X_i) + 2NT$$
Setting this to zero gives
$$T^* = \frac{1}{N}\sum_{i=1}^{N} (Y_i - A X_i) = \bar{Y} - A\bar{X}$$
Now, we can substitute the translation in the cost function and obtain
$$E(A) = \sum_{i=1}^{N} \left\| Y_i - \bar{Y} - A(X_i - \bar{X}) \right\|_2^2 = \left\| \tilde{Y} - A\tilde{X} \right\|_F^2$$
The minimization problem $\min_A \|\tilde{Y} - A\tilde{X}\|_F^2$ can be solved by taking the derivative
with respect to $A$ and setting it to zero, which gives $A^* = \tilde{Y}\tilde{X}^T \left(\tilde{X}\tilde{X}^T\right)^{-1}$.
A general 3D affine transformation introduces 3 DOF (degrees of freedom) for the translation $T \in \mathbb{R}^3$
and 9 DOF for the linear affine matrix $A \in \mathbb{R}^{3\times 3}$. Therefore, there are 12 parameters to
estimate. Every 3D point correspondence gives 3 equations, so at least 4
point correspondences are required ($4 \times 3 = 12$) for a unique solution over the 12 parameters.
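A minimal sketch of the resulting closed-form fit, assuming N ≥ 4 correspondences stacked as 3 × N arrays (names are illustrative); it follows the derivation above: center the data, solve the Frobenius-norm problem for A, then recover T* = Ȳ − A X̄:

```python
import numpy as np

def fit_affine_3d(X, Y):
    """Least-squares fit of Y_i ~ A X_i + T from N >= 4 correspondences.
    X, Y: arrays of shape (3, N)."""
    Xbar = X.mean(axis=1, keepdims=True)
    Ybar = Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - Xbar, Y - Ybar               # centered data matrices
    # min_A ||Yc - A Xc||_F^2  =>  A = Yc Xc^T (Xc Xc^T)^{-1}
    A = Yc @ Xc.T @ np.linalg.inv(Xc @ Xc.T)
    T = (Ybar - A @ Xbar).ravel()             # T* = Ybar - A Xbar
    return A, T
```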
(a) Using a projective camera model specialized for this particular scenario, write a gen-
eral formula that describes the relationship between world coordinates (x) (i.e., table
height), specifying the height of the table top, and image coordinates (u, v), specifying
the pixel coordinates where the point of light is detected. Give your answer using
homogeneous coordinates and a projection matrix containing variables. Note that this
is a 1D to 2D projective transformation.
The beam of light is a 1D line in the world, which projects to a line of points on the
table top, which in turn projects to a line in the images. Therefore this configuration
corresponds to a 1D to 2D projective transformation of the form
$$\begin{bmatrix} su \\ sv \\ s \end{bmatrix} = \begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \\ p_{31} & p_{32} \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix}$$
(b) For the first table top position given above and using your answer in (a), write out the
explicit equations that are generated by this one observation.
$$100s = 50p_{11} + p_{12} \quad\text{or}\quad 100 = \frac{50p_{11} + p_{12}}{50p_{31} + p_{32}}$$
$$250s = 50p_{21} + p_{22} \quad\text{or}\quad 250 = \frac{50p_{21} + p_{22}}{50p_{31} + p_{32}}$$
$$s = 50p_{31} + p_{32}$$
(d) How many table top positions and associated images are required to solve for all of the
unknown parameters in the projective camera model?
The projection matrix has six entries but is defined only up to scale, so there are five
unknowns; each table height yields two equations, so three positions are sufficient.
(e) Once the camera is calibrated, given a new unknown height of the table and an as-
sociated image, can the height of the table be uniquely solved for? If so, give the
equation(s) that is/are used. If not, describe briefly why not.
" # " #" #
su p11 p12 x −u
Given u, x can be uniquely determined by = so x = p31p12u−p .
s p31 1 1 11
Note that we could also have used just the v coordinate, though in practice it is a good
idea to use both coordinates together and solve for a least squares solution so as to
minimize measurement errors.
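A one-line sketch of this recovery (hypothetical function name; $p_{32}$ is taken as normalized to 1, as in the matrix above):

```python
# Hypothetical sketch: recover the table height x from a measured u coordinate,
# using x = (p12 - u) / (p31*u - p11), with p32 normalized to 1.
def table_height(u, p11, p12, p31):
    return (p12 - u) / (p31 * u - p11)
```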
(f) If in each image we only measured the u pixel coordinate of the point of light, could
the camera still be calibrated? If so, how many table top positions are required? If
not, describe briefly why not.
This problem involves a line-to-line transformation, where points on the line of light
project to a line of image points where the beam intersects the table top. Hence if we
know the distance d measured along the image line from an origin, then this can be
represented using a single image parameter, d, as follows:
" # " #" #
sd p q x
=
s r 1 1
But given only the image u coordinate, we cannot compute d. So, unless the line in
the image is parallel to the u axis, the camera cannot be calibrated given only a set of
(u, x) pairs.
(g) If instead of assuming a projective camera for (a), we instead assume an affine camera
model for this problem, write a general formula that describes the relationship between
(x) and (u, v).
" # " #" #
u p11 p12 x
=
v p21 p22 1
(h) When would an affine camera model be appropriate instead of using a projective camera
model? Give one advantage and one disadvantage of using an affine camera instead of
a projective camera for this problem.
An affine camera model is appropriate when the scene has little depth variation relative
to the viewing distance. A major advantage is computational because an affine camera
is linear. A major disadvantage is that it is a weaker model of a true camera and can
only be used in the situations described above where, for example, parallel lines do not
converge in the image.
The left part of the figure represents the computation graph of forward propagation. The
computation it represents is:
$$h = \mathrm{ReLU}(Wx + b_1)$$
$$\hat{y} = \mathrm{softmax}(Uh + b_2)$$
$$J = \mathrm{CE}(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i$$
Here ReLU (rectified linear unit) performs the element-wise rectified linear function:
ReLU(z) = max(z, 0)
$W \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^{n}$, $b_1 \in \mathbb{R}^{m}$, $U \in \mathbb{R}^{k \times m}$, $b_2 \in \mathbb{R}^{k}$
$$z_2 = Uh + b_2$$
$$\delta_1 = \frac{\partial J}{\partial z_2} = \hat{y} - y$$
$$\frac{\partial J}{\partial b_2} = \delta_1$$
$$\frac{\partial J}{\partial U} = \delta_1 h^T$$
$$\delta_2 = U^T \delta_1 \circ \mathbf{1}\{h > 0\}$$
$$\frac{\partial J}{\partial b_1} = \delta_2$$
$$\frac{\partial J}{\partial W} = \delta_2 x^T$$
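A minimal NumPy sketch that implements the forward pass and exactly the gradients listed above (shapes are the ones stated earlier; x is a single example, y is one-hot, and the function name is illustrative):

```python
import numpy as np

def forward_backward(x, y, W, b1, U, b2):
    """Forward and backward pass of the two-layer network above.
    Assumed shapes: x (n,), y one-hot (k,), W (m,n), b1 (m,), U (k,m), b2 (k,)."""
    # Forward pass.
    h = np.maximum(W @ x + b1, 0)              # ReLU
    z2 = U @ h + b2
    z2 = z2 - z2.max()                         # stabilize softmax
    yhat = np.exp(z2) / np.exp(z2).sum()       # softmax
    J = -np.sum(y * np.log(yhat))              # cross-entropy
    # Backward pass, matching the gradients derived above.
    delta1 = yhat - y                          # dJ/dz2
    db2 = delta1
    dU = np.outer(delta1, h)                   # delta1 h^T
    delta2 = (U.T @ delta1) * (h > 0)          # U^T delta1 * 1{h > 0}
    db1 = delta2
    dW = np.outer(delta2, x)                   # delta2 x^T
    return J, dW, db1, dU, db2
```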
(b) The right part of the figure represents the computation graph of backpropagation. Each
gradient operation node in this part (e.g. dMatMul) takes as input not only the partial
gradients computed already along the backpropagation path, but also, optionally, the
inputs and outputs of the corresponding operation node in the forward propagation to
calculate gradients. Here, the MatMul node in the bottom left corresponds to the dMatMul
node in the bottom right, and the ReLU node corresponds to the dReLU node. Briefly
explain why the corresponding inputs and/or outputs of the forward operation node
are sometimes needed when backpropagating to calculate gradient. Give an example
for each case.
i. Explanation: Because they may be used in the expression for the gradient.
ii. Case 1: The input of the corresponding operation node is needed.
If $h_0 = W_0 x$, and we already know $\delta = \frac{\partial J}{\partial h_0}$, then $\frac{\partial J}{\partial W_0} = \delta x^T$. So to calculate
$\frac{\partial J}{\partial W_0}$ we need the value of $x$, which is the input to the MatMul node.
These answers are also correct: ReLU, CE.
iii. Case 2: The output of the corresponding operation node is needed.
Because $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, calculating the backpropagated gradient of the sigmoid
function would require the output of the sigmoid node.
These answers are also correct: ReLU, tanh.
iv. The CE node (softmax cross-entropy) in the graph expects unscaled inputs (they
are not yet exponentiated), since it performs a softmax internally for efficiency.
Give a reason why this is more efficient than a separate softmax and cross-entropy
layer.
Recall that the gradient of $J$ with respect to the unexponentiated inputs to the CE
node is $\hat{y} - y$. This is very easy to compute.
We also give full credit for these reasons:
1. You don't have to first take exp and then take log, which improves numerical stability.
2. $y$ is one-hot, so we only need to compute $\hat{y}_k$, where $k$ is the correct label.
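A minimal sketch of the fused computation (hypothetical function name): working in log space via log-sum-exp avoids the exp-then-log round trip, and the gradient with respect to the logits is simply ŷ − y:

```python
import numpy as np

def fused_softmax_ce(z2, y):              # z2: logits (k,), y: one-hot (k,)
    z = z2 - z2.max()                     # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log softmax via log-sum-exp
    loss = -np.sum(y * log_probs)         # only the true-class term is nonzero
    grad = np.exp(log_probs) - y          # dJ/dz2 = yhat - y
    return loss, grad
```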
– End of Exam –