Part I
87. Outline the “multiway cut algorithm” that makes use of the maxflow algorithm in each
iteration for stereo reconstruction.
98. Let F (x, θ) be the decision function and y(x) be the true label. Define the hinge loss
of a linear SVM.
16. Describe, by listing its steps, the voxel coloring algorithm proposed by Seitz and Dyer
that computes a photo-consistent reconstruction when viewpoints are constrained.
What is the crucial property of the camera arrangement that ensures the correctness of
the algorithm?
23. What is a Gaussian Pyramid? What are the two main steps to construct a Gaussian
Pyramid?
74. What is a first-order Markov Random Field in the discrete 2D space? Illustrate using
a diagram.
18. Describe, by listing its main steps, the space carving algorithm discussed in class.
What are the two lemmas we covered in class that ensure the convergence of the
algorithm?
84. Consider the first-order MRF. Under what condition can alpha expansion be used in
graph cuts, and why is it a faster algorithm?
15. State no less than 3 advantages and disadvantages (combined) of volume intersection
(voxel coloring given silhouettes).
39. Derive the expression for vanishing points on the 2D image plane.
14. Consider the reconstruction-from-silhouette algorithm that colors voxels by performing
volume intersection in 3D space. What do we get in the limit? What is the time
complexity of this algorithm, given M images and $N^3$ voxels?
68. What is principal component analysis (PCA) for estimation in $N$ dimensions? Define
it using an equation. Explain in one sentence how PCA is used to reduce the dimension
of the parameter space, which can be prohibitive.
78. Stereo reconstruction can be formulated as a graph cuts problem. (a) How do you
map the problem to the vertices and links of the underlying graph? (b) What is the label set?
34. Outline histogram-based segmentation. In your opinion, what are the advantages and
disadvantages?
79. Write down the MRF energy minimization function $E(f)$ for estimating disparity,
where $f_p$ is the configuration at a pixel $p$, $O$ is the observed data, and $V$ is the
compatibility function.
91. In the general structure and motion problem, given n matching image points and m
views, (a) for each camera, how many parameters are to be estimated (if translation
magnitude can also be estimated)? (b) for each 3D point, how many parameters are
to be estimated?
20. Explain scaled orthographic projection, and state no less than two conditions under
which scaled orthographic projection can be used to approximate perspective projection.
81. Draw the linear clique potential model on the whiteboard and explain the intuition
behind its design.
75. What will be the effect of the window size in non-parametric texture synthesis?
89. There are two steps in synthesizing one pixel in non-parametric texture synthesis:
(a) finding a matching neighborhood, and (b) synthesizing the pixel. How would you
modify the above into a block-based texture synthesis? What additional consideration
is needed in order to avoid noticeable artifacts such as seams?
5. What is the barber pole illusion? Explain its cause in the context of motion estimation.
56. What is an affine camera? Answer this question by writing down an affine camera
using inhomogeneous coordinates described in class.
50. Give three points that define an epipolar plane. Define an epipolar line using the
epipolar plane.
32. Name no less than two Gestalt laws of perceptual organization. Explain them with
illustrative examples.
37. Write out the derivation of homography and show that at least 4 correspondences are
needed to determine the homography matrix.
30. Name no less than two origins of edges we observe in images and explain what they
are.
Part II
The reason for this pattern is that there is only slow motion relative to the camera for
objects that are far away, and faster motion for objects that are close by. Please illustrate
the optical flow patterns for the following situations in a similar way (a small code sketch
of case (a) follows the list):
(a) The camera is moving forward (i.e., you are pointing the camera in your direction of
motion while walking forward).
(b) The camera is rotating clockwise along its vertical axis (e.g., while holding the
camera you are turning your body to the right as if tracking an object that is moving
rightward).
(c) The camera is rotating clockwise along its visual axis (e.g., while holding the camera
you are turning it clockwise so that you are switching from “landscape” to “portrait”
orientation of the video).
(d) Just like (c), but now you are walking forward while rotating the camera.
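To make case (a) concrete, here is a minimal sketch, assuming perspective projection with focal length 1 and the focus of expansion at the image origin (the function name and values are illustrative): forward translation at speed Tz produces flow (x, y) · Tz/Z at image point (x, y), so vectors radiate outward from the focus of expansion and far objects (large Z) move slowly.

```python
import numpy as np

# Hypothetical sketch: ideal flow for a camera translating forward at speed Tz
# toward a scene at depth Z, with focal length 1 and the principal point at
# the image origin. The flow points away from the focus of expansion.
def forward_flow(x, y, Tz=1.0, Z=10.0):
    u = x * Tz / Z
    v = y * Tz / Z
    return u, v

xs, ys = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
u_far, v_far = forward_flow(xs, ys, Z=10.0)   # far objects: small vectors
u_near, v_near = forward_flow(xs, ys, Z=2.0)  # near objects: large vectors
```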
(a) Define a Hough transform based algorithm for detecting the orientation of the plane in
the scene. That is, define the dimensions of your Hough space, a procedure for mapping
the scene points (i.e., the (X, Y, Z) coordinates for each pixel) into this space, and how
the plane’s orientation is determined.
Assuming the plane is not allowed to pass through the camera coordinate frame origin,
we can divide by d, resulting in three parameters, A = a/d, B = b/d, and C = c/d, that
define a plane. Therefore the Hough parameter space is three dimensional, corresponding
to possible values of A, B, and C. Assuming we can bound the range of possible
values of these three parameters, we then take each pixel's (X, Y, Z) coordinates and
increment all points H(p, q, r) in Hough space that satisfy pX + qY + rZ + 1 = 0. The
point (or small region) in H that has the maximum number of votes determines the
desired scene plane.
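A minimal sketch of this voting procedure, under the stated assumptions (bounded parameter ranges; the bounds, bin count, and function name are illustrative). Rather than testing every (p, q, r) cell, it loops over (A, B) and solves the plane equation for the single consistent C, keeping the vote loop two dimensional per point:

```python
import numpy as np

def hough_plane(points, bound=2.0, bins=50):
    """Vote for planes A*X + B*Y + C*Z + 1 = 0 in a discretized (A, B, C) space."""
    H = np.zeros((bins, bins, bins))
    axis = np.linspace(-bound, bound, bins)
    step = axis[1] - axis[0]
    for X, Y, Z in points:
        if abs(Z) < 1e-9:            # cannot solve for C when Z is (near) zero
            continue
        for i, A in enumerate(axis):
            for j, B in enumerate(axis):
                C = -(1.0 + A * X + B * Y) / Z   # C that satisfies the plane equation
                k = int(round((C + bound) / step))
                if 0 <= k < bins:
                    H[i, j, k] += 1              # increment the consistent cell
    # The peak in Hough space determines the dominant scene plane.
    i, j, k = np.unravel_index(np.argmax(H), H.shape)
    return axis[i], axis[j], axis[k]
```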
(b) Describe how the RANSAC algorithm could be used to detect the orientation of the
plane in the scene from the scene points. RANSAC (random sample consensus) is an
iterative algorithm that also uses a voting scheme to find the optimal fitting result.
Given a dataset whose elements contain both inliers (points on the plane, in our case)
and outliers (points off it), the data elements are used to vote for one or more candidate
models (plane orientations).
Given Step 1 in the following:
Step 1: Randomly pick 3 pixels in the image and, using their (X, Y, Z)
coordinates, compute the plane that is defined by these points.
Step 2: For each of the remaining pixels in the image, compute the distance from its (X,
Y, Z) position to the computed plane and, if it is within a threshold distance, increment a
counter of the number of points (the “inliers”) that agree with the hypothesized plane.
Step 3: Repeat Steps 1 and 2 many times, and then select the triple of points that has the
largest count associated with it.
Step 4: Using the triple of points selected in Step 3 plus all of the other inlier points which
contributed to the count, recompute the best planar fit to all of these points.
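A minimal sketch of Steps 1-4, assuming the scene points are stacked in an (N, 3) NumPy array; the iteration count and inlier threshold are illustrative, and the plane is represented by a unit normal plus a point on it:

```python
import numpy as np

def ransac_plane(points, iters=1000, thresh=0.01, seed=None):
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    best_inliers = None
    for _ in range(iters):
        # Step 1: pick 3 points and compute the plane they define.
        p0, p1, p2 = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-12:                 # degenerate (collinear) sample
            continue
        n /= norm
        # Step 2: count points within the distance threshold of the plane.
        dist = np.abs((pts - p0) @ n)
        inliers = dist < thresh
        # Step 3: keep the hypothesis with the largest inlier count.
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Step 4: refit the plane to all inliers by least squares (via SVD).
    P = pts[best_inliers]
    centroid = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - centroid)
    normal = Vt[-1]                      # direction of least variance
    return normal, centroid
```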
(a) Given this camera accuracy, what is the z-range that an object could have (i.e., the
minimum and maximum z-distance possible), if the cameras with baseline b = 10cm
and focal length f = 20cm measure positions xl = 6.1 and xr = 5.1? If you do not
remember the equation, try to derive it; it is not very difficult.
The most extreme cases are xl = 6.2, xr = 5.0 (disparity 1.2 cm) and xl = 6.0, xr = 5.2
(disparity 0.8 cm). In the first case, we derive distance z as:
z = (10 cm × 20 cm)/1.2 cm ≈ 166.67 cm
In the second case, we derive distance z as:
z = (10 cm × 20 cm)/0.8 cm = 250 cm
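A quick check of this arithmetic, assuming the standard rectified-stereo relation z = b·f/(xl − xr):

```python
b, f = 10.0, 20.0                        # baseline and focal length, in cm
for xl, xr in [(6.2, 5.0), (6.0, 5.2)]:  # extreme disparities: 1.2 cm and 0.8 cm
    z = b * f / (xl - xr)                # z = b*f / disparity
    print(f"xl={xl}, xr={xr} -> z = {z:.2f} cm")  # 166.67 cm, then 250.00 cm
```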
(b) What do you think will happen if the object is much further away from the system?
Will the error in z-distance measurement (i.e., the z-range) increase or decrease? Why?
The z-distance error will increase, because at a greater distance the same variation in the
camera image corresponds to a greater variation in the z-distance measurement (this can
also be seen from the equations above).
(c) What do you think will happen if we keep the object in the same place as in (a),
but increase the distance between the cameras, i.e., the baseline b? Will the error in
z-distance measurement (i.e., the z-range) increase or decrease? Why?
Now the z-distance error will decrease, because the same variation in the camera image
corresponds to a smaller variation in the z-distance measurement.
$\tilde{X} = [X_1 - \bar{X} \;\cdots\; X_N - \bar{X}]$ and similarly for $\tilde{Y}$ and $\bar{Y}$. Show that 4 is the minimum number of points
needed to find the transformation.
Hint: Set the derivative of $E$ w.r.t. $T$ to zero and substitute $T^*$ in $E$.
To minimize the cost E we set the first derivative to zero as:
$$\frac{\partial}{\partial T} E(A, T) = -2\sum_{i=1}^{N} (Y_i - A X_i - T) = -2\sum_{i=1}^{N} (Y_i - A X_i) + 2NT$$
Setting this to zero gives
$$T^* = \frac{1}{N}\sum_{i=1}^{N} (Y_i - A X_i) = \bar{Y} - A\bar{X}$$
Now, we can substitute the translation in the cost function and obtain
$$E(A) = \sum_{i=1}^{N} \left\| Y_i - \bar{Y} - A(X_i - \bar{X}) \right\|_2^2 = \left\| \tilde{Y} - A\tilde{X} \right\|_F^2$$
The minimization problem $\min_A \|\tilde{Y} - A\tilde{X}\|_F^2$ can be solved by taking the derivative
with respect to $A$ and setting it to zero, which gives $A^* = \tilde{Y}\tilde{X}^T \left(\tilde{X}\tilde{X}^T\right)^{-1}$.
A general 3D affine transformation introduces 3 DOF (degrees of freedom) for the translation $T \in \mathbb{R}^3$
and 9 DOF for the linear affine matrix $A \in \mathbb{R}^{3\times 3}$. Therefore, there are 12 parameters to
estimate. Every 3D point correspondence gives 3 equations, so at least 4
point correspondences are required ($4 \times 3 = 12$) for a unique solution over the 12 parameters.
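A minimal sketch of the resulting closed-form fit, assuming N ≥ 4 correspondences stacked as 3 × N arrays (names are illustrative); it follows the derivation above: center the data, solve the Frobenius-norm problem for A, then recover T* = Ȳ − A X̄:

```python
import numpy as np

def fit_affine_3d(X, Y):
    """Least-squares fit of Y_i ~ A X_i + T from N >= 4 correspondences.
    X, Y: arrays of shape (3, N)."""
    Xbar = X.mean(axis=1, keepdims=True)
    Ybar = Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - Xbar, Y - Ybar               # centered data matrices
    # min_A ||Yc - A Xc||_F^2  =>  A = Yc Xc^T (Xc Xc^T)^{-1}
    A = Yc @ Xc.T @ np.linalg.inv(Xc @ Xc.T)
    T = (Ybar - A @ Xbar).ravel()             # T* = Ybar - A Xbar
    return A, T
```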
(a) Using a projective camera model specialized for this particular scenario, write a gen-
eral formula that describes the relationship between world coordinates (x) (i.e., table
height), specifying the height of the table top, and image coordinates (u, v), specifying
the pixel coordinates where the point of light is detected. Give your answer using
homogeneous coordinates and a projection matrix containing variables. Note that this
is a 1D to 2D projective transformation.
The beam of light is a 1D line in the world, which projects to a line of points on the
table top, which in turn projects to a line in the images. Therefore this configuration
corresponds to a 1D to 2D projective transformation of the form
$$\begin{bmatrix} su \\ sv \\ s \end{bmatrix} = \begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \\ p_{31} & p_{32} \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix}$$
(b) For the first table top position given above and using your answer in (a), write out the
explicit equations that are generated by this one observation.
$$100s = 50p_{11} + p_{12} \quad\text{or}\quad 100 = \frac{50p_{11} + p_{12}}{50p_{31} + p_{32}}$$
$$250s = 50p_{21} + p_{22} \quad\text{or}\quad 250 = \frac{50p_{21} + p_{22}}{50p_{31} + p_{32}}$$
$$s = 50p_{31} + p_{32}$$
(d) How many table top positions and associated images are required to solve for all of the
unknown parameters in the projective camera model?
The projection matrix has six entries but is defined only up to scale, so there are five
unknowns; each table height yields two equations, so three positions are sufficient.
(e) Once the camera is calibrated, given a new unknown height of the table and an as-
sociated image, can the height of the table be uniquely solved for? If so, give the
equation(s) that is/are used. If not, describe briefly why not.
" # " #" #
su p11 p12 x −u
Given u, x can be uniquely determined by = so x = p31p12u−p .
s p31 1 1 11
Note that we could also have used just the v coordinate, though in practice it is a good
idea to use both coordinates together and solve for a least squares solution so as to
minimize measurement errors.
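A one-line sketch of this recovery (hypothetical function name; $p_{32}$ is taken as normalized to 1, as in the matrix above):

```python
# Hypothetical sketch: recover the table height x from a measured u coordinate,
# using x = (p12 - u) / (p31*u - p11), with p32 normalized to 1.
def table_height(u, p11, p12, p31):
    return (p12 - u) / (p31 * u - p11)
```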
(f) If in each image we only measured the u pixel coordinate of the point of light, could
the camera still be calibrated? If so, how many table top positions are required? If
not, describe briefly why not.
This problem involves a line-to-line transformation, where points on the line of light
project to a line of image points where the beam intersects the table top. Hence if we
know the distance d measured along the image line from an origin, then this can be
represented using a single image parameter, d, as follows:
" # " #" #
sd p q x
=
s r 1 1
But given only the image u coordinate, we cannot compute d. So, unless the line in
the image is parallel to the u axis, the camera cannot be calibrated given only a set of
(u, x) pairs.
(g) If instead of assuming a projective camera for (a), we instead assume an affine camera
model for this problem, write a general formula that describes the relationship between
(x) and (u, v).
" # " #" #
u p11 p12 x
=
v p21 p22 1
(h) When would an affine camera model be appropriate instead of using a projective camera
model? Give one advantage and one disadvantage of using an affine camera instead of
a projective camera for this problem.
An affine camera model is appropriate when the scene has little depth variation relative
to the viewing distance. A major advantage is computational because an affine camera
is linear. A major disadvantage is that it is a weaker model of a true camera and can
only be used in the situations described above where, for example, parallel lines do not
converge in the image.
The left part of the figure represents the computation graph of forward propagation. The
computation it represents is:
$$h = \mathrm{ReLU}(Wx + b_1)$$
$$\hat{y} = \mathrm{softmax}(Uh + b_2)$$
$$J = \mathrm{CE}(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i$$
Here ReLU (rectified linear unit) performs the element-wise rectified linear function:
ReLU(z) = max(z, 0)
$W \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^{n}$, $b_1 \in \mathbb{R}^{m}$, $U \in \mathbb{R}^{k \times m}$, $b_2 \in \mathbb{R}^{k}$
$$z_2 = Uh + b_2$$
$$\delta_1 = \frac{\partial J}{\partial z_2} = \hat{y} - y$$
$$\frac{\partial J}{\partial b_2} = \delta_1$$
$$\frac{\partial J}{\partial U} = \delta_1 h^T$$
$$\delta_2 = U^T \delta_1 \circ \mathbf{1}\{h > 0\}$$
$$\frac{\partial J}{\partial b_1} = \delta_2$$
$$\frac{\partial J}{\partial W} = \delta_2 x^T$$
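A minimal NumPy sketch that implements the forward pass and exactly the gradients listed above (shapes are the ones stated earlier; x is a single example, y is one-hot, and the function name is illustrative):

```python
import numpy as np

def forward_backward(x, y, W, b1, U, b2):
    """Forward and backward pass of the two-layer network above.
    Assumed shapes: x (n,), y one-hot (k,), W (m,n), b1 (m,), U (k,m), b2 (k,)."""
    # Forward pass.
    h = np.maximum(W @ x + b1, 0)              # ReLU
    z2 = U @ h + b2
    z2 = z2 - z2.max()                         # stabilize softmax
    yhat = np.exp(z2) / np.exp(z2).sum()       # softmax
    J = -np.sum(y * np.log(yhat))              # cross-entropy
    # Backward pass, matching the gradients derived above.
    delta1 = yhat - y                          # dJ/dz2
    db2 = delta1
    dU = np.outer(delta1, h)                   # delta1 h^T
    delta2 = (U.T @ delta1) * (h > 0)          # U^T delta1 * 1{h > 0}
    db1 = delta2
    dW = np.outer(delta2, x)                   # delta2 x^T
    return J, dW, db1, dU, db2
```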
(b) The right part of the figure represents the computation graph of backpropagation. Each
gradient operation node in this part (e.g. dMatMul) takes as input not only the partial
gradients computed already along the backpropagation path, but also, optionally, the
inputs and outputs of the corresponding operation node in the forward propagation to
calculate gradients. Here, the MatMul node in the bottom left corresponds to the dMatMul
node in the bottom right, and the ReLU node corresponds to the dReLU node. Briefly
explain why the corresponding inputs and/or outputs of the forward operation node
are sometimes needed when backpropagating to calculate gradient. Give an example
for each case.
i. Explanation: Because they may be used in the expression for the gradient.
ii. Case 1: The input of the corresponding operation node is needed.
If $h_0 = W_0 x$, and we already know $\delta = \frac{\partial J}{\partial h_0}$, then $\frac{\partial J}{\partial W_0} = \delta x^T$. So to calculate
$\frac{\partial J}{\partial W_0}$ we need the value of $x$, which is the input to the MatMul node.
These answers are also correct: ReLU, CE.
iii. Case 2: The output of the corresponding operation node is needed.
Because $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, calculating the backpropagated gradient of the sigmoid
function would require the output of the sigmoid node.
These answers are also correct: ReLU, tanh.
iv. The CE node (softmax cross-entropy) in the graph expects unscaled inputs (they
are not yet exponentiated), since it performs a softmax internally for efficiency.
Give a reason why this is more efficient than a separate softmax and cross-entropy
layer.
Recall that the gradient of $J$ with respect to the unexponentiated inputs to the CE
node is $\hat{y} - y$. This is very easy to compute.
We also give full credit for these reasons:
1. You don't have to first take exp and then take log, which improves numerical stability.
2. $y$ is one-hot, so we only need to compute $\hat{y}_k$, where $k$ is the correct label.
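A minimal sketch of the fused computation (hypothetical function name): working in log space via log-sum-exp avoids the exp-then-log round trip, and the gradient with respect to the logits is simply ŷ − y:

```python
import numpy as np

def fused_softmax_ce(z2, y):              # z2: logits (k,), y: one-hot (k,)
    z = z2 - z2.max()                     # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log softmax via log-sum-exp
    loss = -np.sum(y * log_probs)         # only the true-class term is nonzero
    grad = np.exp(log_probs) - y          # dJ/dz2 = yhat - y
    return loss, grad
```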
– End of Exam –