VSLAM Tutorial CVPR14 A13 BundleAdjustment Handout
Frank Dellaert
June 27, 2014
2 In a Linear World...
In a linear world, the measurement function h would be a linear function of the unknowns. For
reasons that will become clear, let us introduce a 6DOF unknown $\xi_i$ associated with camera $i$, and
a 3DOF unknown $\delta_j$ associated with point $j$. Then linearity implies

$$\hat p_{ij} = h_{ij}(\xi_i, \delta_j) \stackrel{\Delta}{=} F_{ij}\,\xi_i + G_{ij}\,\delta_j$$

where we also introduced $b_{ij}$, the "linear" measurement. Then we can easily re-write the objective
function (2) as

$$E(\xi_1, \xi_2, \delta_j) \stackrel{\Delta}{=} \|Ax - b\|^2_{\Sigma'} \qquad (3)$$

with $x$ and $b$ vectorized forms of our unknowns and measurements, respectively, and $\Sigma'$ a suitably
defined (block-diagonal) covariance matrix.
3 Sparse Matters
The matrix A will be a block-sparse matrix [Hartley and Zisserman, 2004]. For example, with 3
points, we have
$$A = \begin{bmatrix}
F_{11} & & G_{11} & & \\
F_{12} & & & G_{12} & \\
F_{13} & & & & G_{13} \\
& F_{21} & G_{21} & & \\
& F_{22} & & G_{22} & \\
& F_{23} & & & G_{23}
\end{bmatrix}, \quad
x = \begin{bmatrix} \xi_1 \\ \xi_2 \\ \delta_1 \\ \delta_2 \\ \delta_3 \end{bmatrix}, \quad
b = \begin{bmatrix} b_{11} \\ b_{12} \\ b_{13} \\ b_{21} \\ b_{22} \\ b_{23} \end{bmatrix}$$
A linear least-squares criterion of the form (3) is easy to solve: we just take the derivative with
respect to $x$ and set it to zero, obtaining the condition

$$A^\top A\, x = A^\top b, \qquad (4)$$

also known as the normal equations. In the above, the symmetric matrix $A^\top A$ is known as the
Hessian of the system. Because $A$ is sparse, it is imperative for the sake of computational
efficiency that a sparse solver is used to solve for $x$, e.g., GTSAM or CHOLMOD. The latter is built
into MATLAB and is automatically invoked simply by typing x = A\b.
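Outside of MATLAB, the same sparse normal-equations solve can be sketched as follows. This is my own illustration, not code from the handout: a random sparse matrix stacked on an identity stands in for a block-sparse bundle-adjustment Jacobian (the identity merely guarantees full column rank).

```python
import numpy as np
from scipy.sparse import identity, random as sparse_random, vstack
from scipy.sparse.linalg import spsolve

# Build a stand-in sparse "Jacobian" A and right-hand side b.
m, n = 200, 50
rng = np.random.default_rng(42)
A = vstack([sparse_random(m, n, density=0.05, random_state=rng),
            identity(n)]).tocsr()
b = rng.standard_normal(m + n)

# Solve the normal equations (4) with a sparse direct solver.
x = spsolve((A.T @ A).tocsc(), A.T @ b)

# At the least-squares optimum the gradient A'(Ax - b) vanishes.
grad = A.T @ (A @ x - b)
```

In practice one would not form $A^\top A$ densely; sparse Cholesky libraries such as CHOLMOD exploit the block structure directly.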
4 Back to Reality
In reality, the measurement function h(Tcw , Pw ) is quite non-linear, and generates the predicted
measurement p̂ by first transforming the point Pw into camera coordinates Pc , as specified by the
camera Tcw , then projecting the point so obtained into the image:
$$\hat p = h(T_{cw}, P^w) \stackrel{\Delta}{=} \pi\left(g(T_{cw}, P^w)\right)$$
Then the predicted image coordinates p̂ are obtained by the camera projection function π
$$\hat p = (\hat x, \hat y) \stackrel{\Delta}{=} \pi(X^c, Y^c, Z^c) = \left(\frac{X^c}{Z^c}, \frac{Y^c}{Z^c}\right) \qquad (6)$$
Above we assumed that the measurements p are pre-calibrated to normalized image coordinates
p = (x, y), as we are interested in the setting where the calibration matrix K and radial distortion
function f (r) are known.
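The projection of equation (6) is a one-liner in code. This is a minimal sketch of my own (the function name is not from the handout), mapping a point in camera coordinates to normalized image coordinates:

```python
import numpy as np

def project(P_c):
    """Projection pi of Eq. (6): camera coordinates (X, Y, Z) to
    normalized image coordinates (X/Z, Y/Z)."""
    X, Y, Z = P_c
    return np.array([X / Z, Y / Z])

p_hat = project(np.array([2.0, 1.0, 4.0]))  # -> (0.5, 0.25)
```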
So how can we get back to the easy, linear setting? Maybe we can do a Taylor expansion, as in
$$h(T_{cw} + \xi, P^w + \delta) \approx h(T_{cw}, P^w) + F\xi + G\delta.$$
However, here we run into trouble, as “+” is not defined for elements of SE(3). The next section
introduces a more general view of Taylor expansion that will work, however.
5 A Twist of Li(m)e
The notion of an incremental update to an element on a manifold a ∈ M is provided by the notion
of a tangent space at a, denoted Ta M , which is a vector space Rn with n equal to the dimensionality
of the manifold. Any vector x ∈ Ta M in the tangent space can be mapped back to the
manifold, with x = 0 mapping to a. Hence, vectors x correspond to updates around a, and this pro-
vides a way for our linearized problem to update the solution on the manifold [Absil et al., 2007].
In the special case that the manifold is also a group, i.e., a Lie group, the mapping is provided
by the exponential map. In what follows, I use the notation from [Murray et al., 1994], directly
specialized to SE(3): a vector $\xi \in \mathbb{R}^6$ is mapped to an element $\hat\xi$ of the Lie algebra $\mathfrak{se}(3)$, in this
case a $4 \times 4$ matrix, which is then mapped to a Lie group element by matrix exponentiation:

$$T = \exp\left(\hat\xi\right)$$
For SE(3) the 6-dimensional quantity (ω, v) is also called a twist, whereas θ ∈ R+ is the amount
of twist applied, resulting in a finite screw motion. In general, the elements of the vector space
isomorphic to the Lie algebra are called canonical coordinates. More details about the Lie algebra
se(3) and its corresponding exponential map can be found in Appendix A.
More generally, the Taylor expansion of a function $f : G \to \mathbb{R}^m$ on a Lie group $G$ takes the form

$$f\left(a\, e^{\hat\xi}\right) \approx f(a) + f'(a)\,\xi$$

where $\hat\xi \in \mathfrak{g}$, the Lie algebra of $G$, and $f'(a)$ is an $m \times n$ Jacobian matrix, linearly mapping the
$n$-dimensional exponential coordinates $\xi$ to $m$-dimensional corrections on $f(a)$. Note that $f'(a)$ is
itself a function of $a$, i.e., the linear mapping is in general different for different values of $a$.
In the image projection case, we need to similarly generalize the notion of partial derivatives,
to approximate the two-argument measurement function h:
$$h\left(T_{cw}\, e^{\hat\xi}, P^w + \delta\right) \approx h(T_{cw}, P^w) + h^{[1]}(T_{cw}, P^w)\,\xi + h^{[2]}(T_{cw}, P^w)\,\delta$$
Above h[1] (.) and h[2] (.) are the 2 × 6 and 2 × 3 Jacobian matrices corresponding to the partial
derivatives of h in its first and second arguments, respectively. Note that the incremental pose
exp(ξˆ ) is applied on the camera side, and hence the exponential coordinates ξ = (ω, v) are ex-
pressed in the camera frame.
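As a sanity check on such Jacobians, the point Jacobian $h^{[2]}$ can be formed by the chain rule through $\pi$ and $g$ and compared against a finite difference. The sketch below is my own (names and the convention $g(T, P^w) = R\,P^w + t$, with $R$ the world-to-camera rotation, are assumptions, not the handout's code):

```python
import numpy as np

def g(R_cw, t_cw, P_w):
    return R_cw @ P_w + t_cw            # world -> camera (assumed convention)

def pi(P_c):
    return P_c[:2] / P_c[2]             # projection, Eq. (6)

def h(R_cw, t_cw, P_w):
    return pi(g(R_cw, t_cw, P_w))

def Dpi(P_c):
    """2x3 Jacobian of pi at a point in camera coordinates."""
    x, y = P_c[:2] / P_c[2]
    return (1.0 / P_c[2]) * np.array([[1.0, 0.0, -x],
                                      [0.0, 1.0, -y]])

c, s = np.cos(0.4), np.sin(0.4)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t = np.array([0.1, -0.2, 0.3])
P = np.array([1.0, 2.0, 5.0])

# Chain rule: h^[2] = Dpi(P^c) @ dg/dP = Dpi(P^c) @ R.
H2 = Dpi(g(R, t, P)) @ R

# Central finite-difference check, one column per axis.
eps = 1e-6
H2_num = np.column_stack([
    (h(R, t, P + eps * e) - h(R, t, P - eps * e)) / (2 * eps)
    for e in np.eye(3)])
```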
7 Nonlinear Optimization
By defining $F_{ij} \stackrel{\Delta}{=} h^{[1]}(T_{iw}, P^w_j)$, $G_{ij} \stackrel{\Delta}{=} h^{[2]}(T_{iw}, P^w_j)$, and the linearization prediction error $b_{ij} \stackrel{\Delta}{=} h(T_{iw}, P^w_j) - p_{ij}$, we recover exactly the linear objective function in (3), which we can minimize
for the optimal exponential coordinates $\xi_i$ and $\delta_j$.
Gauss-Newton nonlinear optimization simply iterates between linearization, solve, and update,
until a predefined convergence criterion is satisfied:
1. Start with a good initial estimate $\theta^0 = \{T_{1w}, T_{2w}, P^w_j\}$.
2. Linearize around the current estimate $\theta^t$ to obtain the Jacobians and prediction errors.
3. Solve the normal equations $A^\top A\, x = A^\top b$ for the update $x$, where $A^\top A$ is the Gauss-Newton approximation to the Hessian of the actual non-linear system.
4. Update the nonlinear estimate θ it+1 using the exponential map
5. If not converged, go to 2.
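The iteration above can be sketched on a toy problem. This is my own illustration: the unknown lives in a vector space, so the exponential-map update of step 4 reduces to plain addition; the model and constants are invented for the example.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * t) + 0.3               # noiseless data, truth (a, c) = (0.7, 0.3)

def residual(theta):
    a, c = theta
    return np.exp(a * t) + c - y

def jacobian(theta):
    a, c = theta
    return np.column_stack([t * np.exp(a * t), np.ones_like(t)])

theta = np.array([0.0, 0.0])            # step 1: initial estimate
for _ in range(20):
    A = jacobian(theta)                 # step 2: linearize
    b = -residual(theta)
    dx = np.linalg.solve(A.T @ A, A.T @ b)  # step 3: normal equations
    theta = theta + dx                  # step 4: update (here: vector addition)
    if np.linalg.norm(dx) < 1e-10:      # step 5: convergence test
        break
```

On the manifold-valued problem of the handout, step 4 would instead compose each pose with $\exp(\hat\xi_i)$.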
9 A Matter of Trust
When far from the solution, the Hessian approximation A0 A might be valid for a very small region
only, and taking “quadratic” steps as given by the normal equations will lead to divergence. A well
known way to deal with this issue is to limit the steps based on the amount of “trust” we have in
the current linearization, switching to a more cautious gradient descent approach when far from
the local minimum.
The most well-known of these so-called "trust region" methods is Levenberg-Marquardt
(LM), which solves

$$\left(A^\top A + \lambda D\right) x = A^\top b,$$

where $D$ is either the diagonal of $A^\top A$ or the identity matrix. For the latter case, with $\lambda$ high, the
solution is approximately

$$x = \lambda^{-1} A^\top b,$$

which is gradient descent with step size $\lambda^{-1}$. When $\lambda$ is zero or small, LM reverts to quadratic
Gauss-Newton steps. Simple implementations of LM start with high values of $\lambda$ and gradually
decrease it, backing off if this leads to steps with non-decreasing error. More sophisticated variants
put a lot more work into finding a good $\lambda$ at every step.
What LM is not designed for is masking out gauge freedoms, although in many implementa-
tions it is used that way: taking smaller steps when not needed will lead to slower convergence.
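A simple accept/reject damping schedule of the kind described above can be sketched as follows (my own toy problem and constants, with $D = I$; not the handout's code):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * t) + 0.3               # noiseless data, truth (0.7, 0.3)

def residual(theta):
    a, c = theta
    return np.exp(a * t) + c - y

def jacobian(theta):
    a, c = theta
    return np.column_stack([t * np.exp(a * t), np.ones_like(t)])

theta, lam = np.array([0.0, 0.0]), 1e-2
for _ in range(50):
    A, r = jacobian(theta), residual(theta)
    # Damped normal equations (A'A + lambda I) x = -A'r.
    dx = np.linalg.solve(A.T @ A + lam * np.eye(2), -A.T @ r)
    if np.sum(residual(theta + dx) ** 2) < np.sum(r ** 2):
        theta, lam = theta + dx, lam / 10.0   # error decreased: accept, trust more
    else:
        lam *= 10.0                           # error increased: reject, back off
```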
1. The $2 \times 3$ Jacobian matrix of the projection function $\pi$ with respect to the point in camera
coordinates is

$$\pi'(P^c) = \frac{1}{Z^c}\begin{bmatrix} 1 & 0 & -X^c/Z^c \\ 0 & 1 & -Y^c/Z^c \end{bmatrix} = \frac{1}{Z^c}\begin{bmatrix} 1 & 0 & -\hat x \\ 0 & 1 & -\hat y \end{bmatrix} \qquad (7)$$
2. The 3 × 3 Jacobian matrix g[2] (Tcw , Pw ) of the coordinate transformation g with respect to a
change δ in the point Pw is also easy to compute, as
$$g(T_{cw}, P^w + \delta) = (R^w_c)^\top (P^w + t^w_c) + (R^w_c)^\top \delta$$

so that $g^{[2]}(T_{cw}, P^w) = (R^w_c)^\top$.
3. The change in pose is a bit more challenging. It can be shown, however, that
$$g\left(T_{cw}\, e^{\hat\xi}, P^w\right) \approx P^c + P^c \times \omega - v$$
This is linear in both $\omega$ and $v$, and hence the $3 \times 6$ Jacobian matrix $g^{[1]}(T_{cw}, P^w)$ of $g$ with
respect to a change in pose easily follows as

$$g^{[1]}(T_{cw}, P^w) = \begin{bmatrix} [P^c]_\times & -I_3 \end{bmatrix}$$

In other words, rotating the camera frame is magnified by the "arm" of the point in camera
coordinates, whereas moving the camera simply moves the point's coordinates in the
opposite direction.
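The first-order relation $g(T e^{\hat\xi}, P^w) \approx P^c + P^c \times \omega - v$ can be verified numerically. The sketch below is my own; it assumes the convention that $T$ is the camera pose in the world frame and $g(T, P^w) = T^{-1} P^w$, and uses a generic matrix exponential for $e^{\hat\xi}$:

```python
import numpy as np
from scipy.linalg import expm

def hat(w):
    """3x3 skew-symmetric matrix with hat(w) @ x == np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_hat(xi):
    """4x4 twist matrix from xi = (omega, v)."""
    X = np.zeros((4, 4))
    X[:3, :3] = hat(xi[:3])
    X[:3, 3] = xi[3:]
    return X

def g(T, P_w):
    """World-to-camera transform g(T, P) = T^{-1} P (assumed convention)."""
    return np.linalg.solve(T, np.append(P_w, 1.0))[:3]

T = np.eye(4)
T[:3, 3] = [0.5, 0.0, -0.2]
P_w = np.array([1.0, 2.0, 5.0])
P_c = g(T, P_w)

# 3x6 pose Jacobian: [ [P^c]_x  -I ], rotation block first.
G1 = np.hstack([hat(P_c), -np.eye(3)])

# Compare against an actual small perturbation on the camera side.
xi = 1e-6 * np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
delta = g(T @ expm(se3_hat(xi)), P_w) - P_c
```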
For the many-camera case, the objective function becomes

$$E = \sum_k \left\| h(T_{i_k w}, P^w_{j_k}) - p_k \right\|^2_{\Sigma}$$

where $i_k$ and $j_k$ encode the data association, assumed known here. Techniques such as Levenberg-Marquardt
and obtaining a good initial estimate will be of greater importance for the many-camera
case, as the objective function above is very non-linear, especially when points move closer to the
image planes and "cross over" to the wrong side, in which case local minimizers such as Gauss-Newton
and LM will almost always converge to a local minimum.
Another often-used technique in batch SfM is the "Schur complement" trick, where in the
linear step one first solves for the point updates $\delta_j$ as a function of the pose updates $\xi_i$. However,
the utility of that in an online, visual SLAM scenario is debatable, a discussion for a different part
of the tutorial.
A 3D Rigid Transformations
The Lie group SE(3) is a subgroup of the general linear group GL(4) of 4 × 4 invertible matrices
of the form
$$T \stackrel{\Delta}{=} \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$$
where R ∈ SO(3) is a rotation matrix and t ∈ R3 is a translation vector. An alternative way of
writing down elements of SE(3) is as the ordered pair (R, t), with composition defined as
$$(R_1, t_1)(R_2, t_2) = (R_1 R_2,\; R_1 t_2 + t_1)$$
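As a quick consistency check (my own sketch), the pair composition rule matches the $4 \times 4$ matrix product:

```python
import numpy as np

def to_matrix(R, t):
    """Embed (R, t) as a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

c, s = np.cos(0.4), np.sin(0.4)
R1 = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # rotation about z
t1 = np.array([1.0, 0.0, 2.0])
R2, t2 = np.eye(3), np.array([0.0, 3.0, 0.0])

pair = to_matrix(R1 @ R2, R1 @ t2 + t1)   # (R1, t1)(R2, t2) via the pair rule
prod = to_matrix(R1, t1) @ to_matrix(R2, t2)
```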
Its Lie algebra se(3) is the vector space of 4 × 4 twists ξˆ parameterized by the twist coordinates
ξ ∈ R6 , with the mapping [Murray et al., 1994]
$$\xi = \begin{bmatrix} \omega \\ v \end{bmatrix} \theta
\;\to\;
\hat\xi \stackrel{\Delta}{=} \begin{bmatrix} [\omega]_\times & v \\ 0 & 0 \end{bmatrix} \theta
= \sum_{i=1}^{6} \xi_i G^i$$
where the $4 \times 4$ matrices $G^i$ are called the generators for $\mathfrak{se}(3)$. Note I follow a different convention
from Murray and reserve the first three components for rotation, and the last three for translation.
Hence, with this parameterization, the generators are
$$G^1 = \begin{bmatrix} 0&0&0&0 \\ 0&0&-1&0 \\ 0&1&0&0 \\ 0&0&0&0 \end{bmatrix} \quad
G^2 = \begin{bmatrix} 0&0&1&0 \\ 0&0&0&0 \\ -1&0&0&0 \\ 0&0&0&0 \end{bmatrix} \quad
G^3 = \begin{bmatrix} 0&-1&0&0 \\ 1&0&0&0 \\ 0&0&0&0 \\ 0&0&0&0 \end{bmatrix}$$

$$G^4 = \begin{bmatrix} 0&0&0&1 \\ 0&0&0&0 \\ 0&0&0&0 \\ 0&0&0&0 \end{bmatrix} \quad
G^5 = \begin{bmatrix} 0&0&0&0 \\ 0&0&0&1 \\ 0&0&0&0 \\ 0&0&0&0 \end{bmatrix} \quad
G^6 = \begin{bmatrix} 0&0&0&0 \\ 0&0&0&0 \\ 0&0&0&1 \\ 0&0&0&0 \end{bmatrix}$$
Applying the exponential map to a twist ξ , scaled with θ ∈ R+ , yields a screw motion in SE(3):
$$T(\theta) = \exp\left(\hat\xi \theta\right)$$
A closed form solution for the exponential map is given in [Murray et al., 1994, page 42].
$$\exp\left(\begin{bmatrix} \omega \\ v \end{bmatrix}^{\wedge} \theta\right)
= \begin{bmatrix} e^{[\omega]_\times \theta} & \left(I - e^{[\omega]_\times \theta}\right)(\omega \times v) + \omega\omega^\top v\,\theta \\ 0 & 1 \end{bmatrix}$$
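This closed form (valid for unit-norm $\omega$) can be checked against a generic matrix exponential of the twist. The sketch below is mine, with Rodrigues' formula used for the rotation block:

```python
import numpy as np
from scipy.linalg import expm

def hat(w):
    """3x3 skew-symmetric matrix of w."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_se3(w, v, theta):
    """Closed-form SE(3) exponential for a unit-norm rotation axis w
    [Murray et al., 1994]."""
    W = hat(w)
    # Rodrigues' formula for the rotation block.
    R = np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * (W @ W)
    t = (np.eye(3) - R) @ np.cross(w, v) + np.outer(w, w) @ v * theta
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

w = np.array([0.0, 0.0, 1.0])     # unit rotation axis
v = np.array([1.0, 0.0, 0.0])
theta = 0.3

# Generic matrix exponential of the 4x4 twist, for comparison.
Xi = np.zeros((4, 4))
Xi[:3, :3], Xi[:3, 3] = hat(w), v
T_generic = expm(Xi * theta)
T_closed = exp_se3(w, v, theta)
```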
References
[Absil et al., 2007] Absil, P.-A., Mahony, R., and Sepulchre, R. (2007). Optimization Algorithms
on Matrix Manifolds. Princeton University Press, Princeton, NJ, USA.
[Hartley and Zisserman, 2004] Hartley, R. I. and Zisserman, A. (2004). Multiple View Geometry
in Computer Vision. Cambridge University Press, second edition.
[Murray et al., 1994] Murray, R., Li, Z., and Sastry, S. (1994). A Mathematical Introduction to
Robotic Manipulation. CRC Press.