• The mapping generated by the function is in some sense “smooth” and coherent in the output space.

A contrastive loss function is employed to learn the parameters W of a parameterized function GW, in such a way that neighbors are pulled together and non-neighbors are pushed apart. Prior knowledge can be used to identify the neighbors of each training data point.

The method uses an energy-based model that exploits the given neighborhood relationships to learn the mapping function. For a family of functions G, parameterized by W, the objective is to find a value of W that maps a set of high dimensional inputs to the manifold such that the euclidean distance between points on the manifold, DW(X1, X2) = ||GW(X1) − GW(X2)||2, approximates the “semantic similarity” of the inputs in input space, as provided by a set of neighborhood relationships. No assumption is made about the function GW except that it is differentiable with respect to W.

1.1. Previous Work

The problem of mapping a set of high dimensional points onto a low dimensional manifold has a long history. The two classical methods for the problem are Principal Component Analysis (PCA) [7] and Multi-Dimensional Scaling (MDS) [6]. PCA involves the projection of inputs to a low dimensional subspace that maximizes the variance. In MDS, one computes the projection that best preserves the pairwise distances between input points. However, both methods - PCA in general and MDS in the classical scaling case (when the distances are euclidean) - generate a linear embedding.

In recent years there has been a lot of activity in designing non-linear spectral methods for the problem. These methods involve solving the eigenvalue problem for a particular matrix. Recently proposed algorithms include ISOMAP (2000) by Tenenbaum et al. [1], Locally Linear Embedding - LLE (2000) by Roweis and Saul [15], Laplacian Eigenmaps (2003) due to Belkin and Niyogi [2], and Hessian LLE (2003) by Donoho and Grimes [8]. All the above methods have three main steps. The first is to identify a list of neighbors for each point. Second, a gram matrix is computed using this information. Third, the eigenvalue problem is solved for this matrix. The methods differ in how the gram matrix is computed. None of these methods attempt to compute a function that could map a new, unknown data point without recomputing the entire embedding and without knowing its relationships to the training points. Out-of-sample extensions to the above methods have been proposed by Bengio et al. in [3], but they too rely on a predetermined computable distance metric.

Along a somewhat different line, Schölkopf et al. in 1998 [13] proposed a non-linear extension of PCA, called Kernel PCA. The idea is to non-linearly map the inputs to a high dimensional feature space and then extract the principal components. The algorithm first expresses the PCA computation solely in terms of dot products and then exploits the kernel trick to implicitly compute the high dimensional mapping. The choice of kernel is crucial: different kernels yield dramatically different embeddings. In recent work, Weinberger et al. [11, 12] attempt to learn the kernel matrix when the high dimensional input lies on a low dimensional manifold, by formulating the problem as a semidefinite program. There are also related algorithms for clustering due to Shi and Malik [14] and Ng et al. [17].

The proposed approach is different from these methods; it learns a function that is capable of consistently mapping new points unseen during training. In addition, this function is not constrained by simple distance measures in the input space. The learning architecture is somewhat similar to the one discussed in [4, 5].

Section 2 describes the general framework and the loss function, and draws an analogy with a mechanical spring system. The ideas in this section are made concrete in section 3, where various experimental results are given and comparisons to LLE are made.

2. Learning the Low Dimensional Mapping

The problem is to find a function that maps high dimensional input patterns to lower dimensional outputs, given neighborhood relationships between samples in input space. The graph of neighborhood relationships may come from an information source that is not available for test points, such as prior knowledge, manual labeling, etc. More precisely, given a set of input vectors I = {X1, . . . , XP}, where Xi ∈ ℜ^D for all i = 1, . . . , P, find a parametric function GW : ℜ^D → ℜ^d with d ≪ D, such that it has the following properties:

1. Simple distance measures in the output space (such as euclidean distance) should approximate the neighborhood relationships in the input space.

2. The mapping should not be constrained to implementing simple distance measures in the input space and should be able to learn invariances to complex transformations.

3. It should be faithful even for samples whose neighborhood relationships are unknown.
2.1. The Contrastive Loss Function

Consider the set I of high dimensional training vectors Xi. Assume that for each Xi ∈ I there is a set SXi of training vectors that are deemed similar to Xi. This set can be computed by some prior knowledge - invariance to distortions or temporal proximity, for instance - which does not depend on a simple distance. A meaningful mapping from high to low dimensional space maps similar input vectors to nearby points on the output manifold and dissimilar vectors to distant points. A new loss function whose minimization can produce such a function is now introduced. Unlike conventional learning systems where the loss function is a sum over samples, the loss function here runs over pairs of samples. Let X1, X2 ∈ I be a pair of input vectors shown to the system, and let Y be a binary label assigned to this pair: Y = 0 if X1 and X2 are deemed similar, and Y = 1 if they are deemed dissimilar. Define the parameterized distance function to be learned, DW, between X1 and X2 as the euclidean distance between the outputs of GW. That is,

    DW(X1, X2) = ||GW(X1) − GW(X2)||2                                   (1)

To shorten notation, DW(X1, X2) is written DW. Then the loss function in its most general form is

    L(W) = Σ_{i=1}^{P} L(W, (Y, X1, X2)^i)                              (2)

    L(W, (Y, X1, X2)^i) = (1 − Y) LS(DW^i) + Y LD(DW^i)                 (3)

where (Y, X1, X2)^i is the i-th labeled sample pair, LS is the partial loss function for a pair of similar points, LD the partial loss function for a pair of dissimilar points, and P the number of training pairs (which may be as large as the square of the number of samples).

LS and LD must be designed such that minimizing L with respect to W results in low values of DW for similar pairs and high values of DW for dissimilar pairs. The exact loss function is

    L(W, Y, X1, X2) = (1 − Y) (1/2) (DW)^2 + Y (1/2) {max(0, m − DW)}^2    (4)

where m > 0 is a margin. The margin defines a radius around GW(X): dissimilar pairs contribute to the loss function only if their distance is within this radius (see figure 1). The contrastive term involving dissimilar pairs, LD, is crucial. Simply minimizing DW(X1, X2) over the set of all similar pairs will usually lead to a collapsed solution, since DW and the loss L could then be made zero by setting GW to a constant. Most energy-based models require the use of an explicit contrastive term in the loss function.

[Figure 1: graph of the loss (axis label “Loss (L)”) against DW; graphic not reproduced.]
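To make the pairwise loss concrete, here is a minimal NumPy sketch of equation 4; the function names, the batch layout, and the margin value are illustrative choices rather than anything specified in the paper.

    import numpy as np

    def pairwise_distance(g1, g2):
        """DW: euclidean distance between the mapped points GW(X1) and GW(X2),
        computed for a batch of pairs (one row per pair)."""
        return np.linalg.norm(g1 - g2, axis=1)

    def contrastive_loss(y, d, m=1.0):
        """Equation 4, summed over a batch of pairs as in equation 2.

        y -- labels: 0 for similar pairs, 1 for dissimilar pairs
        d -- DW(X1, X2) for each pair
        m -- the margin; dissimilar pairs farther apart than m contribute nothing
        """
        ls = 0.5 * d ** 2                          # (1/2) DW^2, similar term
        ld = 0.5 * np.maximum(0.0, m - d) ** 2     # (1/2) max(0, m - DW)^2, dissimilar term
        return np.sum((1 - y) * ls + y * ld)

    # toy usage: two pairs of 2-D output points
    g1 = np.array([[0.0, 0.0], [0.0, 0.0]])
    g2 = np.array([[0.1, 0.0], [2.0, 0.0]])
    y = np.array([0, 1])         # first pair similar, second dissimilar
    print(contrastive_loss(y, pairwise_distance(g1, g2)))   # 0.005: only the similar pair contributes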
2.2. Spring Model Analogy

An analogy to a particular mechanical spring system is given to provide an intuition of what happens when the loss function is minimized. The outputs of GW can be thought of as masses attracting and repelling each other with springs. Consider the equation of a spring

    F = −KX                                                             (5)

where F is the force, K is the spring constant and X is the displacement of the spring from its rest length. A spring is attract-only if its rest length is equal to zero: any positive displacement X then results in an attractive force between its ends. A spring is said to be m-repulse-only if its rest length is equal to m: two points connected by an m-repulse-only spring are pushed apart if X is less than m, but the spring has the special property that if it is stretched by a length X > m, no attractive force brings it back to its rest length. Each point is connected to other points using these two kinds of springs. Seen in the light of the loss function, each point is connected by attract-only springs to similar points, and is connected by m-repulse-only springs to dissimilar points. See figure 2.

Consider the loss function LS(W, X1, X2) associated with similar pairs,

    LS(W, X1, X2) = (1/2) (DW)^2                                        (6)

The loss function L is minimized using the stochastic gradient method. The gradient of LS is

    ∂LS/∂W = DW ∂DW/∂W                                                  (7)

Comparing equations 5 and 7, the gradient of LS gives the force of an attract-only spring: ∂DW/∂W plays the role of the spring constant K and DW gives the perturbation X. Now consider the partial loss function LD associated with dissimilar pairs,

    LD(W, X1, X2) = (1/2) {max(0, m − DW)}^2                            (8)

When DW > m, ∂LD/∂W = 0, so there is no gradient (force) on two points that are dissimilar and are at a distance DW > m. If DW < m, then

    ∂LD/∂W = −(m − DW) ∂DW/∂W                                           (9)

Again, comparing equations 5 and 9, it is clear that the dissimilar loss function LD corresponds to the m-repulse-only spring; its gradient gives the force of the spring, ∂DW/∂W gives the spring constant K, and (m − DW) gives the perturbation X. The negative sign denotes the fact that the force is repulsive only. Clearly the force is maximum when DW = 0 and absent when DW = m. See figure 2.

Here, especially in the case of LS, one might think that simply making DW = 0 for all attract-only springs would put the system in equilibrium. Consider, however, figure 2e. Suppose b1 is connected to b2 and b3 with attract-only springs. Then decreasing DW between b1 and b2 will increase DW between b1 and b3. Thus, by minimizing the global loss over all pairs simultaneously, the system settles into an equilibrium of these competing forces rather than collapsing any single spring to zero length.
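The spring picture can be checked numerically. The sketch below differentiates the two partial losses with respect to the mapped output points themselves (rather than with respect to W, as in equations 7 and 9), which plays the role of the force acting on the corresponding masses in the analogy; the helper name and the margin value are assumptions.

    import numpy as np

    def force_on_point(g1, g2, y, m=1.0, eps=1e-9):
        """Gradient of the pair loss with respect to the mapped point g1.

        For a similar pair (y = 0) this mirrors equation 7: an attract-only
        spring whose force grows linearly with the displacement DW.
        For a dissimilar pair (y = 1) it mirrors equation 9: an m-repulse-only
        spring, maximal at DW = 0 and zero once DW >= m."""
        diff = g1 - g2
        d = np.linalg.norm(diff) + eps          # DW
        unit = diff / d                         # direction of dDW/dg1
        if y == 0:
            return d * unit                     # equation 7: a gradient step pulls g1 toward g2
        return -np.maximum(0.0, m - d) * unit   # equation 9: a gradient step pushes g1 away, zero beyond the margin

    print(force_on_point(np.array([0.2, 0.0]), np.array([0.0, 0.0]), y=1))  # strong repulsion inside the margin
    print(force_on_point(np.array([1.5, 0.0]), np.array([0.0, 0.0]), y=1))  # outside the margin: zero force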
3. Experiments

The learning architecture is similar to the one used in [4] and [5]. Called a siamese architecture, it consists of two copies of the function GW which share the same set of parameters W, and a cost module; a loss module whose input is the output of this architecture is placed on top of it. The input to the entire system is a pair of images (X1, X2) and a label Y. The images are passed through the two copies of the function, yielding the two outputs GW(X1) and GW(X2). The cost module then generates the distance DW(GW(X1), GW(X2)). The loss function combines DW with the label Y to produce the scalar loss LS or LD, depending on the label Y. The parameter W is updated using stochastic gradient descent. The gradients can be computed by back-propagation through the loss, the cost, and the two instances of GW; the total gradient is the sum of the contributions from the two instances.
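The paper gives no code for this procedure; the following PyTorch sketch shows one way a single stochastic-gradient update through a siamese pair of GW copies could look. The placeholder two-layer GW, the optimizer settings, and the margin are illustrative assumptions, not details from the paper.

    import torch
    import torch.nn as nn

    class GW(nn.Module):
        """Placeholder mapping function GW; any differentiable network can be used."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 20), nn.Tanh(), nn.Linear(20, out_dim))
        def forward(self, x):
            return self.net(x)

    def contrastive_loss(d, y, m=1.0):
        # equation 4: (1 - Y) * 1/2 * DW^2 + Y * 1/2 * max(0, m - DW)^2
        return torch.mean((1 - y) * 0.5 * d ** 2 + y * 0.5 * torch.clamp(m - d, min=0) ** 2)

    def train_step(gw, optimizer, x1, x2, y):
        """One update: both inputs pass through the *same* GW, so the gradient
        is the sum of the contributions from the two instances."""
        optimizer.zero_grad()
        d = torch.norm(gw(x1) - gw(x2), dim=1)   # DW(GW(X1), GW(X2))
        loss = contrastive_loss(d, y)
        loss.backward()                          # back-propagation through loss, cost, and both copies of GW
        optimizer.step()
        return loss.item()

    gw = GW(in_dim=784, out_dim=2)
    opt = torch.optim.SGD(gw.parameters(), lr=0.01)
    x1, x2 = torch.randn(8, 784), torch.randn(8, 784)
    y = torch.randint(0, 2, (8,)).float()
    print(train_step(gw, opt, x1, x2, y))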
The experiments involving airplane images from the NORB dataset [10] use a 2-layer fully connected neural network as GW; the number of hidden and output units was 20 and 3 respectively. The experiments on the MNIST dataset [9] used a convolutional network as GW (figure 3). Convolutional networks are trainable, non-linear learning machines that operate at the pixel level and learn low-level features and high-level representations in an integrated manner. They are trained end-to-end to map images to outputs. Because of their structure of shared weights and multiple layers, they can learn optimal shift-invariant local feature detectors while maintaining invariance to geometric distortions of the input image.

Figure 3. Architecture of the function GW (a convolutional network) which was learned to map the MNIST data to a low dimensional manifold with invariance to shifts.

The layers of the convolutional network comprise a convolutional layer C1 with 15 feature maps, a subsampling layer S2, a second convolutional layer C3 with 30 feature maps, and a fully connected layer F3 with 2 units. The kernel sizes for C1 and C3 were 6x6 and 9x9 respectively.
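As a concrete, partly assumed rendering of this architecture, the sketch below builds the C1-S2-C3-F3 stack in PyTorch. Only the number of feature maps, the kernel sizes, and the 2-unit output come from the text; the 32x32 input size, the 2x2 average-pooling subsampling, and the tanh nonlinearities are filled-in assumptions.

    import torch
    import torch.nn as nn

    class GWConvNet(nn.Module):
        """Sketch of the convolutional GW described above: C1 (15 maps, 6x6 kernels),
        a subsampling layer S2, C3 (30 maps, 9x9 kernels), and a fully connected F3
        with 2 outputs."""
        def __init__(self):
            super().__init__()
            self.c1 = nn.Conv2d(1, 15, kernel_size=6)    # 1x32x32 -> 15x27x27
            self.s2 = nn.AvgPool2d(kernel_size=2)        # 15x27x27 -> 15x13x13 (assumed subsampling)
            self.c3 = nn.Conv2d(15, 30, kernel_size=9)   # 15x13x13 -> 30x5x5
            self.f3 = nn.Linear(30 * 5 * 5, 2)           # 30x5x5 -> 2-D output point
        def forward(self, x):
            x = torch.tanh(self.c1(x))
            x = self.s2(x)
            x = torch.tanh(self.c3(x))
            return self.f3(x.flatten(start_dim=1))

    print(GWConvNet()(torch.zeros(1, 1, 32, 32)).shape)   # torch.Size([1, 2])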
3.2. Learned Mapping of MNIST samples

The first experiment is designed to establish the basic functionality of the DrLIM approach. The neighborhood graph is generated with euclidean distances and no prior knowledge.

The training set is built from 3000 images of the handwritten digit 4 and 3000 images of the handwritten digit 9, chosen randomly from the MNIST dataset [9]. Approximately 1000 images of each digit comprised the test set. The training images were shuffled, paired, and labeled according to a simple euclidean distance measure: each sample Xi was paired with its 5 nearest neighbors, producing the set SXi. All other possible pairs were labeled dissimilar, producing 30,000 similar pairs and on the order of 18 million dissimilar pairs.
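A small NumPy sketch of this pairing scheme is given below. It is illustrative only: it uses brute-force euclidean distances and does not attempt to reproduce the exact pair counts reported above.

    import numpy as np

    def build_pairs(X, k=5):
        """Label training pairs as described above: each sample is similar (Y = 0)
        to its k euclidean nearest neighbors, and every other pair is dissimilar
        (Y = 1). Brute-force O(n^2) distances, so only suitable for small n."""
        n = len(X)
        sq = (X ** 2).sum(axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared euclidean distances
        np.fill_diagonal(d2, np.inf)                     # a sample is not its own neighbor
        similar = set()
        for i in range(n):
            for j in np.argsort(d2[i])[:k]:              # the neighbor set SXi
                similar.add((min(i, int(j)), max(i, int(j))))
        return [(i, j, 0 if (i, j) in similar else 1)    # (index1, index2, Y)
                for i in range(n) for j in range(i + 1, n)]

    X = np.random.rand(200, 784)      # stand-in for flattened 28x28 MNIST images
    pairs = build_pairs(X)
    print(sum(1 for _, _, y in pairs if y == 0), "similar pairs out of", len(pairs))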
The mapping of the test set to a 2D manifold is shown in figure 4. The lighter-colored blue dots are 9's and the darker-colored red dots are 4's. Several input test samples are shown next to their manifold positions. The 4's and 9's lie in two somewhat overlapping regions, with an overall organization that is primarily determined by the slant angle of the samples. The samples are spread rather uniformly over the populated region.

Figure 4. Experiment demonstrating the effectiveness of DrLIM in a trivial situation with MNIST digits. A euclidean nearest neighbor metric is used to create the local neighborhood relationships among the training samples, and a mapping function is learned with a convolutional network. The figure shows the placement of the test samples in output space. Even though the neighborhood relationships among these samples are unknown, they are well organized and evenly distributed on the 2D manifold.

3.3. Learning a Shift-Invariant Mapping of MNIST samples

In this experiment, the DrLIM approach is evaluated using 2 categories of MNIST digits, distorted by adding samples that have been horizontally translated. The objective is to learn a 2D mapping that is invariant to horizontal translations.

In the distorted set, 3000 images of 4's and 3000 images of 9's are horizontally translated by -6, -3, 3, and 6 pixels and combined with the originals, producing a total of 30,000 samples. The 2000 samples in the test set were distorted in the same way.

First the system was trained using pairs from a euclidean distance neighborhood graph (5 nearest neighbors per sample), as in experiment 1. The large distances between translated samples create a disjoint neighborhood relationship graph, and the resulting mapping is disjoint as well. The output points are clustered according to the translated position of the input sample (figure 5). Within each cluster, however, the samples are well organized and evenly distributed.

For comparison, the LLE algorithm was used to map the distorted MNIST set using the same euclidean distance neighborhood graph. The result was a degenerate embedding in which differently registered samples were completely separated (figure 6). Although there is sporadic local organization, the embedding shows no global coherence.
To make the mapping invariant to translation, the system was then trained with a neighborhood graph built from prior knowledge of the distortions: each sample was paired with (a) its 5 nearest neighbors, (b) its 4 translations, and (c) the 4 translations of each of its 5 nearest neighbors. Additionally, each of the sample's 4 translations was paired with (d) all the above nearest neighbors and translated samples. All other possible pairs were labeled dissimilar.
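The sketch below illustrates, in NumPy, how the similar-pair set of this kind could be assembled for one sample. The translate helper (a simple np.roll) and the data layout are assumptions, since the paper does not specify how the shifted images were generated.

    import numpy as np

    SHIFTS = (-6, -3, 3, 6)

    def translate(img, dx):
        """Shift a 2-D image horizontally by dx pixels (np.roll wraps around;
        the paper's exact border handling is not specified)."""
        return np.roll(img, dx, axis=1)

    def similar_pairs_for_sample(i, images, neighbors):
        """Assemble the similar (Y = 0) pairs generated by sample i under the
        scheme (a)-(d) above. neighbors[i] holds the indices of the 5 euclidean
        nearest neighbors of sample i; pairs not generated here would be labeled
        dissimilar (Y = 1)."""
        xi = images[i]
        xi_shifts = [translate(xi, dx) for dx in SHIFTS]                      # (b)
        nbr_imgs = [images[j] for j in neighbors[i]]                          # (a)
        nbr_shifts = [translate(images[j], dx)
                      for j in neighbors[i] for dx in SHIFTS]                 # (c)
        pairs = [(xi, p) for p in nbr_imgs + xi_shifts + nbr_shifts]          # (a)-(c)
        pairs += [(s, p) for s in xi_shifts                                   # (d): each translation of xi
                  for p in nbr_imgs + nbr_shifts]                             # with the neighbors and their translations
        return pairs

    images = np.random.rand(10, 28, 28)
    neighbors = {0: [1, 2, 3, 4, 5]}
    print(len(similar_pairs_for_sample(0, images, neighbors)), "similar pairs from sample 0")  # 129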
The mapping of the test set samples is shown in figure 7. The lighter-colored blue dots are 4's and the darker-colored red dots are 9's. As desired, there is no organization on the basis of translation; in fact, translated versions of a given character are all tightly packed in small regions on the manifold.
Because the loss function includes a contrastive term over dissimilar pairs, the system avoids collapse to a constant function and maintains an equilibrium in output space, much as a mechanical system of interconnected springs does.

The experiments with LLE show that LLE is most useful where the input samples are locally very similar and well-registered. If this is not the case, then LLE may give degenerate results. Although it is possible to run LLE with arbitrary neighborhood relationships, the linear reconstruction of the samples negates the effect of very distant neighbors. Other dimensionality reduction methods have avoided this limitation, but none produces a function that can accept new samples without recomputation or prior knowledge.

Creating a dimensionality reduction mapping using prior knowledge has other uses. Given the success of the NORB experiment, in which the positions of the camera were learned from prior knowledge of the temporal connections between images, it may be feasible to learn a robot's position and heading from image sequences.

References

[1] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 15(6):1373–1396, 2003.
[3] Y. Bengio, J. F. Paiement, and P. Vincent. Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[4] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a siamese time delay neural network. In J. Cowan and G. Tesauro, editors, Advances in Neural Information Processing Systems, 1993.
[5] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-05), 1:539–546, 2005.
[6] T. Cox and M. Cox. Multidimensional Scaling. Chapman and Hall, London, 1994.
[7] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[8] D. L. Donoho and C. E. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100:5591–5596, 2003.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[10] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04), 2:97–104, 2004.
[11] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04), 2:988–995, 2004.
[12] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML-04), pages 839–846, 2004.
[13] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
[14] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[15] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[16] P. Vincent and Y. Bengio. A neural support vector network architecture with adaptive kernels. In Proceedings of the International Joint Conference on Neural Networks, volume 5, July 2000.
[17] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14:849–856, 2002.