Visualizing Data Using t-SNE
Laurens van der Maaten and Geoffrey Hinton
Abstract
We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each
datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic
Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces
significantly better visualizations by reducing the tendency to crowd points together in the center
of the map. t-SNE is better than existing techniques at creating a single map that reveals structure
at many different scales. This is particularly important for high-dimensional data that lie on several
different, but related, low-dimensional manifolds, such as images of objects from multiple classes
seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how
t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the
data to influence the way in which a subset of the data is displayed. We illustrate the performance of
t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization
techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualiza-
tions produced by t-SNE are significantly better than those produced by the other techniques on
almost all of the data sets.
Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms,
multidimensional scaling
1. Introduction
data to the human observer. This severely limits the applicability of these techniques to real-world
data sets that contain thousands of high-dimensional datapoints.
A large number of nonlinear dimensionality reduction techniques that aim to preserve the local
structure of data have been proposed, many of which are reviewed by Lee and Verleysen (2007).
In particular, we mention the following seven techniques: (1) Sammon mapping (Sammon, 1969),
(2) curvilinear components analysis (CCA; Demartines and Hérault, 1997), (3) Stochastic Neighbor
Embedding (SNE; Hinton and Roweis, 2002), (4) Isomap (Tenenbaum et al., 2000), (5) Maximum
Variance Unfolding (MVU; Weinberger et al., 2004), (6) Locally Linear Embedding (LLE; Roweis
and Saul, 2000), and (7) Laplacian Eigenmaps (Belkin and Niyogi, 2002). Despite the strong per-
formance of these techniques on artificial data sets, they are often not very successful at visualizing
real, high-dimensional data. In particular, most of the techniques are not capable of retaining both
the local and the global structure of the data in a single map. For instance, a recent study reveals
that even a semi-supervised variant of MVU is not capable of separating handwritten digits into
their natural clusters (Song et al., 2007).
In this paper, we describe a way of converting a high-dimensional data set into a matrix of pair-
wise similarities and we introduce a new technique, called “t-SNE”, for visualizing the resulting
similarity data. t-SNE is capable of capturing much of the local structure of the high-dimensional
data very well, while also revealing global structure such as the presence of clusters at several scales.
We illustrate the performance of t-SNE by comparing it to the seven dimensionality reduction tech-
niques mentioned above on five data sets from a variety of domains. Because of space limitations,
most of the (7 + 1) × 5 = 40 maps are presented in the supplemental material, but the maps that we
present in the paper are sufficient to demonstrate the superiority of t-SNE.
The outline of the paper is as follows. In Section 2, we outline SNE as presented by Hinton and
Roweis (2002), which forms the basis for t-SNE. In Section 3, we present t-SNE, which has two
important differences from SNE. In Section 4, we describe the experimental setup and the results
of our experiments. Subsequently, Section 5 shows how t-SNE can be modified to visualize real-
world data sets that contain many more than 10,000 datapoints. The results of our experiments are
discussed in more detail in Section 6. Our conclusions and suggestions for future work are presented
in Section 7.
where σi is the variance of the Gaussian that is centered on datapoint x i . The method for determining
the value of σi is presented later in this section. Because we are only interested in modeling pairwise
similarities, we set the value of pi|i to zero. For the low-dimensional counterparts yi and y j of the
high-dimensional datapoints xi and x j , it is possible to compute a similar conditional probability,
which we denote by q_{j|i}. We set² the variance of the Gaussian that is employed in the computation
of the conditional probabilities q_{j|i} to 1/\sqrt{2}. Hence, we model the similarity of map point y_j to map
point y_i by

q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}.
Again, since we are only interested in modeling pairwise similarities, we set q i|i = 0.
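For concreteness, a minimal NumPy sketch of the two conditional distributions follows. This is our own illustration (not the authors' Matlab implementation); it assumes the Gaussian conditionals described above, with p_{i|i} and q_{i|i} set to zero, and the function and variable names are ours.

import numpy as np

def conditional_p(X, sigmas):
    # High-dimensional conditional similarities p_{j|i}; row i is the distribution P_i.
    D = np.square(X[:, None, :] - X[None, :, :]).sum(-1)     # pairwise ||x_i - x_j||^2
    P = np.exp(-D / (2.0 * np.square(sigmas)[:, None]))      # Gaussian with bandwidth sigma_i
    np.fill_diagonal(P, 0.0)                                  # p_{i|i} = 0
    return P / P.sum(axis=1, keepdims=True)

def conditional_q(Y):
    # Low-dimensional conditional similarities q_{j|i}; sigma = 1/sqrt(2), so 2*sigma^2 = 1.
    D = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)                                  # q_{i|i} = 0
    return Q / Q.sum(axis=1, keepdims=True)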
If the map points yi and y j correctly model the similarity between the high-dimensional data-
points xi and x j , the conditional probabilities p j|i and q j|i will be equal. Motivated by this observa-
tion, SNE aims to find a low-dimensional data representation that minimizes the mismatch between
p j|i and q j|i . A natural measure of the faithfulness with which q j|i models p j|i is the Kullback-
Leibler divergence (which is in this case equal to the cross-entropy up to an additive constant). SNE
minimizes the sum of Kullback-Leibler divergences over all datapoints using a gradient descent
method. The cost function C is given by
C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}, \qquad (2)
in which Pi represents the conditional probability distribution over all other datapoints given data-
point xi , and Qi represents the conditional probability distribution over all other map points given
map point yi . Because the Kullback-Leibler divergence is not symmetric, different types of error
in the pairwise distances in the low-dimensional map are not weighted equally. In particular, there
is a large cost for using widely separated map points to represent nearby datapoints (i.e., for using
1. SNE can also be applied to data sets that consist of pairwise similarities between objects rather than high-dimensional
vector representations of each object, provided these similarities can be interpreted as conditional probabilities. For
example, human word association data consists of the probability of producing each possible word in response to a
given word, as a result of which it is already in the form required by SNE.
2. Setting the variance in the low-dimensional Gaussians to another value only results in a rescaled version of the final
map. Note that by using the same variance for every datapoint in the low-dimensional map, we lose the property
that the data is a perfect model of itself if we embed it in a space of the same dimensionality, because in the high-
dimensional space, we used a different variance σi in each Gaussian.
a small q j|i to model a large p j|i ), but there is only a small cost for using nearby map points to
represent widely separated datapoints. This small cost comes from wasting some of the probability
mass in the relevant Q distributions. In other words, the SNE cost function focuses on retaining the
local structure of the data in the map (for reasonable values of the variance of the Gaussian in the
high-dimensional space, σi ).
The remaining parameter to be selected is the variance σi of the Gaussian that is centered over
each high-dimensional datapoint, xi . It is not likely that there is a single value of σi that is optimal
for all datapoints in the data set because the density of the data is likely to vary. In dense regions,
a smaller value of σi is usually more appropriate than in sparser regions. Any particular value of
σi induces a probability distribution, Pi , over all of the other datapoints. This distribution has an
entropy which increases as σi increases. SNE performs a binary search for the value of σ i that
produces a Pi with a fixed perplexity that is specified by the user.3 The perplexity is defined as
Perp(P_i) = 2^{H(P_i)},

where H(P_i) is the Shannon entropy of P_i measured in bits: H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}.
The perplexity can be interpreted as a smooth measure of the effective number of neighbors. The
performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5
and 50.
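The binary search over σ_i can be sketched as follows. This is a hypothetical helper of our own (not the authors' code); it assumes the squared distances from datapoint i to all other points are precomputed, and it uses the fact noted in footnote 3 that the perplexity increases monotonically with σ_i.

import numpy as np

def sigma_for_perplexity(sq_dists_i, target_perplexity, tol=1e-5, max_iter=50):
    # Binary search for the sigma_i whose P_i has perplexity 2^{H(P_i)} closest to the target.
    # sq_dists_i holds ||x_i - x_j||^2 for all j != i.
    lo, hi = 0.0, np.inf
    sigma = 1.0
    for _ in range(max_iter):
        p = np.exp(-(sq_dists_i - sq_dists_i.min()) / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))             # H(P_i) in bits
        perplexity = 2.0 ** entropy
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:                    # too many effective neighbours:
            hi = sigma                                        # shrink sigma
        else:
            lo = sigma                                        # otherwise grow sigma
        sigma = (lo + hi) / 2.0 if np.isfinite(hi) else 2.0 * sigma
    return sigma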
The minimization of the cost function in Equation 2 is performed using a gradient descent
method. The gradient has a surprisingly simple form
\frac{\delta C}{\delta y_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j).
Physically, the gradient may be interpreted as the resultant force created by a set of springs between
the map point yi and all other map points y j . All springs exert a force along the direction (yi − y j ).
The spring between yi and y j repels or attracts the map points depending on whether the distance
between the two in the map is too small or too large to represent the similarities between the two
high-dimensional datapoints. The force exerted by the spring between y i and y j is proportional to its
length, and also proportional to its stiffness, which is the mismatch (p j|i − q j|i + pi| j − qi| j ) between
the pairwise similarities of the data points and the map points.
The gradient descent is initialized by sampling map points randomly from an isotropic Gaussian
with small variance that is centered around the origin. In order to speed up the optimization and to
avoid poor local minima, a relatively large momentum term is added to the gradient. In other words,
the current gradient is added to an exponentially decaying sum of previous gradients in order to
determine the changes in the coordinates of the map points at each iteration of the gradient search.
Mathematically, the gradient update with a momentum term is given by
Y^{(t)} = Y^{(t-1)} + \eta \frac{\delta C}{\delta Y} + \alpha(t)\left(Y^{(t-1)} - Y^{(t-2)}\right),
3. Note that the perplexity increases monotonically with the variance σ i .
where Y (t) indicates the solution at iteration t, η indicates the learning rate, and α(t) represents the
momentum at iteration t.
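A compact sketch of the SNE gradient and of one momentum update is given below. This is our own NumPy rendering with hypothetical names, assuming P_cond[i, j] stores p_{j|i} and Q_cond[i, j] stores q_{j|i}; the sketch writes the descent step with an explicit minus sign so that the Kullback-Leibler cost decreases.

import numpy as np

def sne_gradient(P_cond, Q_cond, Y):
    # dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)
    M = P_cond - Q_cond + P_cond.T - Q_cond.T                 # spring stiffnesses
    return 2.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)         # sum_j M_ij (y_i - y_j), vectorized

def momentum_step(Y, Y_prev, grad, eta, alpha):
    # One update of the map: a step against the gradient plus an exponentially decaying
    # sum of previous gradients, implemented as a momentum term.
    return Y - eta * grad + alpha * (Y - Y_prev), Y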
In addition, in the early stages of the optimization, Gaussian noise is added to the map points
after each iteration. Gradually reducing the variance of this noise performs a type of simulated
annealing that helps the optimization to escape from poor local minima in the cost function. If the
variance of the noise changes very slowly at the critical point at which the global structure of the
map starts to form, SNE tends to find maps with a better global organization. Unfortunately, this
requires sensible choices of the initial amount of Gaussian noise and the rate at which it decays.
Moreover, these choices interact with the amount of momentum and the step size that are employed
in the gradient descent. It is therefore common to run the optimization several times on a data set
to find appropriate values for the parameters.4 In this respect, SNE is inferior to methods that allow
convex optimization and it would be useful to find an optimization method that gives good results
without requiring the extra computation time and parameter choices introduced by the simulated
annealing.
where again, we set p_{ii} and q_{ii} to zero. We refer to this type of SNE as symmetric SNE, because it
has the property that p_{ij} = p_{ji} and q_{ij} = q_{ji} for all i and j. In symmetric SNE, the pairwise
similarities in the low-dimensional map q_{ij} are given by

q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_k - y_l\|^2)}. \qquad (3)

4. Picking the best map after several runs as a visualization of the data is not nearly as problematic as picking the model
that does best on a test set during supervised learning. In visualization, the aim is to see the structure in the training
data, not to generalize to held-out test data.
The obvious way to define the pairwise similarities in the high-dimensional space p_{ij} is

p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_k - x_l\|^2 / 2\sigma^2)},

but this causes problems when a high-dimensional datapoint x_i is an outlier (i.e., all pairwise distances
\|x_i - x_j\|^2 are large for x_i). For such an outlier, the values of p_{ij} are extremely small for
all j, so the location of its low-dimensional map point y_i has very little effect on the cost function.
As a result, the position of the map point is not well determined by the positions of the other map
points. We circumvent this problem by defining the joint probabilities p_{ij} in the high-dimensional
space to be the symmetrized conditional probabilities, that is, we set p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}. This ensures
that \sum_j p_{ij} > \frac{1}{2n} for all datapoints x_i, as a result of which each datapoint x_i makes a significant
contribution to the cost function. In the low-dimensional space, symmetric SNE simply uses Equation 3.
The main advantage of the symmetric version of SNE is the simpler form of its gradient, which is
faster to compute. The gradient of symmetric SNE is fairly similar to that of asymmetric SNE, and
is given by
\frac{\delta C}{\delta y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j).
In preliminary experiments, we observed that symmetric SNE seems to produce maps that are just
as good as asymmetric SNE, and sometimes even a little better.
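The symmetrized joint probabilities and the symmetric SNE gradient can be sketched as follows. Again this is our own NumPy illustration with hypothetical names, assuming P_cond[i, j] stores p_{j|i}.

import numpy as np

def joint_p(P_cond):
    # Symmetrized joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2n).
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

def symmetric_sne_gradient(P, Q, Y):
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)
    M = P - Q
    return 4.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)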
that are at a moderate distance from datapoint i will have to be placed much too far away in the
two-dimensional map. In SNE, the spring connecting datapoint i to each of these too-distant map
points will thus exert a very small attractive force. Although these attractive forces are very small,
the very large number of such forces crushes together the points in the center of the map, which
prevents gaps from forming between the natural clusters. Note that the crowding problem is not
specific to SNE, but that it also occurs in other local techniques for multidimensional scaling such
as Sammon mapping.
An attempt to address the crowding problem by adding a slight repulsion to all springs was pre-
sented by Cook et al. (2007). The slight repulsion is created by introducing a uniform background
model with a small mixing proportion, ρ. So however far apart two map points are, q_{ij} can never fall
below 2ρ/(n(n−1)) (because the uniform background distribution is over n(n−1)/2 pairs). As a result, for
datapoints that are far apart in the high-dimensional space, q i j will always be larger than pi j , leading
to a slight repulsion. This technique is called UNI-SNE and although it usually outperforms stan-
dard SNE, the optimization of the UNI-SNE cost function is tedious. The best optimization method
known is to start by setting the background mixing proportion to zero (i.e., by performing standard
SNE). Once the SNE cost function has been optimized using simulated annealing, the background
mixing proportion can be increased to allow some gaps to form between natural clusters as shown
by Cook et al. (2007). Optimizing the UNI-SNE cost function directly does not work because two
map points that are far apart will get almost all of their qi j from the uniform background. So even
if their pi j is large, there will be no attractive force between them, because a small change in their
separation will have a vanishingly small proportional effect on q i j . This means that if two parts of
a cluster get separated early on in the optimization, there is no force to pull them back together.
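The floor on q_{ij} can be made concrete with the following sketch. This is our own NumPy reading of the description above, not the implementation of Cook et al. (2007); it treats q_{ij} as a distribution over the unordered pairs of map points.

import numpy as np

def uni_sne_q(Y, rho):
    # Map similarities with a uniform background of mixing proportion rho, treated as a
    # distribution over the n(n-1)/2 unordered pairs; q_ij never falls below 2*rho/(n*(n-1)).
    n = Y.shape[0]
    D = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    G = np.exp(-D)
    np.fill_diagonal(G, 0.0)
    G /= G.sum() / 2.0                                        # Gaussian part, normalized over pairs
    return (1.0 - rho) * G + rho * 2.0 / (n * (n - 1)) * (1.0 - np.eye(n))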
Figure 1: Gradients of three types of SNE as a function of the pairwise Euclidean distance between
two points in the high-dimensional and the pairwise distance between the points in the
low-dimensional data representation.
A theoretical justification for the selection of the Student t-distribution is that it is closely related to the Gaussian distribution, as the
Student t-distribution is an infinite mixture of Gaussians. A computationally convenient property
is that it is much faster to evaluate the density of a point under a Student t-distribution than under
a Gaussian because it does not involve an exponential, even though the Student t-distribution is
equivalent to an infinite mixture of Gaussians with different variances.
The gradient of the Kullback-Leibler divergence between P and the Student-t based joint prob-
ability distribution Q (computed using Equation 4) is derived in Appendix A, and is given by
\frac{\delta C}{\delta y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}. \qquad (5)
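For reference, here is a minimal NumPy sketch of the Student-t based q_{ij} and of the gradient in Equation 5. It is our own illustration with hypothetical names; P is assumed to be the matrix of joint probabilities p_{ij}.

import numpy as np

def tsne_q_and_gradient(P, Y):
    # Student-t based joint similarities q_ij and the gradient of Equation 5.
    D = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    W = 1.0 / (1.0 + D)                                       # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                                           # q_ij, normalized over all pairs
    M = (P - Q) * W                                           # (p_ij - q_ij)(1 + ||y_i - y_j||^2)^{-1}
    grad = 4.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)         # 4 * sum_j M_ij (y_i - y_j)
    return Q, grad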
In Figure 1(a) to 1(c), we show the gradients between two low-dimensional datapoints y i and y j as
a function of their pairwise Euclidean distances in the high-dimensional and the low-dimensional
space (i.e., as a function of \|x_i - x_j\| and \|y_i - y_j\|) for the symmetric versions of SNE, UNI-SNE,
and t-SNE. In the figures, positive values of the gradient represent an attraction between the low-
dimensional datapoints yi and y j , whereas negative values represent a repulsion between the two
datapoints. From the figures, we observe two main advantages of the t-SNE gradient over the
gradients of SNE and UNI-SNE.
First, the t-SNE gradient strongly repels dissimilar datapoints that are modeled by a small pair-
wise distance in the low-dimensional representation. SNE has such a repulsion as well, but its effect
is minimal compared to the strong attractions elsewhere in the gradient (the largest attraction in our
graphical representation of the gradient is approximately 19, whereas the largest repulsion is approx-
imately 1). In UNI-SNE, the amount of repulsion between dissimilar datapoints is slightly larger,
however, this repulsion is only strong when the pairwise distance between the points in the low-
dimensional representation is already large (which is often not the case, since the low-dimensional
representation is initialized by sampling from a Gaussian with a very small variance that is centered
around the origin).
Second, although t-SNE introduces strong repulsions between dissimilar datapoints that are
modeled by small pairwise distances, these repulsions do not go to infinity. In this respect, t-SNE
differs from UNI-SNE, in which the strength of the repulsion between very dissimilar datapoints
is proportional to their pairwise distance in the low-dimensional map, which may cause dissimilar
datapoints to move much too far away from each other.
Taken together, t-SNE puts emphasis on (1) modeling dissimilar datapoints by means of large
pairwise distances, and (2) modeling similar datapoints by means of small pairwise distances. More-
over, as a result of these characteristics of the t-SNE cost function (and as a result of the approximate
scale invariance of the Student t-distribution), the optimization of the t-SNE cost function is much
easier than the optimization of the cost functions of SNE and UNI-SNE. Specifically, t-SNE in-
troduces long-range forces in the low-dimensional map that can pull back together two (clusters
of) similar points that get separated early on in the optimization. SNE and UNI-SNE do not have
such long-range forces, as a result of which SNE and UNI-SNE need to use simulated annealing to
obtain reasonable solutions. Instead, the long-range forces in t-SNE facilitate the identification of
good local optima without resorting to simulated annealing.
distances of the map points from the origin. The magnitude of this penalty term and the iteration at
which it is removed are set by hand, but the behavior is fairly robust across variations in these two
additional optimization parameters.
A less obvious way to improve the optimization, which we call “early exaggeration”, is to
multiply all of the pi j ’s by, for example, 4, in the initial stages of the optimization. This means that
almost all of the qi j ’s, which still add up to 1, are much too small to model their corresponding p i j ’s.
As a result, the optimization is encouraged to focus on modeling the large p i j ’s by fairly large qi j ’s.
The effect is that the natural clusters in the data tend to form tight widely separated clusters in the
map. This creates a lot of relatively empty space in the map, which makes it much easier for the
clusters to move around relative to one another in order to find a good global organization.
In all the visualizations presented in this paper and in the supporting material, we used exactly
the same optimization procedure. We used the early exaggeration method with an exaggeration
of 4 for the first 50 iterations (note that early exaggeration is not included in the pseudocode in
Algorithm 1). The number of gradient descent iterations T was set to 1000, and the momentum term
was set to α(t) = 0.5 for t < 250 and α(t) = 0.8 for t ≥ 250. The learning rate η is initially set to 100
and it is updated after every iteration by means of the adaptive learning rate scheme described by
Jacobs (1988). A Matlab implementation of the resulting algorithm is available at
http://ticc.uvt.nl/~lvdrmaaten/tsne.
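The complete optimization schedule described above can be sketched as follows. This is our own NumPy rendering with hypothetical names, not the Matlab implementation; it replaces the adaptive learning rate scheme of Jacobs (1988) with a fixed learning rate for brevity.

import numpy as np

def tsne_optimize(P, n_dims=2, T=1000, eta=100.0, seed=0):
    # Early exaggeration (factor 4, first 50 iterations), momentum 0.5 before iteration 250
    # and 0.8 afterwards, small isotropic Gaussian initialization.
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    Y = 1e-4 * rng.standard_normal((n, n_dims))
    Y_prev = Y.copy()
    for t in range(T):
        P_eff = 4.0 * P if t < 50 else P                      # early exaggeration
        alpha = 0.5 if t < 250 else 0.8                       # momentum schedule
        D = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
        W = 1.0 / (1.0 + D)                                   # Student-t kernel
        np.fill_diagonal(W, 0.0)
        Q = W / W.sum()
        M = (P_eff - Q) * W
        grad = 4.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)     # gradient of Equation 5
        Y, Y_prev = Y - eta * grad + alpha * (Y - Y_prev), Y
        Y -= Y.mean(axis=0)                                   # keep the map centered
    return Y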
4. Experiments
To evaluate t-SNE, we present experiments in which t-SNE is compared to seven other non-parametric
techniques for dimensionality reduction. Because of space limitations, in the paper, we only com-
pare t-SNE with: (1) Sammon mapping, (2) Isomap, and (3) LLE. In the supporting material, we
also compare t-SNE with: (4) CCA, (5) SNE, (6) MVU, and (7) Laplacian Eigenmaps. We per-
formed experiments on five data sets that represent a variety of application domains. Again because
of space limitations, we restrict ourselves to three data sets in the paper. The results of our experi-
ments on the remaining two data sets are presented in the supplemental material.
In Section 4.1, the data sets that we employed in our experiments are introduced. The setup of
the experiments is presented in Section 4.2. In Section 4.3, we present the results of our experiments.
different objects viewed from 72 equally spaced orientations, yielding a total of 1,440 images. The
images contain 32 × 32 = 1,024 pixels.
4.3 Results
In Figures 2 and 3, we show the results of our experiments with t-SNE, Sammon mapping, Isomap,
and LLE on the MNIST data set. The results reveal the strong performance of t-SNE compared to
the other techniques. In particular, Sammon mapping constructs a “ball” in which only three classes
(representing the digits 0, 1, and 7) are somewhat separated from the other classes. Isomap and
LLE produce solutions in which there are large overlaps between the digit classes. In contrast, t-
SNE constructs a map in which the separation between the digit classes is almost perfect. Moreover,
detailed inspection of the t-SNE map reveals that much of the local structure of the data (such as
the orientation of the ones) is captured as well. This is illustrated in more detail in Section 5 (see
Figure 7). The map produced by t-SNE contains some points that are clustered with the wrong
class, but most of these points correspond to distorted digits, many of which are difficult to identify.
Figure 4 shows the results of applying t-SNE, Sammon mapping, Isomap, and LLE to the Olivetti
faces data set. Again, Isomap and LLE produce solutions that provide little insight into the class
8. Isomap and LLE require data that gives rise to a neighborhood graph that is connected.
Figure 2: Visualizations of 6,000 handwritten digits from the MNIST data set.
Figure 3: Visualizations of 6,000 handwritten digits from the MNIST data set.
structure of the data. The map constructed by Sammon mapping is significantly better, since it
models many of the members of each class fairly close together, but none of the classes are clearly
separated in the Sammon map. In contrast, t-SNE does a much better job of revealing the natural
classes in the data. Some individuals have their ten images split into two clusters, usually because a
subset of the images have the head facing in a significantly different direction, or because they have
a very different expression or glasses. For these individuals, it is not clear that their ten images form
a natural class when using Euclidean distance in pixel space.
Figure 5 shows the results of applying t-SNE, Sammon mapping, Isomap, and LLE to the COIL-
20 data set. For many of the 20 objects, t-SNE accurately represents the one-dimensional manifold
of viewpoints as a closed loop. For objects which look similar from the front and the back, t-SNE
distorts the loop so that the images of front and back are mapped to nearby points. For the four
types of toy car in the COIL-20 data set (the four aligned “sausages” in the bottom-left of the t-
SNE map), the four rotation manifolds are aligned by the orientation of the cars to capture the high
similarity between different cars at the same orientation. This prevents t-SNE from keeping the
four manifolds clearly separate. Figure 5 also reveals that the other three techniques are not nearly
as good at cleanly separating the manifolds that correspond to very different objects. In addition,
Isomap and LLE only visualize a small number of classes from the COIL-20 data set, because the
data set comprises a large number of widely separated submanifolds that give rise to small connected
components in the neighborhood graph.
Figure 6: An illustration of the advantage of the random walk version of t-SNE over a standard
landmark approach. The shaded points A, B, and C are three (almost) equidistant land-
mark points, whereas the non-shaded datapoints are non-landmark points. The arrows
represent a directed neighborhood graph where k = 3. In a standard landmark approach,
the pairwise affinity between A and B is approximately equal to the pairwise affinity be-
tween A and C. In the random walk version of t-SNE, the pairwise affinity between A
and B is much larger than the pairwise affinity between A and C, and therefore, it reflects
the structure of the data much better.
Such a naive landmark approach, however, fails to make use of the information that the undisplayed datapoints provide about the underlying manifolds.
Suppose, for example, that A, B, and C are all equidistant in the high-dimensional space. If there
are many undisplayed datapoints between A and B and none between A and C, it is much more
likely that A and B are part of the same cluster than A and C. This is illustrated in Figure 6. In this
section, we show how t-SNE can be modified to display a random subset of the datapoints (so-called
landmark points) in a way that uses information from the entire (possibly very large) data set.
We start by choosing a desired number of neighbors and creating a neighborhood graph for all
of the datapoints. Although this is computationally intensive, it is only done once. Then, for each
of the landmark points, we define a random walk starting at that landmark point and terminating
as soon as it lands on another landmark point. During a random walk, the probability of choosing
an edge emanating from node x_i to node x_j is proportional to e^{-\|x_i - x_j\|^2}. We define p_{j|i} to be the
fraction of random walks starting at landmark point xi that terminate at landmark point x j . This has
some resemblance to the way Isomap measures pairwise distances between points. However, as in
diffusion maps (Lafon and Lee, 2006; Nadler et al., 2006), rather than looking for the shortest path
through the neighborhood graph, the random walk-based affinity measure integrates over all paths
through the neighborhood graph. As a result, the random walk-based affinity measure is much less
sensitive to “short-circuits” (Lee and Verleysen, 2005), in which a single noisy datapoint provides
a bridge between two regions of dataspace that should be far apart in the map. Similar approaches
using random walks have also been successfully applied to, for example, semi-supervised learning
(Szummer and Jaakkola, 2001; Zhu et al., 2003) and image segmentation (Grady, 2006).
The most obvious way to compute the random walk-based similarities p j|i is to explicitly per-
form the random walks on the neighborhood graph, which works very well in practice, given that
one can easily perform one million random walks per second. Alternatively, Grady (2006) presents
an analytical solution to compute the pairwise similarities p j|i that involves solving a sparse linear
system. The analytical solution to compute the similarities p j|i is sketched in Appendix B. In pre-
liminary experiments, we did not find significant differences between performing the random walks
explicitly and the analytical solution. In the experiment we present below, we explicitly performed
the random walks because this is computationally less expensive. However, for very large data sets
in which the landmark points are very sparse, the analytical solution may be more appropriate.
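A Monte-Carlo sketch of the explicit random walk procedure is given below. This is our own illustration with hypothetical names; it assumes a directed k-nearest-neighbour graph and terminates each walk at the first landmark point other than its starting point.

import numpy as np

def random_walk_affinities(X, landmarks, k=20, walks_per_landmark=1000, seed=0):
    # p_{j|i} is estimated as the fraction of random walks started at landmark i that
    # terminate at landmark j; transition probabilities out of a node are proportional
    # to exp(-||x_i - x_j||^2) over its k nearest neighbours.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    D = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    np.fill_diagonal(D, np.inf)
    nbrs = np.argsort(D, axis=1)[:, :k]                       # directed k-nearest-neighbour graph
    W = np.exp(-np.take_along_axis(D, nbrs, axis=1))          # edge weights exp(-||x_i - x_j||^2)
    W /= W.sum(axis=1, keepdims=True)                         # per-node transition probabilities
    is_landmark = np.zeros(n, dtype=bool)
    is_landmark[landmarks] = True
    col = {p: i for i, p in enumerate(landmarks)}
    P = np.zeros((len(landmarks), len(landmarks)))
    for row, start in enumerate(landmarks):
        for _ in range(walks_per_landmark):
            node = start
            for _ in range(100 * n):                          # cap walk length for safety
                node = rng.choice(nbrs[node], p=W[node])
                if is_landmark[node] and node != start:
                    P[row, col[node]] += 1
                    break
    return P / np.maximum(P.sum(axis=1, keepdims=True), 1)    # fraction of terminating walks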
Figure 7 shows the results of an experiment, in which we applied the random walk version
of t-SNE to 6,000 randomly selected digits from the MNIST data set, using all 60,000 digits to
compute the pairwise affinities p j|i . In the experiment, we used a neighborhood graph that was
constructed using a value of k = 20 nearest neighbors. 9 The inset of the figure shows the same
visualization as a scatterplot in which the colors represent the labels of the digits. In the t-SNE
map, all classes are clearly separated and the “continental” sevens form a small separate cluster.
Moreover, t-SNE reveals the main dimensions of variation within each class, such as the orientation
of the ones, fours, sevens, and nines, or the “loopiness” of the twos. The strong performance of
t-SNE is also reflected in the generalization error of nearest neighbor classifiers that are trained on
the low-dimensional representation. Whereas the generalization error (measured using 10-fold cross
validation) of a 1-nearest neighbor classifier trained on the original 784-dimensional datapoints is
5.75%, the generalization error of a 1-nearest neighbor classifier trained on the two-dimensional
data representation produced by t-SNE is only 5.13%. The computational requirements of random
walk t-SNE are reasonable: it took only one hour of CPU time to construct the map in Figure 7.
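The nearest neighbor evaluation can be reproduced along the following lines. This sketch uses scikit-learn rather than the original Matlab code, and the data variables in the usage comments are placeholders.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def one_nn_error(features, labels, folds=10):
    # 10-fold cross-validated error of a 1-nearest-neighbour classifier.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), features, labels, cv=folds)
    return 1.0 - scores.mean()

# Hypothetical usage (X_pixels: 784-dimensional datapoints, Y_map: 2-D t-SNE coordinates, y: labels):
# one_nn_error(X_pixels, y)   # approximately 0.0575 is reported above
# one_nn_error(Y_map, y)      # approximately 0.0513 is reported above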
6. Discussion
The results in the previous two sections (and those in the supplemental material) demonstrate the
performance of t-SNE on a wide variety of data sets. In this section, we discuss the differences
between t-SNE and other non-parametric techniques (Section 6.1), and we also discuss a number of
weaknesses and possible improvements of t-SNE (Section 6.2).
9. In preliminary experiments, we found the performance of random walk t-SNE to be very robust under changes of k.
Figure 7: Visualization of 6,000 digits from the MNIST data set produced by the random walk
version of t-SNE (employing all 60,000 digit images).
where the constant outside of the sum is added in order to simplify the derivation of the gradient.
The main weakness of the Sammon cost function is that the importance of retaining small pairwise
distances in the map is largely dependent on small differences in these pairwise distances. In par-
ticular, a small error in the model of two high-dimensional points that are extremely close together
results in a large contribution to the cost function. Since all small pairwise distances constitute the
local structure of the data, it seems more appropriate to aim to assign approximately equal impor-
tance to all small pairwise distances.
In contrast to Sammon mapping, the Gaussian kernel employed in the high-dimensional space
by t-SNE defines a soft border between the local and global structure of the data and for pairs
of datapoints that are close together relative to the standard deviation of the Gaussian, the impor-
tance of modeling their separations is almost independent of the magnitudes of those separations.
Moreover, t-SNE determines the local neighborhood size for each datapoint separately based on the
local density of the data (by forcing each conditional probability distribution Pi to have the same
perplexity).
The strong performance of t-SNE compared to Isomap is partly explained by Isomap’s suscep-
tibility to “short-circuiting”. Also, Isomap mainly focuses on modeling large geodesic distances
rather than small ones.
The strong performance of t-SNE compared to LLE is mainly due to a basic weakness of LLE:
the only thing that prevents all datapoints from collapsing onto a single point is a constraint on the
covariance of the low-dimensional representation. In practice, this constraint is often satisfied by
placing most of the map points near the center of the map and using a few widely scattered points
to create large covariance (see Figure 3(b) and 4(d)). For neighborhood graphs that are almost
disconnected, the covariance constraint can also be satisfied by a “curdled” map in which there are
a few widely separated, collapsed subsets corresponding to the almost disconnected components.
Furthermore, neighborhood-graph based techniques (such as Isomap and LLE) are not capable of
visualizing data that consists of two or more widely separated submanifolds, because such data
does not give rise to a connected neighborhood graph. It is possible to produce a separate map for
each connected component, but this loses information about the relative similarities of the separate
components.
Like Isomap and LLE, the random walk version of t-SNE employs neighborhood graphs, but it
does not suffer from short-circuiting problems because the pairwise similarities between the high-
dimensional datapoints are computed by integrating over all paths through the neighborhood graph.
Because of the diffusion-based interpretation of the conditional probabilities underlying the random
walk version of t-SNE, it is useful to compare t-SNE to diffusion maps. Diffusion maps define a
“diffusion distance” on the high-dimensional datapoints that is given by
D^{(t)}(x_i, x_j) = \sqrt{\sum_k \frac{\left(p^{(t)}_{ik} - p^{(t)}_{jk}\right)^2}{\psi(x_k)^{(0)}}},
where p^{(t)}_{ij} represents the probability of a particle traveling from x_i to x_j in t timesteps through a
graph on the data with Gaussian emission probabilities. The term \psi(x_k)^{(0)} is a measure for the local
density of the points, and serves a similar purpose to the fixed perplexity Gaussian kernel that is em-
ployed in SNE. The diffusion map is formed by the principal non-trivial eigenvectors of the Markov
matrix of the random walks of length t. It can be shown that when all (n − 1) non-trivial eigenvec-
tors are employed, the Euclidean distances in the diffusion map are equal to the diffusion distances
in the high-dimensional data representation (Lafon and Lee, 2006). Mathematically, diffusion maps
minimize
C = \sum_i \sum_j \left( D^{(t)}(x_i, x_j) - \|y_i - y_j\| \right)^2.
As a result, diffusion maps are susceptible to the same problems as classical scaling: they assign
much higher importance to modeling the large pairwise diffusion distances than the small ones and
as a result, they are not good at retaining the local structure of the data. Moreover, in contrast to the
random walk version of t-SNE, diffusion maps do not have a natural way of selecting the length, t,
of the random walks.
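For small data sets, the diffusion distance above can be computed directly as sketched below. This is our own illustration; we take ψ(x_k)^{(0)} proportional to the degree of x_k (the stationary measure of the random walk), which is one common choice, and the cubic memory cost restricts the sketch to small n.

import numpy as np

def diffusion_distances(W, t):
    # Diffusion distances D^{(t)} from a symmetric affinity matrix W.
    d = W.sum(axis=1)
    P = W / d[:, None]                                        # Markov matrix of the random walk
    Pt = np.linalg.matrix_power(P, t)                         # t-step transition probabilities
    psi0 = d / d.sum()                                        # stationary measure, ~ local density
    diff = Pt[:, None, :] - Pt[None, :, :]                    # p^{(t)}_{ik} - p^{(t)}_{jk}
    return np.sqrt((diff ** 2 / psi0[None, None, :]).sum(-1))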
In the supplemental material, we present results that reveal that t-SNE outperforms CCA (De-
martines and Hérault, 1997), MVU (Weinberger et al., 2004), and Laplacian Eigenmaps (Belkin and
Niyogi, 2002) as well. For CCA and the closely related CDA (Lee et al., 2000), these results can
be partially explained by the hard border λ that these techniques define between local and global
structure, as opposed to the soft border of t-SNE. Moreover, within the range λ, CCA suffers from
the same weakness as Sammon mapping: it assigns extremely high importance to modeling the
distance between two datapoints that are extremely close.
Like t-SNE, MVU (Weinberger et al., 2004) tries to model all of the small separations well but
MVU insists on modeling them perfectly (i.e., it treats them as constraints) and a single erroneous
constraint may severely affect the performance of MVU. This can occur when there is a short-circuit
between two parts of a curved manifold that are far apart in the intrinsic manifold coordinates. Also,
MVU makes no attempt to model longer range structure: It simply pulls the map points as far apart
as possible subject to the hard constraints so, unlike t-SNE, it cannot be expected to produce sensible
large-scale structure in the map.
For Laplacian Eigenmaps, the poor results relative to t-SNE may be explained by the fact that
Laplacian Eigenmaps have the same covariance constraint as LLE, and it is easy to cheat on this
constraint.
6.2 Weaknesses
Although we have shown that t-SNE compares favorably to other techniques for data visualization, t-
SNE has three potential weaknesses: (1) it is unclear how t-SNE performs on general dimensionality
reduction tasks, (2) the relatively local nature of t-SNE makes it sensitive to the curse of the intrinsic
dimensionality of the data, and (3) t-SNE is not guaranteed to converge to a global optimum of its
cost function. Below, we discuss the three weaknesses in more detail.
1) Dimensionality reduction for other purposes. It is not obvious how t-SNE will perform on
the more general task of dimensionality reduction (i.e., when the dimensionality of the data is not
reduced to two or three, but to d > 3 dimensions). To simplify evaluation issues, this paper only
considers the use of t-SNE for data visualization. The behavior of t-SNE when reducing data to two
or three dimensions cannot readily be extrapolated to d > 3 dimensions because of the heavy tails
of the Student-t distribution. In high-dimensional spaces, the heavy tails comprise a relatively large
portion of the probability mass under the Student-t distribution, which might lead to d-dimensional
data representations that do not preserve the local structure of the data as well. Hence, for tasks
in which the dimensionality of the data needs to be reduced to a dimensionality higher than three,
Student t-distributions with more than one degree of freedom 10 are likely to be more appropriate.
2) Curse of intrinsic dimensionality. t-SNE reduces the dimensionality of data mainly based on
local properties of the data, which makes t-SNE sensitive to the curse of the intrinsic dimensional-
ity of the data (Bengio, 2007). In data sets with a high intrinsic dimensionality and an underlying
manifold that is highly varying, the local linearity assumption on the manifold that t-SNE implicitly
makes (by employing Euclidean distances between near neighbors) may be violated. As a result,
t-SNE might be less successful if it is applied on data sets with a very high intrinsic dimensional-
ity (for instance, a recent study by Meytlis and Sirovich (2007) estimates the space of images of
faces to be constituted of approximately 100 dimensions). Manifold learners such as Isomap and
LLE suffer from exactly the same problems (see, e.g., Bengio, 2007; van der Maaten et al., 2008).
A possible way to (partially) address this issue is by performing t-SNE on a data representation
obtained from a model that represents the highly varying data manifold efficiently in a number of
nonlinear layers such as an autoencoder (Hinton and Salakhutdinov, 2006). Such deep-layer archi-
tectures can represent complex nonlinear functions in a much simpler way, and as a result, require
fewer datapoints to learn an appropriate solution (as is illustrated for a d-bits parity task by Bengio
2007). Performing t-SNE on a data representation produced by, for example, an autoencoder is
likely to improve the quality of the constructed visualizations, because autoencoders can identify
highly-varying manifolds better than a local method such as t-SNE. However, the reader should note
that it is by definition impossible to fully represent the structure of intrinsically high-dimensional
data in two or three dimensions.
3) Non-convexity of the t-SNE cost function. A nice property of most state-of-the-art dimen-
sionality reduction techniques (such as classical scaling, Isomap, LLE, and diffusion maps) is the
convexity of their cost functions. A major weakness of t-SNE is that the cost function is not convex,
as a result of which several optimization parameters need to be chosen. The constructed solutions
depend on these choices of optimization parameters and may be different each time t-SNE is run
from an initial random configuration of map points. We have demonstrated that the same choice of
optimization parameters can be used for a variety of different visualization tasks, and we found that
the quality of the optima does not vary much from run to run. Therefore, we think that the weakness
of the optimization method is insufficient reason to reject t-SNE in favor of methods that lead to con-
vex optimization problems but produce noticeably worse visualizations. A local optimum of a cost
function that accurately captures what we want in a visualization is often preferable to the global
optimum of a cost function that fails to capture important aspects of what we want. Moreover, the
convexity of cost functions can be misleading, because their optimization is often computationally
infeasible for large real-world data sets, prompting the use of approximation techniques (de Silva
and Tenenbaum, 2003; Weinberger et al., 2007). Even for LLE and Laplacian Eigenmaps, the opti-
mization is performed using iterative Arnoldi (Arnoldi, 1951) or Jacobi-Davidson (Fokkema et al.,
1999) methods, which may fail to find the global optimum due to convergence problems.
7. Conclusions
The paper presents a new technique for the visualization of similarity data that is capable of retaining
the local structure of the data while also revealing some important global structure (such as clusters
10. Increasing the degrees of freedom of a Student-t distribution makes the tails of the distribution lighter. With infinite
degrees of freedom, the Student-t distribution is equal to the Gaussian distribution.
at multiple scales). Both the computational and the memory complexity of t-SNE are O(n²), but
we present a landmark approach that makes it possible to successfully visualize large real-world
data sets with limited computational demands. Our experiments on a variety of data sets show
that t-SNE outperforms existing state-of-the-art techniques for visualizing a variety of real-world
data sets. Matlab implementations of both the normal and the random walk version of t-SNE are
available for download at http://ticc.uvt.nl/~lvdrmaaten/tsne.
In future work we plan to investigate the optimization of the number of degrees of freedom of
the Student-t distribution used in t-SNE. This may be helpful for dimensionality reduction when
the low-dimensional representation has many dimensions. We will also investigate the extension of
t-SNE to models in which each high-dimensional datapoint is modeled by several low-dimensional
map points as in Cook et al. (2007). Also, we aim to develop a parametric version of t-SNE that
allows for generalization to held-out test data by using the t-SNE objective function to train a mul-
tilayer neural network that provides an explicit mapping to the low-dimensional space.
Acknowledgments
The authors thank Sam Roweis for many helpful discussions, Andriy Mnih for supplying the word-
features data set, Ruslan Salakhutdinov for help with the Netflix data set (results for these data sets
are presented in the supplemental material), and Guido de Croon for pointing us to the analytical
solution of the random walk probabilities.
Laurens van der Maaten is supported by the CATCH-programme of the Dutch Scientific Orga-
nization (NWO), project RICH (grant 640.002.401), and cooperates with RACM. Geoffrey Hinton
is a fellow of the Canadian Institute for Advanced Research, and is also supported by grants from
NSERC and CFI and gifts from Google and Microsoft.
In order to make the derivation less cluttered, we define two auxiliary variables d_{ij} and Z as follows:

d_{ij} = \|y_i - y_j\|, \qquad Z = \sum_{k \neq l} (1 + d_{kl}^2)^{-1}.
Note that if y_i changes, the only pairwise distances that change are d_{ij} and d_{ji} for all j. Hence, the
gradient of the cost function C with respect to y_i is given by

\frac{\delta C}{\delta y_i} = \sum_j \left( \frac{\delta C}{\delta d_{ij}} + \frac{\delta C}{\delta d_{ji}} \right)(y_i - y_j) = 2 \sum_j \frac{\delta C}{\delta d_{ij}}(y_i - y_j). \qquad (7)
The gradient \frac{\delta C}{\delta d_{ij}} is computed from the definition of the Kullback-Leibler divergence in Equation 6
(note that the first part of this equation is a constant):

\frac{\delta C}{\delta d_{ij}} = -\sum_{k \neq l} p_{kl} \frac{\delta (\log q_{kl})}{\delta d_{ij}}
= -\sum_{k \neq l} p_{kl} \frac{\delta (\log q_{kl} Z - \log Z)}{\delta d_{ij}}
= -\sum_{k \neq l} p_{kl} \left( \frac{1}{q_{kl} Z} \frac{\delta \left((1 + d_{kl}^2)^{-1}\right)}{\delta d_{ij}} - \frac{1}{Z} \frac{\delta Z}{\delta d_{ij}} \right).

The gradient \frac{\delta \left((1 + d_{kl}^2)^{-1}\right)}{\delta d_{ij}} is only nonzero when k = i and l = j. Hence, the gradient \frac{\delta C}{\delta d_{ij}} is given by

\frac{\delta C}{\delta d_{ij}} = 2 \frac{p_{ij}}{q_{ij} Z} (1 + d_{ij}^2)^{-2} - 2 \sum_{k \neq l} p_{kl} \frac{(1 + d_{ij}^2)^{-2}}{Z}.

Noting that \sum_{k \neq l} p_{kl} = 1 and that q_{ij} Z = (1 + d_{ij}^2)^{-1}, this simplifies to

\frac{\delta C}{\delta d_{ij}} = 2 p_{ij} (1 + d_{ij}^2)^{-1} - 2 q_{ij} (1 + d_{ij}^2)^{-1}
= 2 (p_{ij} - q_{ij})(1 + d_{ij}^2)^{-1}.

Substituting this expression into Equation 7 yields the gradient given in Equation 5.
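The derivation can be checked numerically by comparing the gradient of Equation 5 with central finite differences of the cost. The following self-contained sketch (our own, with hypothetical names) does exactly that.

import numpy as np

def tsne_cost_and_grad(P, Y):
    # Cost C = KL(P||Q) with the Student-t based Q, and the gradient of Equation 5.
    D = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    W = 1.0 / (1.0 + D)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    mask = P > 0
    cost = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    M = (P - Q) * W
    grad = 4.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)
    return cost, grad

def check_gradient(P, Y, eps=1e-6):
    # Compare the analytic gradient with central finite differences of the cost.
    _, grad = tsne_cost_and_grad(P, Y)
    num = np.zeros_like(Y)
    for i in range(Y.shape[0]):
        for a in range(Y.shape[1]):
            Yp, Ym = Y.copy(), Y.copy()
            Yp[i, a] += eps
            Ym[i, a] -= eps
            num[i, a] = (tsne_cost_and_grad(P, Yp)[0] - tsne_cost_and_grad(P, Ym)[0]) / (2 * eps)
    return np.max(np.abs(grad - num))                         # close to zero, up to finite-difference error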
It can be shown that computing the probability that a random walk initiated from a non-landmark
point (on a graph that is specified by adjacency matrix W ) first reaches a specific landmark point
is equal to computing the solution to the combinatorial Dirichlet problem in which the boundary
conditions are at the locations of the landmark points, the considered landmark point is fixed to
unity, and the other landmark points are set to zero (Kakutani, 1945; Doyle and Snell, 1984).
In practice, the solution can thus be obtained by minimizing the combinatorial formulation of the
Dirichlet integral
D[x] = \frac{1}{2} x^T L x,
where L represents the graph Laplacian. Mathematically, the graph Laplacian is given by L = D - W,
where D = \mathrm{diag}\left(\sum_j w_{1j}, \sum_j w_{2j}, \ldots, \sum_j w_{nj}\right). Without loss of generality, we may reorder the
datapoints such that the landmark points come first. As a result, the combinatorial Dirichlet
integral decomposes into
D[x_N] = \frac{1}{2} \begin{bmatrix} x_L^T & x_N^T \end{bmatrix} \begin{bmatrix} L_L & B \\ B^T & L_N \end{bmatrix} \begin{bmatrix} x_L \\ x_N \end{bmatrix}
= \frac{1}{2}\left( x_L^T L_L x_L + 2 x_N^T B^T x_L + x_N^T L_N x_N \right),
where we use the subscript ·L to indicate the landmark points, and the subscript ·N to indicate the
non-landmark points. Differentiating D[xN ] with respect to xN and finding its critical points amounts
to solving the linear systems
L_N x_N = -B^T. \qquad (8)
Please note that in this linear system, BT is a matrix containing the columns from the graph Lapla-
cian L that correspond to the landmark points (excluding the rows that correspond to landmark
points). After normalization of the solutions to the systems XN , the column vectors of XN contain
the probability that a random walk initiated from a non-landmark point terminates in a landmark
point. One should note that the linear system in Equation 8 is only nonsingular if the graph is com-
pletely connected, or if each connected component in the graph contains at least one landmark point
(Biggs, 1974).
Because we are interested in the probability of a random walk initiated from a landmark point
terminating at another landmark point, we duplicate all landmark points in the neighborhood graph,
and initiate the random walks from the duplicate landmarks. Because of memory constraints, it is
not possible to store the entire matrix XN into memory (note that we are only interested in a small
number of rows from this matrix, viz., in the rows corresponding to the duplicate landmark points).
Hence, we solve the linear systems defined by the columns of −B T one-by-one, and store only the
parts of the solutions that correspond to the duplicate landmark points. For computational reasons,
we first perform a Cholesky factorization of LN , such that LN = CCT , where C is an upper-triangular
matrix. Subsequently, the solution to the linear system in Equation 8 is obtained by solving the
linear systems Cy = -B^T and C^T x_N = y using a fast backsubstitution method.
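A sketch of this analytic route using SciPy's sparse solvers is given below. It is our own illustration with hypothetical names; it solves Equation 8 for the non-landmark nodes and omits the landmark-duplication step described above.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import spsolve

def landmark_hit_probabilities(W, landmark_idx):
    # Solve L_N X_N = -B^T (Equation 8) for the probability that a walk started at each
    # non-landmark node first reaches each landmark point. W is a symmetric adjacency
    # matrix of the neighbourhood graph (dense or sparse).
    n = W.shape[0]
    is_landmark = np.zeros(n, dtype=bool)
    is_landmark[landmark_idx] = True
    L = sp.csr_matrix(laplacian(sp.csr_matrix(W)))            # graph Laplacian L = D - W
    L_N = L[~is_landmark][:, ~is_landmark]                    # block over non-landmark nodes
    B_T = L[~is_landmark][:, is_landmark]                     # columns corresponding to landmarks
    X_N = spsolve(sp.csc_matrix(L_N), -B_T.toarray())         # one column per landmark point
    return X_N                                                # rows sum to one if each component has a landmark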
References
W.E. Arnoldi. The principle of minimized iteration in the solution of the matrix eigenvalue problem.
Quarterly of Applied Mathematics, 9:17–25, 1951.
G.D. Battista, P. Eades, R. Tamassia, and I.G. Tollis. Annotated bibliography on graph drawing.
Computational Geometry: Theory and Applications, 4:235–282, 1994.
M. Belkin and P. Niyogi. Laplacian Eigenmaps and spectral techniques for embedding and cluster-
ing. In Advances in Neural Information Processing Systems, volume 14, pages 585–591, Cam-
bridge, MA, USA, 2002. The MIT Press.
Y. Bengio. Learning deep architectures for AI. Technical Report 1312, Université de Montréal,
2007.
N. Biggs. Algebraic graph theory. In Cambridge Tracts in Mathematics, volume 67. Cambridge
University Press, 1974.
H. Chernoff. The use of faces to represent points in k-dimensional space graphically. Journal of the
American Statistical Association, 68:361–368, 1973.
J.A. Cook, I. Sutskever, A. Mnih, and G.E. Hinton. Visualizing similarity data with a mixture of
maps. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics,
volume 2, pages 67–74, 2007.
M.C. Ferreira de Oliveira and H. Levkowitz. From visual data exploration to visual data mining: A
survey. IEEE Transactions on Visualization and Computer Graphics, 9(3):378–394, 2003.
V. de Silva and J.B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction.
In Advances in Neural Information Processing Systems, volume 15, pages 721–728, Cambridge,
MA, USA, 2003. The MIT Press.
P. Demartines and J. Hérault. Curvilinear component analysis: A self-organizing neural network for
nonlinear mapping of data sets. IEEE Transactions on Neural Networks, 8(1):148–154, 1997.
P. Doyle and L. Snell. Random walks and electric networks. In Carus mathematical monographs,
volume 22. Mathematical Association of America, 1984.
D.R. Fokkema, G.L.G. Sleijpen, and H.A. van der Vorst. Jacobi–Davidson style QR and QZ algo-
rithms for the reduction of matrix pencils. SIAM Journal on Scientific Computing, 20(1):94–125,
1999.
L. Grady. Random walks for image segmentation. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 28(11):1768–1783, 2006.
G.E. Hinton and S.T. Roweis. Stochastic Neighbor Embedding. In Advances in Neural Information
Processing Systems, volume 15, pages 833–840, Cambridge, MA, USA, 2002. The MIT Press.
G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks.
Science, 313(5786):504–507, 2006.
R.A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:
295–307, 1988.
S. Kakutani. Markov processes and the Dirichlet problem. Proceedings of the Japan Academy, 21:
227–233, 1945.
D.A. Keim. Designing pixel-oriented visualization techniques: Theory and applications. IEEE
Transactions on Visualization and Computer Graphics, 6(1):59–78, 2000.
S. Lafon and A.B. Lee. Diffusion maps and coarse-graining: A unified framework for dimension-
ality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 28(9):1393–1403, 2006.
J.A. Lee and M. Verleysen. Nonlinear dimensionality reduction of data manifolds with essential
loops. Neurocomputing, 67:29–53, 2005.
J.A. Lee and M. Verleysen. Nonlinear dimensionality reduction. Springer, New York, NY, USA,
2007.
J.A. Lee, A. Lendasse, N. Donckers, and M. Verleysen. A robust nonlinear projection method. In
Proceedings of the 8th European Symposium on Artificial Neural Networks, pages 13–20, 2000.
K.V. Mardia, J.T. Kent, and J.M. Bibby. Multivariate Analysis. Academic Press, 1979.
M. Meytlis and L. Sirovich. On the dimensionality of face space. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 29(7):1262–1267, 2007.
B. Nadler, S. Lafon, R.R. Coifman, and I.G. Kevrekidis. Diffusion maps, spectral clustering and
the reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis:
Special Issue on Diffusion Maps and Wavelets, 21:113–127, 2006.
S.A. Nene, S.K. Nayar, and H. Murase. Columbia Object Image Library (COIL-20). Technical
Report CUCS-005-96, Columbia University, 1996.
S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by Locally Linear Embedding.
Science, 290(5500):2323–2326, 2000.
J.W. Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers,
18(5):401–409, 1969.
L. Song, A.J. Smola, K. Borgwardt, and A. Gretton. Colored Maximum Variance Unfolding. In
Advances in Neural Information Processing Systems, volume 21 (in press), 2007.
W.N. Street, W.H. Wolberg, and O.L. Mangasarian. Nuclear feature extraction for breast tumor
diagnosis. In Proceedings of the IS&T/SPIE International Symposium on Electronic Imaging:
Science and Technology, volume 1905, pages 861–870, 1993.
M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In Ad-
vances in Neural Information Processing Systems, volume 14, pages 945–952, 2001.
J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear
dimensionality reduction. Science, 290(5500):2319–2323, 2000.
L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik. Dimensionality reduction: A compar-
ative review. Online Preprint, 2008.
K.Q. Weinberger, F. Sha, and L.K. Saul. Learning a kernel matrix for nonlinear dimensionality
reduction. In Proceedings of the 21st International Conference on Machine Learning, 2004.
K.Q. Weinberger, F. Sha, Q. Zhu, and L.K. Saul. Graph Laplacian regularization for large-scale
semidefinite programming. In Advances in Neural Information Processing Systems, volume 19,
2007.
C.K.I. Williams. On a connection between Kernel PCA and metric multidimensional scaling. Ma-
chine Learning, 46(1-3):11–19, 2002.
X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and har-
monic functions. In Proceedings of the 20th International Conference on Machine Learning,
pages 912–919, 2003.