Figure 1. (a) The generative model used to learn the joint distribution of digit images and digit labels. (b) Some test images that the network classifies correctly even though
it has never seen them before.
Figure 2. The generative model used in factor analysis. Each real-valued hidden factor is chosen independently from a Gaussian distribution, N(0,1), with zero mean and unit variance. The factors are then linearly combined using weights, w_jk, and Gaussian observation noise with mean μ_i and standard deviation σ_i is added independently to each real-valued variable, i.
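As a concrete illustration of the generative process in Figure 2, the sketch below samples data from a factor analysis model. It is a minimal NumPy sketch; the layer sizes, the weight matrix W and the noise parameters mu and sigma are made-up illustrative values, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) model parameters: 3 hidden factors, 5 visible variables.
n_factors, n_visible = 3, 5
W = rng.normal(size=(n_factors, n_visible))   # linear combination weights w_jk
mu = np.zeros(n_visible)                      # observation means
sigma = 0.1 * np.ones(n_visible)              # observation noise standard deviations

def sample_factor_analysis(n_samples):
    """Generate data the way the factor analysis generative model does:
    draw each hidden factor from N(0, 1), combine the factors linearly,
    then add independent Gaussian noise to each visible variable."""
    h = rng.normal(size=(n_samples, n_factors))            # hidden factors ~ N(0, 1)
    noise = rng.normal(size=(n_samples, n_visible)) * sigma
    return h @ W + mu + noise                              # observed data

x = sample_factor_analysis(1000)
print(x.shape)  # (1000, 5)
```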
Figure 3. (a) A multilayer belief net composed of logistic binary units. To generate fantasies from the model, we start by picking a random binary state of 1 or 0 for each top-level unit. Then we perform a stochastic downwards pass in which the probability, h_i, of turning on each unit, i, is determined by applying the logistic function σ(x) = 1/(1 + exp(−x)) to the total input Σ_j h_j w_ji that i receives from the units, j, in the layer above, where h_j is the binary state that has already been chosen for unit j. It is easy to give each unit an additional bias, but to simplify this review biases will usually be ignored. r_ij is a recognition weight. (b) An illustration of explaining away in a simple logistic belief net containing two independent, rare, hidden causes that become highly anti-correlated when we observe the house jumping. The bias of −10 on the earthquake unit means that, in the absence of any observation, this unit is e^10 times more likely to be off than on. If the earthquake unit is on and the truck unit is off, the jump unit has a total input of 0, which means that it has an even chance of being on. This is a much better explanation of the observation that the house jumped than the odds of e^−20, which apply if neither of the hidden causes is active. But it is wasteful to turn on both hidden causes to explain the observation because the probability of them both happening is approximately e^−20.
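The top-down "fantasy" generation described in the Figure 3a caption takes only a few lines of code. The sketch below uses a hypothetical two-hidden-layer net with randomly initialized generative weights, purely for illustration; it performs the stochastic downwards pass in which each unit turns on with probability given by the logistic function of its top-down input (biases ignored, as in the review).

```python
import numpy as np

rng = np.random.default_rng(1)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative layer sizes and random generative weights (made-up values).
sizes = [20, 50, 100]   # top hidden layer, lower hidden layer, visible layer
weights = [rng.normal(scale=0.1, size=(sizes[k], sizes[k + 1]))
           for k in range(len(sizes) - 1)]

def generate_fantasy():
    """Stochastic top-down pass: pick random binary states for the top layer,
    then sample each lower layer from the logistic of its top-down input."""
    h = rng.integers(0, 2, size=sizes[0]).astype(float)   # random binary top-level states
    for W in weights:
        p = logistic(h @ W)                 # probability of turning on each unit below
        h = (rng.random(p.shape) < p).astype(float)
    return h                                # a fantasy visible vector

print(generate_fantasy()[:10])
```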
(Equation 1)
(Equation 2)
The combination of approximate inference for learning the generative weights and fantasies for learning the recognition weights is known as the wake-sleep algorithm [22].
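A minimal sketch of one wake-sleep update for a single hidden layer may make this division of labour concrete. The layer sizes, learning rate and delta-rule form below follow the standard description of the wake-sleep algorithm [22] rather than anything spelled out in this section, so treat the details as illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
logistic = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

n_vis, n_hid, lr = 100, 30, 0.05
W = rng.normal(scale=0.1, size=(n_hid, n_vis))   # generative weights (hidden -> visible)
R = rng.normal(scale=0.1, size=(n_vis, n_hid))   # recognition weights (visible -> hidden)

def wake_sleep_step(v_data):
    global W, R
    # Wake phase: use the recognition weights to infer binary hidden states from
    # the data, then adjust the generative weights so those hidden states become
    # better at reconstructing the data (a simple, local delta rule).
    h = sample(logistic(v_data @ R))
    v_prob = logistic(h @ W)
    W += lr * np.outer(h, v_data - v_prob)

    # Sleep phase: use the generative weights to produce a fantasy, then adjust
    # the recognition weights so they would recover the hidden states that
    # actually caused that fantasy.
    h_fantasy = sample(0.5 * np.ones(n_hid))      # biases ignored, so p(on) = 0.5
    v_fantasy = sample(logistic(h_fantasy @ W))
    h_prob = logistic(v_fantasy @ R)
    R += lr * np.outer(v_fantasy, h_fantasy - h_prob)

wake_sleep_step(sample(0.5 * np.ones(n_vis)))     # one update on made-up binary data
```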
Figure 4. (a) Two separate restricted Boltzmann machines (RBMs). The stochastic, binary variables in the hidden layer of each RBM are symmetrically connected to the
stochastic, binary variables in the visible layer. There are no connections within a layer. The higher-level RBM is trained by using the hidden activities of the lower RBM as
data. (b) The composite generative model produced by composing the two RBMs. Note that the connections in the lower layer of the composite generative model are
directed. The hidden states are still inferred by using bottom-up recognition connections, but these are no longer part of the generative model.
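The training and stacking shown in Figure 4 can be sketched as follows. This is a minimal NumPy sketch that trains each RBM with the one-step contrastive divergence (CD-1) rule of [29]; the layer sizes and data are made up, and biases are again omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
logistic = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def train_rbm(data, n_hidden, lr=0.05, epochs=10):
    """Train one RBM with one-step contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    for _ in range(epochs):
        for v0 in data:
            h0 = sample(logistic(v0 @ W))       # up: sample hidden states given the data
            v1 = sample(logistic(h0 @ W.T))     # down: reconstruct the visible layer
            h1 = logistic(v1 @ W)               # up again: hidden probabilities
            # CD-1 update: favour the data over its one-step reconstruction.
            W += lr * (np.outer(v0, h0) - np.outer(v1, h1))
    return W

# Greedy stacking as in Figure 4: train the first RBM on the data, then train a
# second RBM using the hidden activities of the first RBM as its data.
data = sample(0.5 * np.ones((200, 100)))        # made-up binary "data"
W1 = train_rbm(data, n_hidden=50)
hidden_activities = logistic(data @ W1)
W2 = train_rbm(sample(hidden_activities), n_hidden=25)
```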
References
1 Rumelhart, D.E. et al. (1986) Learning representations by back-propagating errors. Nature 323, 533–536
2 Hinton, G.E. and Salakhutdinov, R.R. (2006) Reducing the dimensionality of data with neural networks. Science 313, 504–507
3 Lee, T.S. et al. (1998) The role of the primary visual cortex in higher level vision. Vision Res. 38, 2429–2454
4 Felleman, D.J. and Van Essen, D.C. (1991) Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1, 1–47
5 Mumford, D. (1992) On the computational architecture of the neocortex. II. The role of cortico-cortical loops. Biol. Cybern. 66, 241–251
6 Dayan, P. and Abbott, L.F. (2001) Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems, MIT Press
7 Roweis, S. and Ghahramani, Z. (1999) A unifying review of linear Gaussian models. Neural Comput. 11, 305–345
8 Marks, T.K. and Movellan, J.R. (2001) Diffusion networks, products of experts, and factor analysis. In Proceedings of the International Conference on Independent Component Analysis (Lee, T.W. et al., eds), pp. 481–485, http://citeseer.ist.psu.edu/article/marks01diffusion.html
9 Bell, A.J. and Sejnowski, T.J. (1995) An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 7, 1129–1159
10 Hyvärinen, A. et al. (2001) Independent Component Analysis, Wiley
11 Bartlett, M.S. et al. (2002) Face recognition by independent component analysis. IEEE Trans. Neural Netw. 13, 1450–1464
12 Olshausen, B.A. and Field, D. (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609
13 Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann
14 Lewicki, M.S. and Sejnowski, T.J. (1997) Bayesian unsupervised learning of higher order structure. In Advances in Neural Information Processing Systems (Vol. 9) (Mozer, M.C. et al., eds), pp. 529–535, MIT Press
15 Hoyer, P.O. and Hyvärinen, A. (2002) A multi-layer sparse coding network learns contour coding from natural images. Vision Res. 42, 1593–1605
16 Portilla, J. et al. (2004) Image denoising using Gaussian scale mixtures in the wavelet domain. IEEE Trans. Image Process. 12, 1338–1351
17 Schwartz, O. et al. (2006) Soft mixer assignment in a hierarchical generative model of natural scene statistics. Neural Comput. 18, 2680–2718
18 Karklin, Y. and Lewicki, M.S. (2003) Learning higher-order structures in natural images. Network 14, 483–499
19 Cowell, R.G. et al. (2003) Probabilistic Networks and Expert Systems, Springer
20 O'Reilly, R.C. (1998) Six principles for biologically based computational models of cortical cognition. Trends Cogn. Sci. 2, 455–462
21 Hinton, G.E. and Zemel, R.S. (1994) Autoencoders, minimum description length, and Helmholtz free energy. Adv. Neural Inf. Process. Syst. 6, 3–10
22 Hinton, G.E. et al. (1995) The wake-sleep algorithm for self-organizing neural networks. Science 268, 1158–1161
23 Neal, R.M. and Hinton, G.E. (1998) A new view of the EM algorithm that justifies incremental, sparse and other variants. In Learning in Graphical Models (Jordan, M.I., ed.), pp. 355–368, Kluwer Academic Publishers
24 Jordan, M.I. et al. (1999) An introduction to variational methods for graphical models. Mach. Learn. 37, 183–233
25 Winn, J. and Jojic, N. (2005) LOCUS: learning object classes with unsupervised segmentation. In Tenth IEEE International Conference on Computer Vision (Vol. 1), pp. 756–763, IEEE Press
26 Bishop, C.M. (2006) Pattern Recognition and Machine Learning, Springer
27 Bishop, C.M. et al. (2002) VIBES: a variational inference engine for Bayesian networks. Adv. Neural Inf. Process. Syst. 15, 793–800
28 Hinton, G.E. (2007) Boltzmann Machines, Scholarpedia
29 Hinton, G.E. (2002) Training products of experts by minimizing contrastive divergence. Neural Comput. 14, 1771–1800
30 Hinton, G.E. et al. (2006) A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554
31 Decoste, D. and Schoelkopf, B. (2002) Training invariant support vector machines. Mach. Learn. 46, 161–190
32 Bengio, Y. et al. (2007) Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (Vol. 19) (Schoelkopf, B. et al., eds), pp. 153–160, MIT Press
33 Ranzato, M. et al. (2007) Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems (Vol. 19) (Schoelkopf, B. et al., eds), pp. 1137–1144, MIT Press
34 Sutskever, I. and Hinton, G.E. (2007) Learning multilevel distributed representations for high-dimensional sequences. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (Meila, M. and Shen, X., eds), pp. 544–551, SAIS
35 Taylor, G.W. et al. (2007) Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems (Vol. 19) (Schoelkopf, B. et al., eds), pp. 1345–1352, MIT Press
36 Roth, S. and Black, M.J. (2005) Fields of experts: a framework for learning image priors. In Computer Vision and Pattern Recognition (Vol. 2), pp. 860–867, IEEE Press
37 Ranzato, M. et al. (2007) A unified energy-based framework for unsupervised learning. In Proc. Eleventh International Conference on Artificial Intelligence and Statistics (Meila, M. and Shen, X., eds), pp. 368–376, SAIS