Exponential Expressivity in Deep Neural Networks Through Transient Chaos
Abstract
We combine Riemannian geometry with the mean field theory of high dimensional
chaos to study the nature of signal propagation in generic, deep neural networks
with random weights. Our results reveal an order-to-chaos expressivity phase
transition, with networks in the chaotic phase computing nonlinear functions whose
global curvature grows exponentially with depth but not width. We prove this
generic class of deep random functions cannot be efficiently computed by any shal-
low network, going beyond prior work restricted to the analysis of single functions.
Moreover, we formalize and quantitatively demonstrate the long conjectured idea
that deep networks can disentangle highly curved manifolds in input space into flat
manifolds in hidden space. Our theoretical analysis of the expressive power of deep
networks broadly applies to arbitrary nonlinearities, and provides a quantitative
underpinning for previously abstract notions about the geometry of deep functions.
1 Introduction
Deep feedforward neural networks, with multiple hidden layers, have achieved remarkable perfor-
mance across many domains [1–4]. A key factor thought to underlie their success is their high
expressivity. This informal notion has manifested itself primarily in two forms of intuition. The first
is that deep networks can compactly express highly complex functions over input space in a way that
shallow networks with one hidden layer and the same number of neurons cannot. The second piece
of intuition, which has captured the imagination of machine learning [5] and neuroscience [6] alike,
is that deep neural networks can disentangle highly curved manifolds in input space into flattened
manifolds in hidden space, to aid the performance of simple linear readouts. These intuitions, while
attractive, have been difficult to formalize mathematically, and thereby rigorously test.
For the first intuition, seminal works have exhibited examples of particular functions that can be
computed with a polynomial number of neurons (in the input dimension) in a deep network but
require an exponential number of neurons in a shallow network [7–11]. This raises a central open
question: are such functions merely rare curiosities, or is any function computed by a generic deep
network not efficiently computable by a shallow network? The theoretical techniques employed in
prior work both limited the applicability of theory to specific nonlinearities and dictated the particular
measure of deep functional complexity involved. For example, [7] focused on ReLU nonlinearities and the number of linear regions as a complexity measure, [8] focused on sum-product networks and the number of monomials as a complexity measure, and [12] focused on Pfaffian nonlinearities and topological measures of complexity, like the sum of Betti numbers of a decision boundary. However, see [13] for an interesting analysis of a general class of compositional functions. (Code to reproduce all results is available at https://github.com/ganguli-lab/deepchaos.) The limits of prior
theoretical techniques raise another central question: is there a unifying theoretical framework for
deep neural expressivity that is simultaneously applicable to arbitrary nonlinearities, generic networks,
and a natural, general measure of functional complexity?
Here we attack both central problems of deep neural expressivity by combining a very different
set of tools, namely Riemannian geometry [14] and dynamical mean field theory [15]. This novel
combination enables us to show that for very broad classes of nonlinearities, even random deep
neural networks can construct hidden internal representations whose global extrinsic curvature grows
exponentially with depth but not width. Our geometric framework enables us to quantitatively define
a notion of disentangling and verify this notion even in deep random networks. Furthermore, our
methods yield insights into the emergent, deterministic nature of signal propagation through large
random feedforward networks, revealing the existence of an order to chaos transition as a function of
the statistics of weights and biases. We find that the transient, finite depth evolution in the chaotic
regime underlies the origins of exponential expressivity in deep random networks.
In our companion paper [16], we study several related measures of expressivity in deep random
neural networks with piecewise linear activations.
Figure 1: Dynamics of the squared length q^l for a sigmoidal network (φ(h) = tanh(h)) with 1000 hidden units. (A) The iterative length map in (3) for 3 different σw at σb = 0.3. Theoretical predictions (solid lines) match well with individual network simulations (dots). Stars reflect fixed points q* of the map. (B) The iterative dynamics of the length map yields rapid convergence of q^l to its fixed point q*, independent of initial condition (lines = theory; dots = simulation). (C) q* as a function of σw and σb. (D) Number of iterations required to achieve ≤ 1% fractional deviation from the fixed point. The (σb, σw) pairs in (A,B) are marked with color-matched circles in (C,D).
for a derivation of (3). Intuitively, the integral over z in (3) replaces an average over the empirical distribution of h^l_i across neurons i in layer l at large layer width N_l.

The function V in (3) is an iterative variance, or length, map that predicts how the length of an input in (2) changes as it propagates through the network. This length map is plotted in Fig. 1A for the special case of a sigmoidal nonlinearity, φ(h) = tanh(h). For monotonic nonlinearities, this length map is a monotonically increasing, concave function whose intersections with the unity line determine its fixed points q*(σw, σb). For σb = 0 and σw < 1, the only intersection is at q* = 0. In this bias-free, small weight regime, the network shrinks all inputs to the origin. For σw > 1 and σb = 0, the q* = 0 fixed point becomes unstable and the length map acquires a second nonzero fixed point, which is stable. In this bias-free, large weight regime, the network expands small inputs and contracts large inputs. Also, for any nonzero bias σb, the length map has a single stable non-zero fixed point. In such a regime, even with small weights, the injected biases at each layer prevent signals from decaying to 0. The dynamics of the length map leads to rapid convergence of length to its fixed point with depth (Fig. 1B,D), often within only 4 layers. The fixed points q*(σw, σb) are shown in Fig. 1C.
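The length map is easy to evaluate numerically. The following is a minimal sketch (not the authors' released code), assuming the map $\mathcal{V}(q \mid \sigma_w, \sigma_b) = \sigma_w^2 \int \mathcal{D}z\, \phi(\sqrt{q}\,z)^2 + \sigma_b^2$ from (3)/(13), with the Gaussian integral approximated by Gauss–Hermite quadrature; the function names are illustrative.

```python
import numpy as np

def length_map(q, sigma_w, sigma_b, phi=np.tanh, n_quad=101):
    """One step of the iterative length map V(q | sigma_w, sigma_b) in (3)/(13)."""
    # probabilists' Gauss-Hermite nodes/weights; rescale so the weights sum to 1 (standard normal measure)
    z, w = np.polynomial.hermite_e.hermegauss(n_quad)
    w = w / np.sqrt(2.0 * np.pi)
    return sigma_w**2 * np.sum(w * phi(np.sqrt(q) * z)**2) + sigma_b**2

def fixed_point_length(sigma_w, sigma_b, q0=1.0, n_iter=50):
    """Iterate the length map from q0; q^l converges to q*(sigma_w, sigma_b) within a few steps."""
    q = q0
    for _ in range(n_iter):
        q = length_map(q, sigma_w, sigma_b)
    return q

if __name__ == "__main__":
    for sigma_w in (0.5, 2.5, 4.0):   # a small, intermediate, and large weight scale at sigma_b = 0.3
        print(sigma_w, fixed_point_length(sigma_w, 0.3))
```

Iterating from several different initial values q0 reproduces the rapid, initial-condition-independent convergence to q* seen in Fig. 1B.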
The dynamics of the two diagonal terms, q^l_11 and q^l_22, are each theoretically predicted by the length map in (3). We derive (see SM) a correlation map C that predicts the layer-wise dynamics of q^l_12:

$$q^l_{12} = \mathcal{C}(c^{l-1}_{12}, q^{l-1}_{11}, q^{l-1}_{22} \mid \sigma_w, \sigma_b) \equiv \sigma_w^2 \int \mathcal{D}z_1\, \mathcal{D}z_2\, \phi(u_1)\,\phi(u_2) + \sigma_b^2, \qquad (5)$$

$$u_1 = \sqrt{q^{l-1}_{11}}\, z_1, \qquad u_2 = \sqrt{q^{l-1}_{22}} \left[ c^{l-1}_{12}\, z_1 + \sqrt{1 - (c^{l-1}_{12})^2}\, z_2 \right],$$

where c^l_12 = q^l_12 (q^l_11 q^l_22)^{-1/2} is the correlation coefficient. Here z_1 and z_2 are independent standard Gaussian variables, while u_1 and u_2 are correlated Gaussian variables with covariance matrix ⟨u_a u_b⟩ = q^{l-1}_{ab}. Together, (3) and (5) constitute a theoretical prediction for the typical evolution of the geometry of 2 points in (4) in a fixed large network.
Analysis of these equations reveals an interesting order-to-chaos transition in the (σw, σb) plane. In particular, what happens to two nearby points as they propagate through the layers? Their relation to each other can be tracked by the correlation coefficient c^l_12 between the two points, which approaches a fixed point c*(σw, σb) at large depth. Since the length of each point rapidly converges to q*(σw, σb), as shown in Fig. 1BD, we can compute c* by simply setting q^l_11 = q^l_22 = q*(σw, σb) in (5) and dividing by q* to obtain an iterative correlation coefficient map, or C-map, for c^l_12:

$$c^l_{12} = \frac{1}{q^*}\, \mathcal{C}(c^{l-1}_{12}, q^*, q^* \mid \sigma_w, \sigma_b). \qquad (6)$$

This C-map is shown in Fig. 2A. It always has a fixed point at c* = 1, as can be checked by direct calculation. However, the stability of this fixed point depends on the slope of the map at 1, which is

$$\chi_1 \equiv \left. \frac{\partial c^l_{12}}{\partial c^{l-1}_{12}} \right|_{c=1} = \sigma_w^2 \int \mathcal{D}z \left[ \phi'\!\left(\sqrt{q^*}\, z\right) \right]^2. \qquad (7)$$

See SM for a derivation of (7). If the slope χ1 is less than 1, then the C-map lies above the unity line, the fixed point at 1 under the C-map in (6) is stable, and nearby points become more similar over time.
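The C-map and its slope are also straightforward to evaluate numerically. Below is a hedged sketch (hypothetical code, complementing the length-map sketch above): the C-map in (6) is estimated by Monte Carlo over the correlated Gaussian pair (u1, u2), and χ1 in (7) by quadrature; the phase boundary in Fig. 2D is where χ1 crosses 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def c_map(c, q_star, sigma_w, sigma_b, phi=np.tanh, n_samples=200_000):
    """One step of the correlation-coefficient map (6), estimated by Monte Carlo over (z1, z2)."""
    z1 = rng.standard_normal(n_samples)
    z2 = rng.standard_normal(n_samples)
    u1 = np.sqrt(q_star) * z1
    u2 = np.sqrt(q_star) * (c * z1 + np.sqrt(1.0 - c**2) * z2)
    q12 = sigma_w**2 * np.mean(phi(u1) * phi(u2)) + sigma_b**2
    return q12 / q_star

def chi1(q_star, sigma_w, dphi=lambda h: 1.0 - np.tanh(h)**2, n_quad=101):
    """Stretch factor chi_1 in (7): sigma_w^2 * E_z[ phi'(sqrt(q*) z)^2 ]."""
    z, w = np.polynomial.hermite_e.hermegauss(n_quad)
    w = w / np.sqrt(2.0 * np.pi)
    return sigma_w**2 * np.sum(w * dphi(np.sqrt(q_star) * z)**2)

# q_star can be obtained, e.g., from the length-map sketch above; chi1 > 1 marks the chaotic phase,
# where iterating c_map from any c < 1 converges to a fixed point c* < 1 rather than to 1.
```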
Figure 2: Dynamics of correlations, c^l_12, in a sigmoidal network with φ(h) = tanh(h). (A) The C-map in (6) for the same σw and σb = 0.3 as in Fig. 1A. (B) The C-map dynamics, derived from both theory, through (6) (solid lines), and numerical simulations of (1) with Nl = 1000 (dots). (C) Fixed points c* of the C-map. (D) The slope of the C-map at 1, χ1, partitions the space (black dotted line at χ1 = 1) into chaotic (χ1 > 1, c* < 1) and ordered (χ1 < 1, c* = 1) regions.
Conversely, if χ1 > 1 then this fixed point is unstable, and nearby points separate as they propagate through the layers. Thus we can intuitively understand χ1 as a multiplicative stretch factor. This intuition can be made precise by considering the Jacobian J^l_{ij} = W^l_{ij} φ'(h^{l−1}_j), evaluated at a point h^{l−1} with length q*. J^l is a linear approximation of the network map from layer l − 1 to l in the vicinity of h^{l−1}. Therefore a small random perturbation h^{l−1} + u will map to h^l + J^l u. The growth of the perturbation, ||J^l u||_2^2 / ||u||_2^2, becomes χ1(q*) after averaging over the random perturbation u, weight matrix W^l, and the Gaussian distribution of h^{l−1}_i across i. Thus χ1 directly reflects the typical multiplicative growth or shrinkage of a random perturbation across one layer.
The dynamics of the iterative C-map and its agreement with network simulations is shown in Fig.
2B. The correlation dynamics are much slower than the length dynamics because the C-map is closer
to the unity line (Fig. 2A) than the length map (Fig. 1A). Thus correlations typically take about 20
layers to approach the fixed point, while lengths need only 4. The fixed point c∗ and slope χ1 of
the C-map are shown in Fig. 2CD. For any fixed, finite σb , as σw increases three qualitative regions
occur. For small σw , c∗ = 1 is the only fixed point, and it is stable because χ1 < 1. In this strong
bias regime, any two input points converge to each other as they propagate through the network. As
σw increases, χ1 increases and crosses 1, destabilizing the c∗ = 1 fixed point. In this intermediate
regime, a new stable fixed point c* appears, which decreases as σw increases. Here a competition on equal footing between the weights and nonlinearities (which de-correlate inputs) and the biases (which correlate them) leads to a finite c*. At larger σw, the strong weights overwhelm the biases and
maximally de-correlate inputs to make them orthogonal, leading to a stable fixed point at c∗ = 0.
Thus the equation χ1 (σw , σb ) = 1 yields a phase transition boundary in the (σw , σb ) plane, separating
it into a chaotic (or ordered) phase, in which nearby points separate (or converge). In dynamical systems theory, the logarithm of χ1 is related to the well-known Lyapunov exponent, which is positive (or negative) for chaotic (or ordered) dynamics. However, in a feedforward network, the dynamics is truncated at a finite depth D, and hence constitutes a form of transient chaos.
4 The propagation of manifold geometry through deep networks
Now consider a 1 dimensional manifold x^0(θ) in input space, where θ is an intrinsic scalar coordinate on the manifold. This manifold propagates to a new manifold h^l(θ) = h^l(x^0(θ)) in the vector space of inputs to layer l. The typical geometry of the manifold in the l'th layer is summarized by q^l(θ1, θ2), which for any θ1 and θ2 is defined by (4) with the choice x^{0,a} = x^0(θ1) and x^{0,b} = x^0(θ2). The theory for the propagation of pairs of points applies to all pairs of points on the manifold, so intuitively, we expect that in the chaotic phase of a sigmoidal network, the manifold should in some sense de-correlate, and become more complex, while in the ordered phase the manifold should contract around a central point. This theoretical prediction of equations (3) and (5) is quantitatively confirmed in simulations in Fig. 3, when the input is a simple manifold, the circle, $h^1(\theta) = \sqrt{N_1 q}\,\left[u^0 \cos(\theta) + u^1 \sin(\theta)\right]$, where u^0 and u^1 form an orthonormal basis for a 2 dimensional subspace of R^{N_1} in which the circle lives. The scaling is chosen so that each neuron has input activity O(1). Also, for simplicity, we choose the fixed point radius q = q* in Fig. 3.
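A minimal simulation of this setup (a hypothetical numpy sketch; the paper's released code uses Keras/Theano): embed the circle at radius q* in a random 2D subspace of R^{N_1} and push it through a deep random tanh network, recording the inputs h^l(θ) at every layer.

```python
import numpy as np

def random_circle(n_theta, N1, q_star, rng):
    """Circle h^1(theta) = sqrt(N1 q*) [u0 cos(theta) + u1 sin(theta)] in a random 2D subspace of R^N1."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    basis, _ = np.linalg.qr(rng.standard_normal((N1, 2)))        # orthonormal u0, u1
    h1 = np.sqrt(N1 * q_star) * (np.outer(np.cos(theta), basis[:, 0])
                                 + np.outer(np.sin(theta), basis[:, 1]))
    return theta, h1                                             # h1 has shape (n_theta, N1)

def propagate(h1, depth, sigma_w, sigma_b, rng, phi=np.tanh):
    """Propagate every point on the manifold through a random deep network, as in (1)."""
    layers, h = [h1], h1
    N = h1.shape[1]
    for _ in range(depth):
        W = rng.standard_normal((N, N)) * sigma_w / np.sqrt(N)   # W_ij ~ N(0, sigma_w^2 / N)
        b = rng.standard_normal(N) * sigma_b                     # b_i ~ N(0, sigma_b^2)
        h = phi(h) @ W.T + b                                     # h^l = W phi(h^{l-1}) + b
        layers.append(h)
    return layers

rng = np.random.default_rng(0)
theta, h1 = random_circle(n_theta=1000, N1=1000, q_star=1.0, rng=rng)   # q_star value illustrative
layers = propagate(h1, depth=10, sigma_w=4.0, sigma_b=0.3, rng=rng)
```

Projecting each layers[l] onto its top principal components and inspecting its singular value spectrum should reproduce the qualitative behavior in Fig. 3A, and q^l(θ1, θ2) can be estimated directly as layers[l] @ layers[l].T / N1.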
Figure 3: Propagating a circle through three random sigmoidal networks with varying σw and fixed σb = 0.3. (A) Projection of hidden inputs of simulated networks at layer 5 and 10 onto their first three principal components. Insets show the fraction of variance explained by the first 5 singular values. For large weights (bottom), the distribution of singular values gets flatter and the projected curve is more tangled. (B) The autocorrelation, $c^l_{12}(\Delta\theta) = \int d\theta\, q^l(\theta, \theta + \Delta\theta)/q^*$, of hidden inputs as a function of layer for simulated networks. (C) The theoretical predictions from (6) (solid lines) compared to the average (dots) and standard deviation across θ (shaded) in a simulated network.
To quantitatively understand the layer-wise growth of complexity of this manifold, it is useful to turn
to concepts in Riemannian geometry [14]. First, at each point θ, the manifold h(θ) (we temporarily
suppress the layer index l) has a tangent, or velocity vector v(θ) = ∂θ h(θ). Intuitively, curvature
is related to how quickly this tangent vector rotates in the ambient space RN as one moves along
the manifold, or in essence the acceleration vector a(θ) = ∂θ v(θ). Now at each point θ, when both
are nonzero, v(θ) and a(θ) span a 2 dimensional subspace of RN . Within this subspace, there is a
unique circle of radius R(θ) that has the same position, velocity and acceleration vector as the curve
h(θ) at θ. This circle is known as the osculating circle (Fig. 4A), and the extrinsic curvature κ(θ) of
the curve is defined as κ(θ) = 1/R(θ). Thus, intuitively, small radii of curvature R(θ) imply high
extrinsic curvature κ(θ). The extrinsic curvature of a curve depends only on its image in R^N and is invariant with respect to the particular parameterization θ → h(θ). For any parameterization, an explicit expression for κ(θ) is given by

$$\kappa(\theta) = (\mathbf{v} \cdot \mathbf{v})^{-3/2} \sqrt{(\mathbf{v} \cdot \mathbf{v})(\mathbf{a} \cdot \mathbf{a}) - (\mathbf{v} \cdot \mathbf{a})^2} \quad [14].$$

Note that under a unit speed parameterization of the curve, so that v(θ) · v(θ) = 1, we have v(θ) · a(θ) = 0, and κ(θ) is simply the norm of the acceleration vector.
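For a densely sampled curve, κ(θ) can be estimated directly from this formula with finite differences; a hypothetical sketch:

```python
import numpy as np

def extrinsic_curvature(h, dtheta):
    """Extrinsic curvature kappa(theta) of a closed curve sampled uniformly in theta.

    h has shape (n_theta, N); velocity and acceleration come from periodic central differences,
    and kappa = (v.v)^(-3/2) sqrt((v.v)(a.a) - (v.a)^2).
    """
    v = (np.roll(h, -1, axis=0) - np.roll(h, 1, axis=0)) / (2.0 * dtheta)
    a = (np.roll(h, -1, axis=0) - 2.0 * h + np.roll(h, 1, axis=0)) / dtheta**2
    vv = np.sum(v * v, axis=1)
    aa = np.sum(a * a, axis=1)
    va = np.sum(v * a, axis=1)
    return vv**-1.5 * np.sqrt(np.maximum(vv * aa - va**2, 0.0))
```

Applied to the circle h^1(θ) above, this returns κ ≈ 1/√(N1 q), the inverse of the radius of curvature.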
Figure 4: Propagation of extrinsic curvature and length in a network with 1000 hidden units. (A)
An osculating circle. (B) A curve with unit tangent vectors at 4 points in ambient space, and
the image of these points under the Gauss map. (C-E) Propagation of curvature metrics based
on both theory derived from iterative maps in (3), (6) and (8) (solid lines) and simulations using
(1) (dots). (F) Schematic of the normal vector, tangent plane, and principal curvatures for a 2D
manifold embedded in R^3. (G) Average of the largest 4 and smallest 4 principal curvatures (κ_{±1}, . . . , κ_{±4}) across locations θ within one network. The principal curvatures all grow
exponentially as we backpropagate to the input layer. Panels F,G are discussed in Sec. 5.
Another measure of the curve's complexity is the length L^E of its image in the ambient Euclidean space. The Euclidean metric in R^N induces a metric g^E(θ) = v(θ) · v(θ) on the curve, so that the distance dL^E moved in R^N as one moves from θ to θ + dθ on the curve is $dL^E = \sqrt{g^E(\theta)}\, d\theta$. The total curve length is $L^E = \int \sqrt{g^E(\theta)}\, d\theta$. However, even straight line segments can have a large Euclidean length. Another interesting measure of length, which takes curvature into account, is the length of the image of the curve under the Gauss map. For a K dimensional manifold M embedded in R^N, the Gauss map (Fig. 4B) maps a point θ ∈ M to its K dimensional tangent plane T_θM ∈ G_{K,N}, where G_{K,N} is the Grassmannian manifold of all K dimensional subspaces in R^N. In the special case of K = 1, G_{K,N} is the sphere S^{N−1} with antipodal points identified, since a 1-dimensional subspace can be identified with a unit vector, modulo sign. The Gauss map takes a point θ on the curve and maps it to the unit velocity vector $\hat{v}(\theta) = v(\theta)/\sqrt{v(\theta) \cdot v(\theta)}$. In particular, the natural metric on S^{N−1} induces a Gauss metric on the curve, given by $g^G(\theta) = (\partial_\theta \hat{v}(\theta)) \cdot (\partial_\theta \hat{v}(\theta))$, which measures how quickly the unit tangent vector v̂(θ) changes as θ changes. Thus the distance dL^G moved in the Grassmannian G_{K,N} as one moves from θ to θ + dθ on the curve is $dL^G = \sqrt{g^G(\theta)}\, d\theta$, and the length of the curve under the Gauss map is $L^G = \int \sqrt{g^G(\theta)}\, d\theta$. Furthermore, the Gauss metric is related to the extrinsic curvature and the Euclidean metric via the relation g^G(θ) = κ(θ)² g^E(θ) [14].

To illustrate these concepts, it is useful to compute all of them for the circle h^1(θ) defined above: $g^E(\theta) = Nq$, $L^E = 2\pi\sqrt{Nq}$, $\kappa(\theta) = 1/\sqrt{Nq}$, $g^G(\theta) = 1$, and $L^G = 2\pi$. As expected, κ(θ) is the inverse of the radius of curvature, which is √(Nq). Now consider how these quantities change if the circle is scaled up so that h(θ) → χ h(θ). The length L^E and radius scale up by χ, but the curvature κ scales down as χ^{−1}, and so L^G does not change. Thus linear expansion increases length and decreases curvature, thereby maintaining constant Grassmannian length L^G.
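The same finite-difference approach gives both lengths for a sampled curve (again a hypothetical sketch, assuming uniform sampling in θ over a closed curve):

```python
import numpy as np

def curve_lengths(h, dtheta):
    """Euclidean length L^E and Gauss-map length L^G of a closed sampled curve h of shape (n_theta, N).

    g^E = v.v and g^G = |d v_hat / d theta|^2, with L = sum sqrt(g) * dtheta over the theta grid.
    """
    v = (np.roll(h, -1, axis=0) - np.roll(h, 1, axis=0)) / (2.0 * dtheta)   # velocity
    gE = np.sum(v * v, axis=1)                                              # Euclidean metric
    v_hat = v / np.sqrt(gE)[:, None]                                        # unit tangent (Gauss map image)
    dv_hat = (np.roll(v_hat, -1, axis=0) - np.roll(v_hat, 1, axis=0)) / (2.0 * dtheta)
    gG = np.sum(dv_hat * dv_hat, axis=1)                                    # Gauss metric = kappa^2 g^E
    return np.sum(np.sqrt(gE)) * dtheta, np.sum(np.sqrt(gG)) * dtheta
```

For the circle above this returns L^E ≈ 2π√(Nq) and L^G ≈ 2π; applied layer by layer to the propagated manifold, it gives empirical counterparts of quantities such as those tracked in Fig. 4C-E.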
We now show that nonlinear propagation of this same circle through a deep network can behave very differently from linear expansion: in the chaotic regime, length can increase without any decrease in extrinsic curvature! To remove the scaling with N in the above quantities, we will work with the renormalized quantities $\bar\kappa = \sqrt{N}\,\kappa$, $\bar g^E = \frac{1}{N} g^E$, and $\bar L^E = \frac{1}{\sqrt{N}} L^E$. Thus 1/κ̄² can be thought of as a radius of curvature squared per neuron of the osculating circle, while (L̄^E)² is the squared Euclidean length of the curve per neuron. For the circle, these quantities are q and (2π)²q respectively. For simplicity, in the inputs to the first layer of neurons, we begin with a circle h^1(θ) with squared radius per neuron q^1 = q*, so this radius is already at the fixed point of the length map in (3). In the SM, we derive an iterative formula for the extrinsic curvature and Euclidean metric of this manifold as it propagates through the layers of a deep network:

$$\bar g^{E,l} = \chi_1\, \bar g^{E,l-1}, \qquad (\bar\kappa^l)^2 = 3\,\frac{\chi_2}{\chi_1^2} + \frac{1}{\chi_1}\,(\bar\kappa^{l-1})^2, \qquad \bar g^{E,1} = q^*, \qquad (\bar\kappa^1)^2 = 1/q^*, \qquad (8)$$

where χ1 is the stretch factor defined in (7) and χ2 is defined analogously as

$$\chi_2 = \sigma_w^2 \int \mathcal{D}z \left[ \phi''\!\left(\sqrt{q^*}\, z\right) \right]^2. \qquad (9)$$
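Given χ1 and χ2 (both computable by quadrature, as χ1 in the earlier sketch, with φ' replaced by φ''), the recursion (8) can be iterated directly; a hypothetical sketch:

```python
def curvature_recursion(chi1, chi2, q_star, depth):
    """Iterate (8): gbar^{E,l} = chi1 * gbar^{E,l-1},  kbar2^l = 3*chi2/chi1^2 + kbar2^{l-1}/chi1."""
    gbar, kbar2 = q_star, 1.0 / q_star                 # initial conditions for the circle at radius q*
    history = [(gbar, kbar2)]
    for _ in range(depth):
        gbar = chi1 * gbar
        kbar2 = 3.0 * chi2 / chi1**2 + kbar2 / chi1
        history.append((gbar, kbar2))
    return history
```

In the chaotic regime (χ1 > 1), ḡ^E grows by a factor χ1 per layer while κ̄² approaches the nonzero fixed point 3χ2 / (χ1(χ1 − 1)), so curves lengthen exponentially without flattening.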
Figure 5: Deep networks in the chaotic regime are more expressive than shallow networks. (A) Activity of four different neurons in the output layer as a function of the input θ, for three networks of different depth (width Nl = 1000). (B) Linear regression of the output activity onto a random function (black) shows closer predictions (blue) for deeper networks (bottom) than for shallow networks (top). (C) Decomposing the prediction error by frequency shows that shallow networks cannot capture high frequency content in random functions but deep networks can (yellow = high error). (D) Increasing the width of a one hidden layer network up to 10,000 does not decrease error at high frequencies.
Theorem 1. Suppose φ(h) is monotonically non-decreasing with bounded dynamic range R, i.e. max_h φ(h) − min_h φ(h) = R. Further suppose that x^0(θ) is a curve in input space such that no 1D projection of ∂θ x^0(θ) changes sign more than s times over the range of θ. Then for any choice of W^1 and b^1, the Euclidean length of x^1(θ) satisfies L^E ≤ N_1 (1 + s) R.

For the circle input, s = 1, and for the tanh nonlinearity, R = 2, so in this special case the normalized length satisfies L̄^E ≤ 2√N_1. In contrast, for deep networks in the chaotic regime L̄^E grows exponentially with depth in h space, and so consequently also in x space. Therefore the length of curves typically expands exponentially with depth even for random deep networks, but can only expand as the square root of width, no matter what shallow network is chosen. Moreover, as we have seen above, it is the exponential growth of L̄^E that fundamentally drives the exponential growth of L̄^G with depth. Indeed, shallow random networks exhibit minimal growth in expressivity even at large widths (Figure 5D).
We have focused so far on how simple manifolds in input space can acquire both exponential
Euclidean and Grassmannian length with depth, thereby exponentially de-correlating and filling up
hidden representation space. Another natural question is how the complexity of a decision boundary
grows as it is backpropagated to the input layer. Consider a linear classifier y = sgn(β · xD − β0 )
acting on the final layer. In this layer, the N − 1 dimensional decision boundary is the hyperplane
β ·xD −β0 = 0. However, in the input layer x0 , the decision boundary is a curved N −1 dimensional
manifold M that arises as the solution set of the nonlinear equation G(x0 ) ≡ β · xD (x0 ) − β0 = 0,
where xD (x0 ) is the nonlinear feedforward map from input to output.
At any point x* on the decision boundary in layer l, the gradient ∇G is perpendicular to the N − 1 dimensional tangent plane T_{x*}M (see Fig. 4F). The normal vector ∇G, along with any unit tangent vector v̂ ∈ T_{x*}M, spans a 2 dimensional subspace whose intersection with M yields a geodesic curve in M passing through x* with velocity vector v̂. This geodesic will have extrinsic curvature κ(x*, v̂). Maximizing this curvature over v̂ yields the first principal curvature κ_1(x*). A sequence of successive maximizations of κ(x*, v̂), while constraining v̂ to be perpendicular to all previous solutions, yields the sequence of principal curvatures κ_1(x*) ≥ κ_2(x*) ≥ · · · ≥ κ_{N−1}(x*). These principal curvatures arise as the eigenvalues of a normalized Hessian operator projected onto the tangent plane T_{x*}M: $\mathbf{H} = \|\nabla G\|_2^{-1}\, \mathbf{P}\, \frac{\partial^2 G}{\partial \mathbf{x}\, \partial \mathbf{x}^T}\, \mathbf{P}$, where $\mathbf{P} = \mathbf{I} - \widehat{\nabla G}\, \widehat{\nabla G}^T$ is the projection operator onto T_{x*}M and $\widehat{\nabla G}$ is the unit normal vector [14]. Intuitively, near x*, the decision boundary M can be approximated as a paraboloid with a quadratic form H whose N − 1 eigenvalues are the principal curvatures κ_1, . . . , κ_{N−1} (Fig. 4F).
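A hedged sketch of this computation, assuming the caller supplies the gradient and Hessian of G at a point on the boundary (e.g., from automatic differentiation); the function name and interface are illustrative:

```python
import numpy as np

def principal_curvatures(grad_G, hess_G):
    """Principal curvatures of the level set G(x) = 0 at a point.

    grad_G: gradient of G at the point, shape (N,).
    hess_G: Hessian of G at the point, shape (N, N).
    Builds H = P (d^2 G / dx dx^T) P / ||grad G|| with P the tangent-plane projector,
    then returns the N-1 eigenvalues orthogonal to the normal direction, sorted descending.
    """
    n = grad_G / np.linalg.norm(grad_G)                  # unit normal
    P = np.eye(len(grad_G)) - np.outer(n, n)             # projector onto the tangent plane
    H = P @ hess_G @ P / np.linalg.norm(grad_G)
    evals, evecs = np.linalg.eigh(H)
    normal_idx = np.argmax(np.abs(evecs.T @ n))          # H n = 0, so this eigenvalue is the trivial one
    keep = np.delete(np.arange(len(evals)), normal_idx)
    return np.sort(evals[keep])[::-1]                    # kappa_1 >= kappa_2 >= ... >= kappa_{N-1}
```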
We compute these curvatures numerically as a function of depth in Fig. 4G (see SM for details).
We find, remarkably, that a subset of the principal curvatures grows exponentially with depth. Here the principal curvatures are signed, with positive (negative) curvature indicating that the associated geodesic curves towards (away from) the normal vector ∇G. Thus the decision boundary can
become exponentially curved with depth, enabling highly complex classifications. Moreover, this
exponentially curved boundary is disentangled and mapped to a flat boundary in the output layer.
7 Discussion
Fundamentally, neural networks compute nonlinear maps between high dimensional spaces, for
example from R^{N_1} to R^{N_D}, and it is unclear what the most appropriate mathematics is for under-
standing such daunting spaces of maps. Previous works have attacked this problem by restricting
the nature of the nonlinearity involved (e.g. piecewise linear, sum-product, or Pfaffian) and thereby
restricting the space of maps to those amenable to special theoretical analysis methods (combinatorics,
polynomial relations, or topological invariants). We have begun a preliminary exploration of the
expressivity of such deep functions based on Riemannian geometry and dynamical mean field theory.
We demonstrate that networks in a chaotic phase compactly exhibit functions that exponentially grow
the global curvature of simple one dimensional manifolds from input to output and the local curvature
of simple co-dimension one manifolds from output to input. The former captures the notion that deep
neural networks can efficiently compute highly expressive functions in ways that shallow networks
cannot, while the latter quantifies and demonstrates the power of deep neural networks to disentangle
curved input manifolds, an attractive idea that has eluded formal quantification.
Moreover, our analysis of a maximum entropy distribution over deep networks constitutes an im-
portant null model of deep signal propagation that can be used to assess and understand different
behavior in trained networks. For example, the metrics we have adapted from Riemannian geometry,
combined with an understanding of their behavior in random networks, may provide a basis for
understanding what is special about trained networks. Furthermore, while we have focused on the
notion of input-output chaos, the duality between inputs and synaptic weights implies a form of weight
chaos, in which deep neural networks rapidly traverse function space as weights change (see SM).
Indeed, just as autocorrelation lengths between outputs as a function of inputs shrink exponentially
with depth, so too will autocorrelations between outputs as a function of weights.
But more generally, to understand functions, we often look to their graphs. The graph of a map from
R^{N_1} → R^{N_D} is an N_1 dimensional submanifold of R^{N_1 + N_D}, and therefore has both high dimension
and co-dimension. We speculate that many of the secrets of deep learning may be uncovered by
studying the geometry of this graph as a Riemannian manifold, and understanding how it changes
with both depth and learning.
References
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems, pages
1097–1105, 2012.
[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan
Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602, 2013.
[3] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan
Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up
end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
[4] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J
Guibas, and Jascha Sohl-Dickstein. Deep knowledge tracing. In Advances in Neural Information
Processing Systems, pages 505–513, 2015.
[5] Yoshua Bengio, Aaron Courville, and Pierre Vincent. Representation learning: A review and
new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):
1798–1828, 2013.
[6] James J DiCarlo and David D Cox. Untangling invariant object recognition. Trends in cognitive
sciences, 11(8):333–341, 2007.
[7] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of
linear regions of deep neural networks. In Advances in neural information processing systems,
pages 2924–2932, 2014.
[8] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in
Neural Information Processing Systems, pages 666–674, 2011.
[9] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv
preprint arXiv:1512.03965, 2015.
[10] Matus Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint
arXiv:1509.08101, 2015.
[11] James Martens, Arkadev Chattopadhya, Toni Pitassi, and Richard Zemel. On the representational
efficiency of restricted boltzmann machines. In Advances in Neural Information Processing
Systems, pages 2877–2885, 2013.
[12] Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A
comparison between shallow and deep architectures. Neural Networks and Learning Systems,
IEEE Transactions on, 25(8):1553–1565, 2014.
[13] Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. Learning real and boolean functions:
When is deep better than shallow. arXiv preprint arXiv:1603.00988, 2016.
[14] John M Lee. Riemannian manifolds: an introduction to curvature, volume 176. Springer
Science & Business Media, 2006.
[15] Haim Sompolinsky, A Crisanti, and HJ Sommers. Chaos in random neural networks. Physical
Review Letters, 61(3):259, 1988.
[16] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the
expressive power of deep neural networks. arXiv preprint, 2016.
Supplementary Material
Below is a series of appendices giving derivations of results in the main paper, followed by details of
results along with more visualizations.
Note: code to programmatically reproduce all plots in the paper in Jupyter notebooks will be released
upon publication.
As a single input point x^0 propagates through the network, its length in downstream layers can either grow or shrink. To track the propagation of this length, we track the normalized squared length of the input vector at each layer,

$$q^l = \frac{1}{N_l} \sum_{i=1}^{N_l} (h^l_i)^2. \qquad (11)$$
This length is the second moment of the empirical distribution of inputs h^l_i across all N_l neurons in layer l for a fixed set of weights. This empirical distribution is expected to be Gaussian for large N_l, since each individual h^l_i = w^{l,i} · φ(h^{l−1}) + b^l_i is Gaussian distributed, as a sum of a large number of independent random variables, and each h^l_i is independent of h^l_j for i ≠ j because the synaptic weight vectors and biases into each neuron are chosen independently.

While the mean of this Gaussian is 0, its variance can be computed by considering the variance of the input to a single neuron:

$$q^l = \left\langle (h^l_i)^2 \right\rangle = \left\langle \left( \mathbf{w}^{l,i} \cdot \phi(\mathbf{h}^{l-1}) \right)^2 \right\rangle + \left\langle (b^l_i)^2 \right\rangle = \sigma_w^2\, \frac{1}{N_{l-1}} \sum_{i=1}^{N_{l-1}} \phi(h^{l-1}_i)^2 + \sigma_b^2, \qquad (12)$$

where ⟨·⟩ denotes an average over the distribution of weights and biases into neuron i at layer l. Here we have used the identity ⟨w^{l,i}_j w^{l,i}_k⟩ = δ_{jk} σ_w²/N_{l−1}. Now the empirical distribution of inputs across layer l − 1 is also Gaussian, with mean zero and variance q^{l−1}. Therefore we can replace the average over neurons in layer l − 1 in (12) with an integral over a Gaussian random variable, obtaining

$$q^l = \mathcal{V}(q^{l-1} \mid \sigma_w, \sigma_b) \equiv \sigma_w^2 \int \mathcal{D}z\, \phi\!\left(\sqrt{q^{l-1}}\, z\right)^2 + \sigma_b^2, \qquad \text{for } l = 2, \ldots, D, \qquad (13)$$

where $\mathcal{D}z = \frac{dz}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}$ is the standard Gaussian measure, and the initial condition for the variance map is q^1 = σ_w² q^0 + σ_b², where q^0 = (1/N_0) x^0 · x^0 is the length in the initial activity layer. The function V in (13) is an iterative variance map that predicts how the length of an input in (11) changes as it propagates through the network. Its derivation relies on the well-known self-averaging assumption in the statistical physics of disordered systems, which, in our context, means that the empirical distribution of inputs across neurons for a fixed network converges, at large width, to the distribution of inputs to a single neuron across random networks.
A.2 Derivation of a correlation map for the propagation of two points
Now consider the layer-wise propagation of two inputs x^{0,1} and x^{0,2}. The geometry of these two inputs as they propagate through the layers is captured by the 2 by 2 matrix of inner products

$$q^l_{ab} = \frac{1}{N_l} \sum_{i=1}^{N_l} h^l_i(\mathbf{x}^{0,a})\, h^l_i(\mathbf{x}^{0,b}), \qquad a, b \in \{1, 2\}. \qquad (14)$$

The joint empirical distribution of h^l_i(x^{0,a}) and h^l_i(x^{0,b}) across i at large N_l will converge to a 2 dimensional Gaussian distribution with covariance q^l_{ab}. Propagating this joint distribution forward one layer, using ideas similar to the derivation above for 1 input, yields

$$q^l_{12} = \mathcal{C}(c^{l-1}_{12}, q^{l-1}_{11}, q^{l-1}_{22} \mid \sigma_w, \sigma_b) \equiv \sigma_w^2 \int \mathcal{D}z_1\, \mathcal{D}z_2\, \phi(u_1)\, \phi(u_2) + \sigma_b^2, \qquad (15)$$

$$u_1 = \sqrt{q^{l-1}_{11}}\, z_1, \qquad u_2 = \sqrt{q^{l-1}_{22}} \left[ c^{l-1}_{12}\, z_1 + \sqrt{1 - (c^{l-1}_{12})^2}\, z_2 \right],$$

where $c^l_{12} = q^l_{12} / \sqrt{q^l_{11}\, q^l_{22}}$ is the correlation coefficient (CC). Here z_1 and z_2 are independent standard Gaussian variables, while u_1 and u_2 are correlated Gaussian variables with covariance matrix ⟨u_a u_b⟩ = q^{l−1}_{ab}. The integration over z_1 and z_2 can be thought of as the large N_l limit of sums over h^l_i(x^{0,a}) and h^l_i(x^{0,b}).
When both input points are at their fixed point length q*, the dynamics of their correlation coefficient can be obtained by simply setting q^l_11 = q^l_22 = q*(σw, σb) in (15) and dividing by q* to obtain a recursion relation for c^l_12:

$$c^l_{12} = \frac{1}{q^*}\, \mathcal{C}(c^{l-1}_{12}, q^*, q^* \mid \sigma_w, \sigma_b). \qquad (16)$$

Direct calculation reveals that this map sends c^{l−1}_12 = 1 to c^l_12 = 1, as expected. Of particular interest is the slope χ1 of this map at 1. A direct, if tedious, calculation shows that

$$\frac{\partial c^l_{12}}{\partial c^{l-1}_{12}} = \sigma_w^2 \int \mathcal{D}z_1\, \mathcal{D}z_2\, \phi'(u_1)\, \phi'(u_2). \qquad (17)$$

To obtain this result, one has to apply the chain rule and product rule from calculus, as well as employ the identity

$$\int \mathcal{D}z\, F(z)\, z = \int \mathcal{D}z\, F'(z), \qquad (18)$$

which can be obtained via integration by parts. Evaluating the derivative at 1 yields

$$\chi_1 \equiv \left. \frac{\partial c^l_{12}}{\partial c^{l-1}_{12}} \right|_{c=1} = \sigma_w^2 \int \mathcal{D}z \left[ \phi'\!\left(\sqrt{q^*}\, z\right) \right]^2. \qquad (19)$$
Consider a translation invariant manifold, or 1D curve, h(θ) ∈ R^N that lies on some constant radius sphere, so that

$$q(\theta_1, \theta_2) = Q(\theta_1 - \theta_2) = \mathbf{h}(\theta_1) \cdot \mathbf{h}(\theta_2), \qquad (20)$$

with Q(0) = N q*. At large N, the inner-product structure of translation invariant manifolds remains approximately translation invariant as it propagates through the network. Therefore, at large N, we can express inner products of derivatives of h in terms of derivatives of Q. For example, the Euclidean metric g^E is given by

$$g^E(\theta) = \partial_\theta \mathbf{h}(\theta) \cdot \partial_\theta \mathbf{h}(\theta) = -\ddot{Q}(0). \qquad (21)$$

Here, each dot is shorthand for a derivative with respect to θ. Also, the extrinsic curvature is

$$\kappa(\theta) = \sqrt{ \frac{(\mathbf{v} \cdot \mathbf{v})(\mathbf{a} \cdot \mathbf{a}) - (\mathbf{v} \cdot \mathbf{a})^2}{(\mathbf{v} \cdot \mathbf{v})^3} }, \qquad (22)$$
Assume H^1(∆t) is an even function with H^1(0) = 0, so that its Taylor expansion can be written as

$$H^1(\Delta t) = \tfrac{1}{2} \ddot{H}^1(0)\, \Delta t^2 + \tfrac{1}{4!} \ddddot{H}^1(0)\, \Delta t^4 + \cdots.$$

We are interested in determining how the second and fourth derivatives of H propagate under composition with another function G, so that H^2 = G(H^1(∆t)). We assume G(0) = 0. We can use the chain rule and the product rule to derive:

$$\ddot{H}^2(0) = \dot{G}(0)\, \ddot{H}^1(0), \qquad (24)$$

$$\ddddot{H}^2(0) = 3\, \ddot{G}(0)\, \ddot{H}^1(0)^2 + \dot{G}(0)\, \ddddot{H}^1(0). \qquad (25)$$

We now apply the above iterations with H^1(θ) = 1 − c(θ) and G(c) = 1 − (1/q*) C(1 − c, q*, q* | σw, σb). Clearly G(0) = 0, and the symmetric H^1 obeys H^1(0) = 0, satisfying the assumptions behind the above iterations of second and fourth derivatives. Taking into account these derivative recursions, using the expressions for κ and g^E in terms of derivatives of c(θ) at 0, and carefully accounting for factors of q* and N, we obtain the final evolution equations that have been successfully tested against experiments:

$$\bar g^{E,l} = \chi_1\, \bar g^{E,l-1}, \qquad (26)$$

$$(\bar\kappa^l)^2 = 3\, \frac{\chi_2}{\chi_1^2} + \frac{1}{\chi_1}\, (\bar\kappa^{l-1})^2, \qquad (27)$$

where χ1 is the stretch factor defined in (19) and χ2 is defined analogously as

$$\chi_2 = \sigma_w^2 \int \mathcal{D}z \left[ \phi''\!\left(\sqrt{q^*}\, z\right) \right]^2. \qquad (28)$$

χ2 is closely related to the second derivative of the correlation coefficient map in (16) at c^{l−1}_12 = 1. Indeed, this second derivative is χ2 q*.
C.1 Upper bound on Euclidean length
Here, we derive such an upper bound on the Euclidean length for a very general class of nonlinearities φ(h). We simply assume that (1) φ(h) is monotonically non-decreasing (so that φ'(h) ≥ 0 for all h) and (2) φ has bounded dynamic range R, i.e. max_h φ(h) − min_h φ(h) = R. The Euclidean length in hidden space is

$$L^E = \int d\theta \sqrt{ \sum_{i=1}^{N_1} \left( \partial_\theta x^1_i(\theta) \right)^2 } \;\le\; \sum_{i=1}^{N_1} \int d\theta\, \left| \partial_\theta x^1_i(\theta) \right|, \qquad (29)$$

where the inequality follows from the triangle inequality. Now suppose that for any i, ∂θ x^1_i(θ) never changes sign across θ. Furthermore, assume that θ ranges from 0 to Θ. Then

$$\int_0^\Theta d\theta\, \left| \partial_\theta x^1_i(\theta) \right| = \left| x^1_i(\Theta) - x^1_i(0) \right| \le \max_h \phi(h) - \min_h \phi(h) = R. \qquad (30)$$

More generally, let r^1 denote the maximal number of times that any one neuron has a change in sign of the derivative ∂θ x^1_i(θ) across θ. Then applying the above argument to each segment of constant sign yields

$$\int_0^\Theta d\theta\, \left| \partial_\theta x^1_i(\theta) \right| \le (1 + r^1)\, R. \qquad (31)$$

Now how many times can ∂θ x^1_i(θ) change sign? Since ∂θ x^1_i(θ) = φ'(h_i) ∂θ h_i, where ∂θ h_i(θ) = [W^1 ∂θ x^0(θ)]_i, and φ(h_i) is monotonically increasing, the number of times ∂θ x^1_i(θ) changes sign equals the number of times the input ∂θ h_i(θ) changes sign. In turn, suppose s^0 is the maximal number of times any one dimensional projection of the derivative vector ∂θ x^0(θ) changes sign across θ. Then the number of times the sign of ∂θ h_i(θ) changes for any i cannot exceed s^0, because h_i is a linear projection of x^0. Together this implies r^1 ≤ s^0. We have thus proven:

$$L^E \le N_1 (1 + s^0)\, R. \qquad (32)$$
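A quick numerical sanity check of this bound (a hypothetical sketch, not part of the paper's code): push a circle through a single random tanh layer, measure L^E of x^1(θ) by finite differences, and compare against N_1(1 + s^0)R, with s^0 counted empirically from sign changes of random 1D projections of ∂θ x^0(θ).

```python
import numpy as np

rng = np.random.default_rng(0)
N0, N1, n_theta = 200, 200, 2000
theta = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
dtheta = theta[1] - theta[0]

# circle input x^0(theta) in a random 2D subspace of R^N0
basis, _ = np.linalg.qr(rng.standard_normal((N0, 2)))
x0 = np.outer(np.cos(theta), basis[:, 0]) + np.outer(np.sin(theta), basis[:, 1])

# one random layer with tanh activations, so the dynamic range is R = 2
W1 = rng.standard_normal((N1, N0)) * 4.0 / np.sqrt(N0)
b1 = rng.standard_normal(N1) * 0.3
x1 = np.tanh(x0 @ W1.T + b1)

# Euclidean length of x^1(theta) via periodic central differences
v1 = (np.roll(x1, -1, axis=0) - np.roll(x1, 1, axis=0)) / (2.0 * dtheta)
LE = np.sum(np.sqrt(np.sum(v1**2, axis=1))) * dtheta

# empirical s^0: max number of sign changes over random 1D projections of d x^0 / d theta
v0 = (np.roll(x0, -1, axis=0) - np.roll(x0, 1, axis=0)) / (2.0 * dtheta)
proj = v0 @ rng.standard_normal((N0, 100))
s0 = int(np.max(np.sum(np.abs(np.diff(np.sign(proj), axis=0)) > 0, axis=0)))

print(f"L^E = {LE:.1f}  <=  N1 * (1 + s0) * R = {N1 * (1 + s0) * 2}")
```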
D Simulation details
All neural network simulations were implemented in Keras and Theano. For all simulations (except
Figure 5C), we used inputs and hidden layers with a width of 1,000 and tanh activations. We found
that our results were mostly insensitive to width, but using larger widths decreased the fluctuations in
the averaged quantities. Simulation error bars are all standard deviations, with the variance computed
across the different inputs h^1(θ). Unless otherwise mentioned, the weights in the network are initialized in the chaotic regime with σb = 0.3, σw = 4.0.
Computing κ(θ) requires the computation of the velocity and acceleration vectors, corresponding to
the first and second derivatives of the neural network hl (θ) with respect to θ. As θ is always one-
dimensional, we can greatly speed up these computations by using forward-mode auto-differentiation,
evaluating the Jacobian and Hessian in a feedforward manner. We implemented this using the R-op
in Theano.
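Theano's R-op is no longer maintained; as an illustration of the same forward-mode idea, here is a hypothetical sketch using jax.jvp to obtain the velocity and acceleration of a network output with respect to the scalar θ via two nested forward-mode passes.

```python
import jax

def velocity_and_acceleration(h_of_theta, theta):
    """Forward-mode first and second derivatives of h(theta) with respect to a scalar theta.

    h_of_theta: callable mapping a scalar theta (a Python float) to a vector of layer inputs h^l(theta).
    Returns (h, v, a) = (h(theta), dh/dtheta, d^2 h/dtheta^2).
    """
    def h_and_v(t):
        # one forward-mode pass gives both h(t) and its directional derivative along dt = 1
        return jax.jvp(h_of_theta, (t,), (1.0,))

    (h, v), (_, a) = jax.jvp(h_and_v, (theta,), (1.0,))   # differentiating (h, v) again yields (v, a)
    return h, v, a
```

Here h_of_theta could, for instance, wrap the circle embedding composed with the network forward pass; κ(θ), g^E, and g^G then follow from (v, a) as in the main text.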
To identify the curvature of the decision boundary, we first had to identify points that lay along the decision boundary. We randomly initialized data points and then minimized G(x^D(x^l))² with respect to the input x^l using Adam. This yields a set of inputs x^l on the boundary, at which we compute the Jacobian and Hessian of G(x^D(x^l)) to evaluate principal curvatures.
To evaluate the set of functions reachable by a network, we first parameterized function space using a Fourier basis up to a particular maximum frequency, ω_max, on a sampled set of one dimensional inputs parameterized by θ. We then took the output activations of each neural network and linearly regressed the output activations onto each Fourier basis function. For each basis function, we computed the angle between the predicted basis vector and the true basis vector. These are the quantities that appear in Figure 5C-D. Given any function with bounded frequency, we can represent it in this Fourier basis, and decompose the error in the prediction of the function into the error in prediction of each Fourier component. Thus error in predicting the Fourier basis is a reasonable proxy for error in prediction of functions with bounded frequency.
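A hypothetical sketch of that measurement: build a Fourier design matrix on the θ grid, least-squares regress each basis function onto the output activations, and record a per-frequency error (here, one minus the cosine of the angle between the predicted and true basis vectors).

```python
import numpy as np

def fourier_fit_error(activations, theta, omega_max):
    """Per-frequency reconstruction error of Fourier basis functions from network activations.

    activations: array of shape (n_theta, N) holding output-layer activations evaluated on theta.
    Returns one error value per basis function (constant, cos(w*theta), sin(w*theta), w <= omega_max).
    """
    basis = [np.ones_like(theta)]
    for w in range(1, omega_max + 1):
        basis += [np.cos(w * theta), np.sin(w * theta)]
    B = np.stack(basis, axis=1)                               # (n_theta, 2*omega_max + 1)
    coef, *_ = np.linalg.lstsq(activations, B, rcond=None)    # regress each basis column onto activations
    pred = activations @ coef
    cos_angle = np.sum(pred * B, axis=0) / (np.linalg.norm(pred, axis=0) * np.linalg.norm(B, axis=0))
    return 1.0 - np.abs(cos_angle)
```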
originates in the weights in layer l = 2, chosen as

$$\mathbf{W}^2(\Delta) = \sqrt{1 - |\Delta|}\; \mathbf{W} + \sqrt{|\Delta|}\; d\mathbf{W}. \qquad (33)$$

Here both a base matrix W and a perturbation matrix dW have matrix elements that are zero mean i.i.d. Gaussians with variance σ_w². Each matrix element of W^2(∆) thus also has variance σ_w², just like all the other layers. In turn, this family of networks induces a family of functions h^D(h^1, ∆). For simplicity, we restrict these functions to a simple input manifold, the circle,

$$\mathbf{h}^1(\theta) = \sqrt{N_1 q^*} \left[ \mathbf{u}^0 \cos(\theta) + \mathbf{u}^1 \sin(\theta) \right], \qquad (34)$$

as considered previously. This circle is at the fixed point radius q*(σw, σb), and the family of networks induces a family of functions from the circle to the hidden representation space in layer l, namely R^{N_l}. We denote these functions by h^l(θ, ∆). How similar are these functions as ∆ changes? This can be quantified through the correlation in function space

$$Q^l(\Delta_1, \Delta_2) \equiv \int \frac{d\theta}{2\pi}\, \frac{1}{N_D} \sum_{i=1}^{N_D} h^l_i(\theta, \Delta_1)\, h^l_i(\theta, \Delta_2), \qquad (35)$$
where C^l(∆) = Q^l(0, ∆)/q*. The initial condition for this recursion is C^1(∆) = 1, since the family of functions in the first layer of inputs is independent of ∆. Now, the difference in weights at a nonzero ∆ reduces the function space correlation to C^2(∆) < 1. At this point, the representation in h^2 is different for the two networks at parameter values 0 and ∆. Moreover, in the chaotic regime, this difference will amplify due to the similarity between the function space evolution equation in (37) and the evolution equation for the similarity of two points in (15). In essence, just as two points in the input exponentially separate as they propagate through a single network in the chaotic regime, a pair of different functions separate when computed in the final layer. Thus a small perturbation in the weights into layer 2 can yield a very large change in the space of functions from the input manifold to layer D. Moreover, as ∆ varies from −1 to 1, the function h^D(θ, ∆) roughly undergoes a random walk in function space whose autocorrelation length decreases exponentially with depth D. This weight chaos, or sensitive dependence of the function computed by a deep network on weight changes far from the final layer, is another manifestation of deep neural expressivity. Our companion paper [16] further explores the expressivity of deep random networks in function space and also finds an exponential growth in expressivity with depth.
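A minimal sketch of this experiment (hypothetical code; parameter values are illustrative): interpolate the layer-2 weights as in (33) and estimate the function-space correlation between the networks at ∆ = 0 and ∆.

```python
import numpy as np

def function_space_correlation(delta, depth=10, sigma_w=4.0, sigma_b=0.3,
                               N=1000, n_theta=200, q_star=1.0, seed=0):
    """Estimate Q^D(0, delta) / q*, the correlation between h^D(theta, 0) and h^D(theta, delta)."""
    rng = np.random.default_rng(seed)
    theta = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    U, _ = np.linalg.qr(rng.standard_normal((N, 2)))
    h1 = np.sqrt(N * q_star) * (np.outer(np.cos(theta), U[:, 0]) + np.outer(np.sin(theta), U[:, 1]))

    # shared weights and biases for every layer; layer 2 additionally gets a perturbation direction dW
    Ws = [rng.standard_normal((N, N)) * sigma_w / np.sqrt(N) for _ in range(depth)]
    bs = [rng.standard_normal(N) * sigma_b for _ in range(depth)]
    dW = rng.standard_normal((N, N)) * sigma_w / np.sqrt(N)

    def forward(d):
        h = h1
        for l, (W, b) in enumerate(zip(Ws, bs)):
            if l == 0:   # weights into layer 2, interpolated as in (33)
                W = np.sqrt(1.0 - abs(d)) * W + np.sqrt(abs(d)) * dW
            h = np.tanh(h) @ W.T + b
        return h

    hA, hB = forward(0.0), forward(delta)
    return np.mean(hA * hB) / q_star
```

Sweeping delta from 0 to 1 at increasing depth should show the function-space correlation decaying faster for deeper networks in the chaotic regime, mirroring the input-space decorrelation in Fig. 2.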