Group Equivariant Convolutional Networks

Taco S. Cohen (T.S.COHEN@UVA.NL)
University of Amsterdam

Max Welling (M.WELLING@UVA.NL)
University of Amsterdam; University of California Irvine; Canadian Institute for Advanced Research

arXiv:1602.07576v3 [cs.LG] 3 Jun 2016
Abstract

We introduce Group equivariant Convolutional Neural Networks (G-CNNs), a natural generalization of convolutional neural networks that reduces sample complexity by exploiting symmetries. G-CNNs use G-convolutions, a new type of layer that enjoys a substantially higher degree of weight sharing than regular convolution layers. G-convolutions increase the expressive capacity of the network without increasing the number of parameters. Group convolution layers are easy to use and can be implemented with negligible computational overhead for discrete groups generated by translations, reflections and rotations. G-CNNs achieve state of the art results on CIFAR10 and rotated MNIST.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

1. Introduction

Deep convolutional neural networks (CNNs, convnets) have proven to be very powerful models of sensory data such as images, video, and audio. Although a strong theory of neural network design is currently lacking, a large amount of empirical evidence supports the notion that both convolutional weight sharing and depth (among other factors) are important for good predictive performance.

Convolutional weight sharing is effective because there is a translation symmetry in most perception tasks: the label function and data distribution are both approximately invariant to shifts. By using the same weights to analyze or model each part of the image, a convolution layer uses far fewer parameters than a fully connected one, while preserving the capacity to learn many useful transformations.

Convolution layers can be used effectively in a deep network because all the layers in such a network are translation equivariant: shifting the image and then feeding it through a number of layers is the same as feeding the original image through the same layers and then shifting the resulting feature maps (at least up to edge-effects). In other words, the symmetry (translation) is preserved by each layer, which makes it possible to exploit it not just in the first, but also in higher layers of the network.

In this paper we show how convolutional networks can be generalized to exploit larger groups of symmetries, including rotations and reflections. The notion of equivariance is key to this generalization, so in section 2 we will discuss this concept and its role in deep representation learning. After discussing related work in section 3, we recall a number of mathematical concepts in section 4 that allow us to define and analyze the G-convolution in a generic manner. In section 5, we analyze the equivariance properties of standard CNNs, and show that they are equivariant to translations but may fail to equivary with more general transformations. Using the mathematical framework from section 4, we can define G-CNNs (section 6) by analogy to standard CNNs (the latter being the G-CNN for the translation group). We show that G-convolutions, as well as various kinds of layers used in modern CNNs, such as pooling, arbitrary pointwise nonlinearities, batch normalization and residual blocks, are all equivariant, and thus compatible with G-CNNs. In section 7 we provide concrete implementation details for group convolutions.

In section 8 we report experimental results on MNIST-rot and CIFAR10, where G-CNNs achieve state of the art results (2.28% error on MNIST-rot, and 4.19% resp. 6.46% on augmented and plain CIFAR10). We show that replacing planar convolutions with G-convolutions consistently improves results without additional tuning. In section 9 we provide a discussion of these results and consider several extensions of the method, before concluding in section 10.

2. Structured & Equivariant Representations

Deep neural networks produce a sequence of progressively more abstract representations by mapping the input through a series of parameterized functions (LeCun et al., 2015). In the current generation of neural networks, the representation spaces are usually endowed with very minimal internal structure, such as that of a linear space R^n.

In this paper we construct representations that have the structure of a linear G-space, for some chosen group G. This means that each vector in the representation space has a pose associated with it, which can be transformed by the elements of some group of transformations G. This additional structure allows us to model data more efficiently: a filter in a G-CNN detects co-occurrences of features that have the preferred relative pose, and can match such a feature constellation in every global pose through an operation called the G-convolution.

A representation space can obtain its structure from other representation spaces to which it is connected. For this to work, the network or layer Φ that maps one representation to another should be structure preserving. For G-spaces this means that Φ has to be equivariant:

    Φ(T_g x) = T'_g Φ(x).    (1)

That is, transforming an input x by a transformation g (forming T_g x) and then passing it through the learned map Φ should give the same result as first mapping x through Φ and then transforming the representation.

Equivariance can be realized in many ways, and in particular the operators T and T' need not be the same. The only requirement for T and T' is that for any two transformations g and h, we have T(gh) = T(g)T(h) (i.e. T is a linear representation of G).
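As a minimal numerical sketch of this requirement (assuming numpy; the group here is the cyclic group of 1-D shifts, realized by np.roll):

    import numpy as np

    # T_g acts on length-8 signals by cyclic shift; composing the operators
    # matches composing the group elements: T(g)T(h) = T(g + h).
    def T(g):
        return lambda x: np.roll(x, g)

    x = np.random.randn(8)
    g, h = 2, 3
    assert np.allclose(T(g)(T(h)(x)), T(g + h)(x))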
From equation 1 we see that the familiar concept of invariance is a special kind of equivariance where T'_g is the identity transformation for all g. In deep learning, general equivariance is more useful than invariance because it is impossible to determine if features are in the right spatial configuration if they are invariant.

Besides improving statistical efficiency and facilitating geometrical reasoning, equivariance to symmetry transformations constrains the network in a way that can aid generalization. A network Φ can be non-injective, meaning that non-identical vectors x and y in the input space become identical in the output space (for example, two instances of a face may be mapped onto a single vector indicating the presence of any face). If Φ is equivariant, then the G-transformed inputs T_g x and T_g y must also be mapped to the same output. Their "sameness" (as judged by the network) is preserved under symmetry transformations.

3. Related Work

There is a large body of literature on invariant representations. Invariance can be achieved by pose normalization using an equivariant detector (Lowe, 2004; Jaderberg et al., 2015) or by averaging a possibly nonlinear function over a group (Reisert, 2008; Skibbe, 2013; Manay et al., 2006; Kondor, 2007).

Scattering convolution networks use wavelet convolutions, nonlinearities and group averaging to produce stable invariants (Bruna & Mallat, 2013). Scattering networks have been extended to use convolutions on the group of translations, rotations and scalings, and have been applied to object and texture recognition (Sifre & Mallat, 2013; Oyallon & Mallat, 2015).

A number of recent works have addressed the problem of learning or constructing equivariant representations. This includes work on transforming autoencoders (Hinton et al., 2011), equivariant Boltzmann machines (Kivinen & Williams, 2011; Sohn & Lee, 2012), equivariant descriptors (Schmidt & Roth, 2012), and equivariant filtering (Skibbe, 2013).

Lenc & Vedaldi (2015) show that the AlexNet CNN (Krizhevsky et al., 2012) trained on imagenet spontaneously learns representations that are equivariant to flips, scaling and rotation. This supports the idea that equivariance is a good inductive bias for deep convolutional networks. Agrawal et al. (2015) show that useful representations can be learned in an unsupervised manner by training a convolutional network to be equivariant to ego-motion.

Anselmi et al. (2014; 2015) use the theory of locally compact topological groups to develop a theory of statistically efficient learning in sensory cortex. This theory was implemented for the commutative group consisting of time- and vocal tract length shifts for an application to speech recognition by Zhang et al. (2015).

Gens & Domingos (2014) proposed an approximately equivariant convolutional architecture that uses sparse, high-dimensional feature maps to deal with high-dimensional groups of transformations. Dieleman et al. (2015) showed that rotation symmetry can be exploited in convolutional networks for the problem of galaxy morphology prediction by rotating feature maps, effectively learning an equivariant representation. This work was later extended (Dieleman et al., 2016) and evaluated on various computer vision problems that have cyclic symmetry.

Cohen & Welling (2014) showed that the concept of disentangling can be understood as a reduction of the operators T_g in an equivariant representation, and later related this notion of disentangling to the more familiar statistical notion of decorrelation (Cohen & Welling, 2015).

4. Mathematical Framework

In this section we present a mathematical framework that enables a simple and generic definition and analysis of G-CNNs for various groups G. We begin by defining symmetry groups, and study in particular two groups that are used in the G-CNNs we have built so far. Then we take a look at functions on groups (used to model feature maps in G-CNNs) and their transformation properties.

4.1. Symmetry Groups

A symmetry of an object is a transformation that leaves the object invariant. For example, if we take the sampling grid of our image, Z², and flip it over we get −Z² = {(−n, −m) | (n, m) ∈ Z²} = Z². So the flipping operation is a symmetry of the sampling grid.

If we have two symmetry transformations g and h and we compose them, the result gh is another symmetry transformation (i.e. it leaves the object invariant as well). Furthermore, the inverse g⁻¹ of any symmetry is also a symmetry, and composing it with g gives the identity transformation e. A set of transformations with these properties is called a symmetry group.

One simple example of a group is the set of 2D integer translations, Z². Here the group operation ("composition of transformations") is addition: (n, m) + (p, q) = (n + p, m + q). One can verify that the sum of two translations is again a translation, and that the inverse (negative) of a translation is a translation, so this is indeed a group.

Although it may seem fancy to call 2-tuples of integers a group, this is helpful in our case because, as we will see in section 6, a useful notion of convolution can be defined for functions on any group (at least, on any locally compact group), of which Z² is only one example. The important properties of the convolution, such as equivariance, arise primarily from the group structure.

4.2. The group p4

The group p4 consists of all compositions of translations and rotations by 90 degrees about any center of rotation in a square grid. A convenient parameterization of this group in terms of three integers r, u, v is

    g(r, u, v) = [ cos(rπ/2)  −sin(rπ/2)  u ]
                 [ sin(rπ/2)   cos(rπ/2)  v ] ,    (2)
                 [ 0           0          1 ]

where 0 ≤ r < 4 and (u, v) ∈ Z². The group operation is given by matrix multiplication.

The composition and inversion operations could also be represented directly in terms of integers (r, u, v), but the equations are cumbersome. Hence, our preferred method of composing two group elements represented by integer tuples is to convert them to matrices, multiply these matrices, and then convert the resulting matrix back to a tuple of integers (using the atan2 function to obtain r).

The group p4 acts on points in Z² (pixel coordinates) by multiplying the matrix g(r, u, v) by the homogeneous coordinate vector x(u′, v′) of a point (u′, v′):

    gx ≃ [ cos(rπ/2)  −sin(rπ/2)  u ] [ u′ ]
         [ sin(rπ/2)   cos(rπ/2)  v ] [ v′ ]    (3)
         [ 0           0          1 ] [ 1  ]

4.3. The group p4m

The group p4m consists of all compositions of translations, mirror reflections, and rotations by 90 degrees about any center of rotation in the grid. Like p4, we can parameterize this group by integers:

    g(m, r, u, v) = [ (−1)^m cos(rπ/2)  −(−1)^m sin(rπ/2)  u ]
                    [ sin(rπ/2)          cos(rπ/2)         v ] ,
                    [ 0                  0                 1 ]

where m ∈ {0, 1}, 0 ≤ r < 4 and (u, v) ∈ Z². The reader may verify that this is indeed a group.

Again, composition is most easily performed using the matrix representation. Computing r, u, v from a given matrix g can be done using the same method we use for p4, and for m we have m = (1 − det(g))/2.
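A minimal sketch of this recipe, assuming numpy (not part of the original implementation; p4 is the special case m = 0):

    import numpy as np

    def p4m_matrix(m, r, u, v):
        # Homogeneous 3x3 matrix of the p4m element g(m, r, u, v).
        c, s = np.cos(r * np.pi / 2), np.sin(r * np.pi / 2)
        return np.round([[(-1) ** m * c, -(-1) ** m * s, u],
                         [s,             c,              v],
                         [0,             0,              1]]).astype(int)

    def p4m_tuple(g):
        # Recover (m, r, u, v) from a p4m matrix: m from the determinant,
        # r via atan2, and (u, v) from the last column.
        m = int(round((1 - np.linalg.det(g)) / 2))
        r = int(round(np.arctan2(g[1, 0], g[1, 1]) / (np.pi / 2))) % 4
        return m, r, int(g[0, 2]), int(g[1, 2])

    def compose(a, b):
        # Compose two p4m elements given as (m, r, u, v) tuples.
        return p4m_tuple(p4m_matrix(*a) @ p4m_matrix(*b))

For instance, compose((0, 1, 0, 0), (0, 1, 1, 0)) returns (0, 2, -1, 1): two 90-degree rotations compose to a 180-degree rotation, with the translation part rotated accordingly.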
4.4. Functions on groups

We model images and stacks of feature maps in a conventional CNN as functions f : Z² → R^K supported on a bounded (typically rectangular) domain. At each pixel coordinate (p, q) ∈ Z², the stack of feature maps returns a K-dimensional vector f(p, q), where K denotes the number of channels.

Although the feature maps must always be stored in finite arrays, modeling them as functions that extend to infinity (while being non-zero on a finite region only) simplifies the mathematical analysis of CNNs.

We will be concerned with transformations of the feature maps, so we introduce the following notation for a transformation g acting on a set of feature maps:

    [L_g f](x) = [f ∘ g⁻¹](x) = f(g⁻¹x)    (4)

Computationally, this says that to get the value of the g-transformed feature map L_g f at the point x, we need to do a lookup in the original feature map f at the point g⁻¹x, which is the unique point that gets mapped to x by g.

This operator L_g is a concrete instantiation of the transformation operator T_g referenced in section 2, and one may verify that

    L_g L_h = L_{gh}.    (5)

If g represents a pure translation t = (u, v) ∈ Z², then g⁻¹x simply means x − t. The inverse on g in equation 4 ensures that the function is shifted in the positive direction when using a positive translation, and that L_g satisfies the criterion for being a homomorphism (eq. 5) even for transformations g and h that do not commute (i.e. gh ≠ hg).

As will be explained in section 6.1, feature maps in a G-CNN are functions on the group G, instead of functions on the group Z². For functions on G, the definition of L_g is still valid if we simply replace x (an element of Z²) by h (an element of G), and interpret g⁻¹h as composition.

It is easy to mentally visualize a planar feature map f : Z² → R undergoing a transformation, but we are not used to visualizing functions on groups. To visualize a feature map or filter on p4, we plot the four patches associated with the four pure rotations on a circle, as shown in figure 1 (left). Each pixel in this figure has a rotation coordinate (the patch in which the pixel appears), and two translation coordinates (the pixel position within the patch).

[Figure 1. A p4 feature map and its rotation by r.]

When we apply the 90 degree rotation r to a function on p4, each planar patch follows its red r-arrow (thus incrementing the rotation coordinate by 1 (mod 4)), and simultaneously undergoes a 90-degree rotation. The result of this operation is shown on the right of figure 1. As we will see in section 6, a p4 feature map in a p4-CNN undergoes exactly this motion under rotation of the input image.
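In array terms this motion is two cheap operations. A minimal sketch, assuming numpy and a p4 feature map stored with shape (4, H, W), rotation coordinate first (the direction conventions of np.rot90 and np.roll below are one possible choice, not the paper's code):

    import numpy as np

    def rotate_p4_feature_map(f):
        # Apply the 90-degree rotation r to a p4 feature map f of shape (4, H, W):
        # each patch moves to the next rotation coordinate (mod 4) and is itself
        # rotated by 90 degrees in the plane.
        return np.roll(np.rot90(f, k=1, axes=(1, 2)), shift=1, axis=0)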
For p4m, we can make a similar plot, shown in figure 2. A p4m function has 8 planar patches, each one associated with a mirroring m and a rotation r. Besides red rotation arrows, the figure now includes small blue reflection lines (which are undirected, since reflections are self-inverse).

Upon rotation of a p4m function, each patch again follows its red r-arrows and undergoes a 90 degree rotation. Under a mirroring, the patches connected by a blue line change places and undergo the mirroring transformation.

[Figure 2. A p4m feature map and its rotation by r.]

This rich transformation structure arises from the group operation of p4 or p4m, combined with equation 4, which describes the transformation of a function on a group.

Finally, we define the involution of a feature map, which will appear in section 6.1 when we study the behavior of the G-convolution, and which also appears in the gradient of the G-convolution. We have:

    f*(g) = f(g⁻¹)    (6)

For Z² feature maps the involution is just a point reflection, but for G-feature maps the meaning depends on the structure of G. In all cases, f** = f.

5. Equivariance properties of CNNs

In this section we recall the definitions of the convolution and correlation operations used in conventional CNNs, and show that these operations are equivariant to translations but not to other transformations such as rotation. This is certainly well known and easy to see by mental visualization, but deriving it explicitly will make it easier to follow the derivation of group equivariance of the group convolution defined in the next section.

At each layer l, a regular convnet takes as input a stack of feature maps f : Z² → R^{K^l} and convolves or correlates it with a set of K^{l+1} filters ψ^i : Z² → R^{K^l}:

    [f ∗ ψ^i](x) = Σ_{y∈Z²} Σ_{k=1}^{K^l} f_k(y) ψ^i_k(x − y)
    [f ⋆ ψ^i](x) = Σ_{y∈Z²} Σ_{k=1}^{K^l} f_k(y) ψ^i_k(y − x)    (7)

If one employs convolution (∗) in the forward pass, the correlation (⋆) will appear in the backward pass when computing gradients, and vice versa. We will use the correlation in the forward pass, and refer generically to both operations as "convolution".

Using the substitution y → y + t, and leaving out the summation over feature maps for clarity, we see that a translation followed by a correlation is the same as a correlation followed by a translation:

    [[L_t f] ⋆ ψ](x) = Σ_y f(y − t) ψ(y − x)
                     = Σ_y f(y) ψ(y + t − x)
                     = Σ_y f(y) ψ(y − (x − t))
                     = [L_t[f ⋆ ψ]](x).    (8)

And so we say that "correlation is an equivariant map for the translation group", or that "correlation and translation commute". Using an analogous computation one can show that also for the convolution, [L_t f] ∗ ψ = L_t[f ∗ ψ].

Although convolutions are equivariant to translation, they are not equivariant to other isometries of the sampling lattice. For instance, as shown in the supplementary material, rotating the image and then convolving with a fixed filter is not the same as first convolving and then rotating the result:

    [[L_r f] ⋆ ψ](x) = L_r[f ⋆ [L_{r⁻¹}ψ]](x)    (9)

In words, this says that the correlation of a rotated image L_r f with a filter ψ is the same as the rotation by r of the original image f convolved with the inverse-rotated filter L_{r⁻¹}ψ. Hence, if an ordinary CNN learns rotated copies of the same filter, the stack of feature maps is equivariant, although individual feature maps are not.
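Both identities are easy to check numerically. A minimal sketch, assuming numpy and scipy; periodic boundary conditions (mode='wrap') make shifts and 90-degree rotations exact symmetries of the sampling grid:

    import numpy as np
    from scipy.ndimage import correlate

    f = np.random.randn(16, 16)
    psi = np.random.randn(3, 3)

    # Eq. 8: shift-then-correlate equals correlate-then-shift.
    t = (2, 5)
    lhs = correlate(np.roll(f, t, axis=(0, 1)), psi, mode='wrap')
    rhs = np.roll(correlate(f, psi, mode='wrap'), t, axis=(0, 1))
    assert np.allclose(lhs, rhs)

    # Eq. 9: correlating a rotated image with psi equals rotating the
    # correlation of the original image with the inverse-rotated filter.
    lhs = correlate(np.rot90(f), psi, mode='wrap')
    rhs = np.rot90(correlate(f, np.rot90(psi, k=-1), mode='wrap'))
    assert np.allclose(lhs, rhs)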
6. Group Equivariant Networks

In this section we will define the three layers used in a G-CNN (G-convolution, G-pooling, nonlinearity), and show that each one commutes with G-transformations of the domain of the image.

6.1. G-Equivariant correlation

The correlation (eq. 7) is computed by shifting a filter and then computing a dot product with the feature maps. By replacing the shift by a more general transformation from some group G, we get the G-correlation used in the first layer of a G-CNN:

    [f ⋆ ψ](g) = Σ_{y∈Z²} Σ_k f_k(y) ψ_k(g⁻¹y).    (10)

Notice that both the input image f and the filter ψ are functions on the plane Z², but the feature map f ⋆ ψ is a function on the discrete group G (which may contain translations as a subgroup). Hence, for all layers after the first, the filters ψ must also be functions on G, and the correlation operation becomes

    [f ⋆ ψ](g) = Σ_{h∈G} Σ_k f_k(h) ψ_k(g⁻¹h).    (11)
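A direct transcription of eq. 10 for G = p4 makes the indexing explicit; a minimal sketch, assuming numpy/scipy, a single input channel and filter for brevity, and one possible convention for the rotation direction of L_s:

    import numpy as np
    from scipy.ndimage import correlate

    def p4_correlation_layer1(f, psi):
        # First-layer p4-correlation (eq. 10) of an image f of shape (H, W) with
        # a planar filter psi of shape (n, n). The output is indexed by a group
        # element g = ts: shape (4, H, W), one response map per rotation s.
        out = np.empty((4,) + f.shape)
        for s in range(4):
            rotated = np.rot90(psi, k=s)                     # L_s applied to psi
            out[s] = correlate(f, rotated, mode='constant')  # all translations t
        return out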
The equivariance of the G-correlation is derived in complete analogy to eq. 8, now using the substitution h → uh:

    [[L_u f] ⋆ ψ](g) = Σ_{h∈G} Σ_k f_k(u⁻¹h) ψ_k(g⁻¹h)
                     = Σ_{h∈G} Σ_k f_k(h) ψ_k(g⁻¹uh)
                     = Σ_{h∈G} Σ_k f_k(h) ψ_k((u⁻¹g)⁻¹h)
                     = [L_u[f ⋆ ψ]](g)    (12)

The equivariance of eq. 10 is derived similarly. Note that although equivariance is expressed by the same formula [L_u f] ⋆ ψ = L_u[f ⋆ ψ] for both the first-layer G-correlation (eq. 10) and the full G-correlation (eq. 11), the meaning of the operator L_u is different: for the first-layer correlation, the inputs f and ψ are functions on Z², so L_u f denotes the transformation of such a function, while L_u[f ⋆ ψ] denotes the transformation of the feature map, which is a function on G. For the full G-correlation, both the inputs f and ψ and the output f ⋆ ψ are functions on G.

Note that if G is not commutative, neither the G-convolution nor the G-correlation is commutative. However, the feature maps ψ ⋆ f and f ⋆ ψ are related by the involution (eq. 6):

    f ⋆ ψ = (ψ ⋆ f)*.    (13)

Since the involution is invertible (it is its own inverse), the information content of f ⋆ ψ and ψ ⋆ f is the same. However, f ⋆ ψ is more efficient to compute when using the method described in section 7, because transforming a small filter is faster than transforming a large feature map.

It is customary to add a bias term to each feature map in a convolution layer. This can be done for G-conv layers as well, as long as there is only one bias per G-feature map (instead of one bias per spatial feature plane within a G-feature map). Similarly, batch normalization (Ioffe & Szegedy, 2015) should be implemented with a single scale and bias parameter per G-feature map in order to preserve equivariance. The sum of two G-equivariant feature maps is also G-equivariant, so G-conv layers can be used in highway networks and residual networks (Srivastava et al., 2015; He et al., 2015).

6.2. Pointwise nonlinearities

Equation 12 shows that G-correlation preserves the transformation properties of the previous layer. What about nonlinearities and pooling?

Recall that we think of feature maps as functions on G. In this view, applying a nonlinearity ν : R → R to a feature map amounts to function composition. We introduce the

composition operator

    C_ν f(g) = [ν ∘ f](g) = ν(f(g)),    (14)

which acts on functions by post-composing them with ν.

Since the left transformation operator L acts by pre-composition, C and L commute:

    C_ν L_h f = ν ∘ [f ∘ h⁻¹] = [ν ∘ f] ∘ h⁻¹ = L_h C_ν f,    (15)

so the rectified feature map inherits the transformation properties of the previous layer.

6.3. Subgroup pooling and coset pooling

In order to simplify the analysis, we split the pooling operation into two steps: the pooling itself (performed without stride), and a subsampling step. The non-strided max-pooling operation applied to a feature map f : G → R can be modeled as an operator P that acts on f as

    P f(g) = max_{k∈gU} f(k),    (16)

where gU = {gu | u ∈ U} is the g-transformation of some pooling domain U ⊂ G (typically a neighborhood of the identity transformation). In a regular convnet, U is usually a 2 × 2 or 3 × 3 square including the origin (0, 0), and g is a translation.

As shown in the supplementary material, pooling commutes with L_h:

    P L_h = L_h P    (17)

Since pooling tends to reduce the variation in a feature map, it makes sense to sub-sample the pooled feature map, or equivalently, to do a "pooling with stride". In a G-CNN, the notion of "stride" is generalized by subsampling on a subgroup H ⊂ G. That is, H is a subset of G that is itself a group (i.e. closed under multiplication and inverses). The subsampled feature map is then equivariant to H but not G.

In a standard convnet, pooling with stride 2 is the same as pooling and then subsampling on H = {(2i, 2j) | (i, j) ∈ Z²}, which is a subgroup of G = Z². For the p4-CNN, we may subsample on the subgroup H containing all 4 rotations, as well as shifts by multiples of 2 pixels.

We can obtain full G-equivariance by choosing our pooling region U to be a subgroup H ⊂ G. The pooling domains gH that result are called cosets in group theory. The cosets partition the group into non-overlapping regions. The feature map that results from pooling over cosets is invariant to the right-action of H, because the cosets are similarly invariant (ghH = gH). Hence, we can arbitrarily choose one coset representative per coset to subsample on. The feature map that results from coset pooling may be thought of as a function on the quotient space G/H, in which two transformations are considered equivalent if they are related by a transformation in H.

As an example, in a p4 feature map, we can pool over all four rotations at each spatial position (the cosets of the subgroup R of rotations around the origin). The resulting feature map is a function on Z² ≅ p4/R, i.e. it will transform in the same way as the input image. Another example is given by a feature map on Z, where we could pool over the cosets of the subgroup nZ of shifts by multiples of n. This gives a feature map on Z/nZ, which has a cyclic transformation law under translations.
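Under the (4, H, W) array convention used in the sketches above, coset max-pooling over the rotation subgroup R is a single reduction; a minimal sketch, assuming numpy:

    import numpy as np

    def coset_pool_rotations(f):
        # Pool over the cosets of the rotation subgroup R: a p4 feature map of
        # shape (4, H, W) becomes a function on Z^2 = p4/R of shape (H, W),
        # which transforms in the same way as the input image.
        return f.max(axis=0)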
This concludes our analysis of G-CNNs. Since all layer types are equivariant, we can freely stack them into deep networks and expect G-conv parameter sharing to be effective at arbitrary depth.

7. Efficient Implementation

Computing the G-convolution involves nothing more than indexing arithmetic and inner products, so it can be implemented straightforwardly. Here we present the details for a G-convolution implementation that can leverage recent advances in fast computation of planar convolutions (Mathieu et al., 2014; Vasilache et al., 2015; Lavin & Gray, 2015).

A plane symmetry group G is called split if any transformation g ∈ G can be decomposed into a translation t ∈ Z² and a transformation s in the stabilizer of the origin (i.e. s leaves the origin invariant). For the group p4, we can write g = ts for t a translation and s a rotation about the origin, while p4m splits into translations and rotation-flips. Using this split of G and the fact that L_g L_h = L_{gh}, we can rewrite the G-correlation (eq. 10 and 11) as follows:

    f ⋆ ψ(ts) = Σ_{h∈X} Σ_k f_k(h) L_t[L_s ψ_k](h),    (18)

where X = Z² in layer one and X = G in further layers. Thus, to compute the p4 (or p4m) correlation f ⋆ ψ, we can first compute L_s ψ ("filter transformation") for all four rotations (or all eight rotation-flips) and then call a fast planar correlation routine on f and the augmented filter bank.

The computational cost of the algorithm presented here is roughly equal to that of a planar convolution with a filter bank that is the same size as the augmented filter bank used in the G-convolution, because the cost of the filter transformation is negligible.

7.1. Filter transformation

The set of filters at layer l is stored in an array F[·] of shape K^l × K^{l−1} × S^{l−1} × n × n, where K^l is the number of

channels at layer l, S^{l−1} denotes the number of transformations in G that leave the origin invariant (e.g. 1, 4 or 8 for Z², p4 or p4m filters, respectively), and n is the spatial (or translational) extent of the filter. Note that typically, S^1 = 1 for 2D images, while S^l = 4 or S^l = 8 for l > 1.

The filter transformation L_s amounts to a permutation of the entries of each of the K^l × K^{l−1} scalar-valued filter channels in F. Since we are applying S^l transformations to each filter, the output of this operation is an array of shape K^l × S^l × K^{l−1} × S^{l−1} × n × n, which we call F+.

The permutation can be implemented efficiently by a GPU kernel that does a lookup into F for each output cell of F+, using a precomputed index associated with the output cell. To precompute the indices, we define an invertible map g(s, u, v) that takes an input index (valid for an array of shape S^{l−1} × n × n) and produces the associated group element g as a matrix (sections 4.2 and 4.3). For each input index (s, u, v) and each transformation s′, we compute s̄, ū, v̄ = g⁻¹(g(s′, 0, 0)⁻¹ g(s, u, v)). This index is used to set F+[i, s′, j, s, u, v] = F[i, j, s̄, ū, v̄] for all i, j.

The G-convolution for a new group can be added by simply implementing a map g(·) from indices to matrices.
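A sketch of this precomputation for full p4 filters, assuming numpy, a rotation-only stabilizer, odd n, and filter coordinates centered on the middle pixel so that rotations map the n × n support onto itself (the loop structure mirrors the index recipe above, but the conventions are one possible choice, not the original GPU kernel):

    import numpy as np

    def rot_matrix(s):
        # Homogeneous matrix of the p4 element g(s, 0, 0), a rotation about the origin.
        c = int(round(np.cos(s * np.pi / 2)))
        si = int(round(np.sin(s * np.pi / 2)))
        return np.array([[c, -si, 0], [si, c, 0], [0, 0, 1]])

    def transform_filters_p4(F):
        # Filter transformation for p4 filters F of shape (K_out, K_in, 4, n, n),
        # n odd. Returns F_plus of shape (K_out, 4, K_in, 4, n, n) with
        # F_plus[i, s2, j, s, u, v] = F[i, j, sbar, ubar, vbar], where
        # g(sbar, ubar, vbar) = g(s2, 0, 0)^{-1} g(s, u, v).
        K_out, K_in, S, n, _ = F.shape
        c = n // 2  # center offset: array index (u, v) maps to coordinates (u - c, v - c)
        F_plus = np.empty((K_out, S, K_in, S, n, n), dtype=F.dtype)
        for s2 in range(S):
            R_inv = rot_matrix((-s2) % 4)  # the inverse of a rotation is a rotation
            for s in range(S):
                for u in range(n):
                    for v in range(n):
                        g = rot_matrix(s)
                        g[0, 2], g[1, 2] = u - c, v - c  # g(s, u - c, v - c)
                        h = R_inv @ g                    # g(s2, 0, 0)^{-1} g(s, u, v)
                        sbar = int(round(np.arctan2(h[1, 0], h[1, 1]) / (np.pi / 2))) % 4
                        ubar, vbar = h[0, 2] + c, h[1, 2] + c
                        F_plus[:, s2, :, s, u, v] = F[:, :, sbar, ubar, vbar]
        return F_plus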
7.2. Planar convolution

The second part of the G-convolution algorithm is a planar convolution using the expanded filter bank F+. If S^{l−1} > 1, the sum over X in eq. 18 involves a sum over the stabilizer. This sum can be folded into the sum over feature channels performed by the planar convolution routine by reshaping F+ from K^l × S^l × K^{l−1} × S^{l−1} × n × n to S^l K^l × S^{l−1} K^{l−1} × n × n. The resulting array can be interpreted as a conventional filter bank with S^{l−1} K^{l−1} planar input channels and S^l K^l planar output channels, which can be correlated with the feature maps f (similarly reshaped).
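Combining the two steps, a full p4-conv layer then reduces to one planar correlation over reshaped arrays; a sketch assuming numpy/scipy and the transform_filters_p4 sketch above (the explicit channel loop stands in for a fast batched planar convolution routine):

    import numpy as np
    from scipy.ndimage import correlate

    def p4_conv_layer(f, F):
        # Full p4-correlation of a feature map f of shape (K_in, 4, H, W) with
        # filters F of shape (K_out, K_in, 4, n, n), returning (K_out, 4, H, W).
        K_out, K_in, S, n, _ = F.shape
        F_plus = transform_filters_p4(F)                    # (K_out, 4, K_in, 4, n, n)
        F_flat = F_plus.reshape(K_out * S, K_in * S, n, n)  # fold stabilizer into channels
        f_flat = f.reshape(K_in * S, *f.shape[2:])
        out = np.zeros((K_out * S,) + f.shape[2:])
        for o in range(K_out * S):
            for i in range(K_in * S):
                out[o] += correlate(f_flat[i], F_flat[o, i], mode='constant')
        return out.reshape(K_out, S, *f.shape[2:])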
8. Experiments

8.1. Rotated MNIST

The rotated MNIST dataset (Larochelle et al., 2007) contains 62000 randomly rotated handwritten digits. The dataset is split into training, validation and test sets of size 10000, 2000 and 50000, respectively.

We performed model selection using the validation set, yielding a CNN architecture (Z2CNN) with 7 layers of 3 × 3 convolutions (4 × 4 in the final layer), 20 channels in each layer, relu activation functions, batch normalization, dropout, and max-pooling after layer 2. For optimization, we used the Adam algorithm (Kingma & Ba, 2015). This baseline architecture outperforms the models tested by Larochelle et al. (2007) (when trained on 12k and evaluated on 50k), but does not match the previous state of the art, which uses prior knowledge about rotations (Schmidt & Roth, 2012) (see table 1).

Next, we replaced each convolution by a p4-convolution (eq. 10 and 11), divided the number of filters by √4 = 2 (so as to keep the number of parameters approximately fixed), and added max-pooling over rotations after the last convolution layer. This architecture (P4CNN) was found to perform better without dropout, so we removed it. The P4CNN almost halves the error rate of the previous state of the art (2.28% vs 3.98% error).

We then tested the hypothesis that premature invariance is undesirable in a deep architecture (section 2). We took the Z2CNN and replaced each convolution layer by a p4-convolution (eq. 10) followed by a coset max-pooling over rotations. The resulting feature maps consist of rotation-invariant features, and have the same transformation law as the input image. This network (P4CNNRotationPooling) outperforms the baseline and the previous state of the art, but performs significantly worse than the P4CNN, which does not pool over rotations in intermediate layers.

    Network                    Test Error (%)
    Larochelle et al. (2007)   10.38 ± 0.27
    Sohn & Lee (2012)          4.2
    Schmidt & Roth (2012)      3.98
    Z2CNN                      5.03 ± 0.0020
    P4CNNRotationPooling       3.21 ± 0.0012
    P4CNN                      2.28 ± 0.0004

Table 1. Error rates on rotated MNIST (with standard deviation under variation of the random seed).

8.2. CIFAR-10

The CIFAR-10 dataset consists of 60k images of size 32 × 32, divided into 10 classes. The dataset is split into 40k training, 10k validation and 10k test images.

We compared the p4-, p4m- and standard planar Z² convolutions on two kinds of baseline architectures. Our first baseline is the All-CNN-C architecture by Springenberg et al. (2015), which consists of a sequence of 9 strided and non-strided convolution layers, interspersed with rectified linear activation units, and nothing else. Our second baseline is a residual network (He et al., 2016), which consists of an initial convolution layer, followed by three stages of 2n convolution layers using k_i filters at stage i, followed by a final classification layer (6n + 2 layers in total). The first convolution in each stage i > 1 uses a stride of 2, so the feature map sizes are 32, 16, and 8 for the three stages. We use n = 7 and k_i = 32, 64, 128, yielding a wide 44-layer network called ResNet44.

To evaluate G-CNNs, we replaced all convolution layers of

the baseline architectures by p4 or p4m convolutions. For a constant number of filters, this increases the size of the feature maps 4- or 8-fold, which in turn increases the number of parameters required per filter in the next layer. Hence, we halve the number of filters in each p4-conv layer, and divide it by roughly √8 ≈ 3 in each p4m-conv layer. This way, the number of parameters is left approximately invariant, while the size of the internal representation is increased. Specifically, we used k_i = 11, 23, 45 for p4m-ResNet44.

To evaluate the impact of data augmentation, we compare the networks on CIFAR10 and augmented CIFAR10+. The latter denotes moderate data augmentation with horizontal flips and small translations, following Goodfellow et al. (2013) and many others.

The training procedure for the All-CNN was reproduced as closely as possible from Springenberg et al. (2015). For the ResNets, we used stochastic gradient descent with an initial learning rate of 0.05 and momentum 0.9. The learning rate was divided by 10 at epochs 50, 100 and 150, and training was continued for 300 epochs.

    Network    G     CIFAR10   CIFAR10+   Param.
    All-CNN    Z²    9.44      8.86       1.37M
               p4    8.84      7.67       1.37M
               p4m   7.59      7.04       1.22M
    ResNet44   Z²    9.45      5.61       2.64M
               p4m   6.46      4.94       2.62M

Table 2. Comparison of conventional (i.e. Z²), p4 and p4m CNNs on CIFAR10 and augmented CIFAR10+. Test set error rates and number of parameters are reported.

To the best of our knowledge, the p4m-CNN outperforms all published results on plain CIFAR10 (Wan et al., 2013; Goodfellow et al., 2013; Lin et al., 2014; Lee et al., 2015b; Srivastava et al., 2015; Clevert et al., 2015; Lee et al., 2015a). However, due to radical differences in model sizes and architectures, it is difficult to infer much about the intrinsic merit of the various techniques. It is quite possible that the cited methods would yield better results when deployed in larger networks or in combination with other techniques. Extreme data augmentation and model ensembles can also further improve the numbers (Graham, 2014).

Inspired by the wide ResNets of Zagoruyko & Komodakis (2016), we trained another ResNet with 26 layers and k_i = (71, 142, 248) (for planar convolutions) or k_i = (50, 100, 150) (for p4m convolutions). When trained with moderate data augmentation, this network achieves an error rate of 5.27% using planar convolutions, and 4.19% with p4m convolutions. This result is comparable to the 4.17% error reported by Zagoruyko & Komodakis (2016), but using fewer parameters (7.2M vs 36.5M).

9. Discussion & Future work

Our results show that p4 and p4m convolution layers can be used as a drop-in replacement of standard convolutions that consistently improves the results.

G-CNNs benefit from data augmentation in the same way as convolutional networks, as long as the augmentation comes from a group larger than G. Augmenting with flips and small translations consistently improves the results for the p4 and p4m-CNN.

The CIFAR dataset is not actually symmetric, since objects typically appear upright. Nevertheless, we see substantial increases in accuracy on this dataset, indicating that there need not be a full symmetry for G-convolutions to be beneficial.

In future work, we want to implement G-CNNs that work on hexagonal lattices, which have an increased number of symmetries relative to square grids, as well as G-CNNs for 3D space groups. All of the theory presented in this paper is directly applicable to these groups, and the G-convolution can be implemented in such a way that new groups can be added by simply specifying the group operation and a bijective map between the group and the set of indices.

One limitation of the method as presented here is that it only works for discrete groups. Convolution on continuous (locally compact) groups is mathematically well-defined, but may be hard to approximate in an equivariant manner. A further challenge, already identified by Gens & Domingos (2014), is that a full enumeration of the transformations in a group may not be feasible if the group is large.

Finally, we hope that the current work can serve as a concrete example of the general philosophy of "structured representations", outlined in section 2. We believe that adding mathematical structure to a representation (making sure that maps between representations preserve this structure) could enhance the ability of neural nets to see abstract similarities between superficially different concepts.

10. Conclusion

We have introduced G-CNNs, a generalization of convolutional networks that substantially increases the expressive capacity of a network without increasing the number of parameters. By exploiting symmetries, G-CNNs achieve state of the art results on rotated MNIST and CIFAR10. We have developed the general theory of G-CNNs for discrete groups, showing that all layer types are equivariant to the action of the chosen group G. Our experimental results show that G-convolutions can be used as a drop-in replacement for spatial convolutions in modern network architectures, improving their performance without further tuning.

Acknowledgements

We would like to thank Joan Bruna, Sander Dieleman, Robert Gens, Chris Olah, and Stefano Soatto for helpful discussions. This research was supported by NWO (grant number NAI.14.108), Google and Facebook.

References

Agrawal, P., Carreira, J., and Malik, J. Learning to See by Moving. In International Conference on Computer Vision (ICCV), 2015.

Anselmi, F., Leibo, J. Z., Rosasco, L., Mutch, J., Tacchetti, A., and Poggio, T. Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning? Technical Report 001, MIT Center for Brains, Minds and Machines, 2014.

Anselmi, F., Rosasco, L., and Poggio, T. On Invariance and Selectivity in Representation Learning. Technical report, MIT Center for Brains, Minds and Machines, 2015.

Bruna, J. and Mallat, S. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(8):1872–1886, 2013.

Clevert, D., Unterthiner, T., and Hochreiter, S. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289v3, 2015.

Cohen, T. and Welling, M. Learning the Irreducible Representations of Commutative Lie Groups. In Proceedings of the 31st International Conference on Machine Learning (ICML), volume 31, pp. 1755–1763, 2014.

Cohen, T. S. and Welling, M. Transformation Properties of Learned Visual Representations. International Conference on Learning Representations (ICLR), 2015.

Dieleman, S., Willett, K. W., and Dambre, J. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly Notices of the Royal Astronomical Society, 450(2), 2015.

Dieleman, S., De Fauw, J., and Kavukcuoglu, K. Exploiting Cyclic Symmetry in Convolutional Neural Networks. In International Conference on Machine Learning (ICML), 2016.

Gens, R. and Domingos, P. Deep Symmetry Networks. In Advances in Neural Information Processing Systems (NIPS), 2014.

Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout Networks. In Proceedings of the 30th International Conference on Machine Learning (ICML), pp. 1319–1327, 2013.

Graham, B. Fractional Max-Pooling. arXiv:1412.6071, 2014.

He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. arXiv:1512.03385, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Identity Mappings in Deep Residual Networks. arXiv:1603.05027, 2016.

Hinton, G. E., Krizhevsky, A., and Wang, S. D. Transforming auto-encoders. ICANN-11: International Conference on Artificial Neural Networks, Helsinki, 2011.

Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167v3, 2015.

Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K. Spatial Transformer Networks. In Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015.

Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Kivinen, J. J. and Williams, C. K. I. Transformation equivariant Boltzmann machines. In 21st International Conference on Artificial Neural Networks, 2011.

Kondor, R. A novel set of rotationally and translationally invariant features for images based on the non-commutative bispectrum. arXiv:0701127, 2007.

Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. An empirical evaluation of deep architectures on problems with many factors of variation. Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.

Lavin, A. and Gray, S. Fast Algorithms for Convolutional Neural Networks. arXiv:1509.09308, 2015.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.

Lee, C., Gallagher, P. W., and Tu, Z. Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree. arXiv:1509.08985, 2015a.

Lee, C., Xie, S., Gallagher, P. W., Zhang, Z., and Tu, Z. Deeply-Supervised Nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 38, pp. 562–570, 2015b.

Lenc, K. and Vedaldi, A. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Lin, M., Chen, Q., and Yan, S. Network In Network. International Conference on Learning Representations (ICLR), 2014.

Lowe, D. G. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

Manay, S., Cremers, D., Hong, B. W., Yezzi, A. J., and Soatto, S. Integral invariants for shape matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1602–1617, 2006.

Mathieu, M., Henaff, M., and LeCun, Y. Fast Training of Convolutional Networks through FFTs. In International Conference on Learning Representations (ICLR), 2014.

Oyallon, E. and Mallat, S. Deep Roto-Translation Scattering for Object Classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2865–2873, 2015.

Reisert, M. Group Integration Techniques in Pattern Analysis. PhD thesis, Albert-Ludwigs-University, 2008.

Schmidt, U. and Roth, S. Learning rotation-aware features: From invariant priors to equivariant descriptors. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

Sifre, L. and Mallat, S. Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

Skibbe, H. Spherical Tensor Algebra for Biomedical Image Analysis. PhD thesis, Albert-Ludwigs-Universität Freiburg im Breisgau, 2013.

Sohn, K. and Lee, H. Learning Invariant Representations with Local Transformations. Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for Simplicity: The All Convolutional Net. Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Srivastava, R. K., Greff, K., and Schmidhuber, J. Training Very Deep Networks. Advances in Neural Information Processing Systems (NIPS), 2015.

Vasilache, N., Johnson, J., Mathieu, M., Chintala, S., Piantino, S., and LeCun, Y. Fast convolutional nets with fbfft: A GPU performance evaluation. In International Conference on Learning Representations (ICLR), 2015.

Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. Regularization of neural networks using dropconnect. International Conference on Machine Learning (ICML), pp. 109–111, 2013.

Zagoruyko, S. and Komodakis, N. Wide Residual Networks. arXiv:1605.07146, 2016.

Zhang, C., Voinea, S., Evangelopoulos, G., Rosasco, L., and Poggio, T. Discriminative template learning in group-convolutional networks for invariant speech representations. InterSpeech, pp. 3229–3233, 2015.

Appendix A: Equivariance Derivations

We claim in the paper that planar correlation is not equivariant to rotations. Let f : R² → R^K be an image with K channels, and let ψ : R² → R^K be a filter. Take a rotation r about the origin. The ordinary planar correlation ⋆ is not equivariant to rotations, i.e., [L_r f] ⋆ ψ ≠ L_r[f ⋆ ψ]. Instead we have:

    [[L_r f] ⋆ ψ](x) = Σ_{y∈Z²} Σ_k [L_r f_k](y) ψ_k(y − x)
                     = Σ_{y∈Z²} Σ_k f_k(r⁻¹y) ψ_k(y − x)
                     = Σ_{y∈Z²} Σ_k f_k(y) ψ_k(ry − x)
                     = Σ_{y∈Z²} Σ_k f_k(y) ψ_k(r(y − r⁻¹x))
                     = Σ_{y∈Z²} Σ_k f_k(y) [L_{r⁻¹}ψ_k](y − r⁻¹x)
                     = [f ⋆ [L_{r⁻¹}ψ]](r⁻¹x)
                     = [L_r[f ⋆ [L_{r⁻¹}ψ]]](x)    (19)

Line by line, we used the following definitions, facts and manipulations:

1. The definition of the correlation ⋆.
2. The definition of L_r, i.e. L_r f(x) = f(r⁻¹x).
3. The substitution y → ry, which does not change the summation bounds since rotation is a symmetry of the sampling grid Z².
4. Distributivity.
5. The definition of L_r.
6. The definition of the correlation ⋆.
7. The definition of L_r.

A visual proof can be found in (Dieleman et al., 2016).

Using a similar line of reasoning, we can show that pooling commutes with the group action:

    P L_h f(g) = max_{k∈gU} L_h f(k)
               = max_{k∈gU} f(h⁻¹k)
               = max_{hk∈gU} f(k)
               = max_{k∈h⁻¹gU} f(k)
               = P f(h⁻¹g)
               = L_h P f(g)    (20)

Appendix B: Gradients

To train a G-CNN, we need to compute gradients of a loss function with respect to the parameters of the filters. If we use the fast algorithm explained in section 7 of the main paper, we only have to implement the gradient of the indexing operation (section 7.1, "filter transformation"), because the 2D convolution routine and its gradient are given.

This gradient is computed as follows. The gradient of the loss with respect to cell i in the input of the indexing operation is the sum of the gradients of the output cells j that index cell i. On current GPU hardware, this can be implemented efficiently using a kernel that is instantiated for each cell j in the output array. The kernel adds the value of the gradient of the loss with respect to cell j to cell i of the array that holds the gradient of the loss with respect to the input of the indexing operation (this array is to be initialized at zero). Since multiple kernels write to the same cell i, the additions must be done using atomic operations to avoid concurrency problems.

Alternatively, one could implement the filter transformation using a precomputed permutation matrix. This is not as efficient, but the gradient is trivial, and most computation graph / deep learning packages will have implemented the matrix multiplication and its gradient.
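In numpy this scatter-add pattern can be expressed with np.add.at; a minimal sketch, assuming a precomputed flat index array idx such that the forward pass is output = F_flat[idx] (np.add.at plays the role of the atomic additions on a GPU):

    import numpy as np

    def index_op(F_flat, idx):
        # Forward pass: gather, output[j] = F_flat[idx[j]].
        return F_flat[idx]

    def index_op_grad(grad_output, idx, input_size):
        # Backward pass: scatter-add. Each input cell i accumulates the
        # gradients of all output cells j with idx[j] == i; np.add.at
        # handles repeated indices correctly.
        grad_input = np.zeros(input_size, dtype=grad_output.dtype)
        np.add.at(grad_input, idx, grad_output)
        return grad_input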
Appendix C: G-conv calculus

Although the gradient of the filter transformation operation is all that is needed to do backpropagation in a G-CNN for a split group G, it is instructive to derive the analytical gradients of the G-correlation operation. This leads to an elegant "G-conv calculus", included here for the interested reader.

Let feature map k at layer l be denoted f_k^l = f^{l−1} ⋆ ψ^{lk}, where f^{l−1} is the stack of feature maps in the previous layer. At some point in the backprop algorithm, we will have computed the derivative ∂L/∂f_k^l for all k, and we need to compute ∂L/∂f_j^{l−1} (to backpropagate to lower layers) as well as ∂L/∂ψ_j^{lk} (to update the parameters). We find that

    ∂L/∂f_j^{l−1}(g) = Σ_{h,k} [∂L/∂f_k^l(h)] [∂f_k^l(h)/∂f_j^{l−1}(g)]
                     = Σ_{h,k} [∂L/∂f_k^l(h)] ∂/∂f_j^{l−1}(g) [ Σ_{h′,k′} f_{k′}^{l−1}(h′) ψ_{k′}^{lk}(h⁻¹h′) ]
                     = Σ_{h,k} [∂L/∂f_k^l(h)] ψ_j^{lk}(h⁻¹g)
                     = [(∂L/∂f^l) ⋆ ψ_j^{l*}](g)    (21)

where the superscript * denotes the involution

    ψ*(g) = ψ(g⁻¹),    (22)

and ψ_j^l is the set of filter components applied to input feature map j at layer l:

    ψ_j^l(g) = (ψ_j^{l1}(g), ..., ψ_j^{lK^l}(g))    (23)

To compute the gradient with respect to component j of filter k, we have to G-convolve the j-th input feature map with the k-th output feature map:

    ∂L/∂ψ_j^{lk}(g) = Σ_h [∂L/∂f_k^l(h)] [∂f_k^l(h)/∂ψ_j^{lk}(g)]
                    = Σ_h [∂L/∂f_k^l(h)] ∂/∂ψ_j^{lk}(g) [ Σ_{h′,k′} f_{k′}^{l−1}(h′) ψ_{k′}^{lk}(h⁻¹h′) ]
                    = Σ_h [∂L/∂f_k^l(h)] f_j^{l−1}(hg)
                    = [(∂L/∂f_k^l) ∗ f_j^{l−1}](g)    (24)

So we see that both the forward and backward passes involve convolution or correlation operations, as is the case in standard convnets.
