4.1. Symmetry Groups

A symmetry of an object is a transformation that leaves the object invariant. For example, if we take the sampling grid of our image, Z², and flip it over we get −Z² = {(−n, −m) | (n, m) ∈ Z²} = Z². So the flipping operation is a symmetry of the sampling grid.

If we have two symmetry transformations g and h and we compose them, the result gh is another symmetry transformation (i.e. it leaves the object invariant as well). Furthermore, the inverse g⁻¹ of any symmetry is also a symmetry, and composing it with g gives the identity transformation e. A set of transformations with these properties is called a symmetry group.

One simple example of a group is the set of 2D integer translations, Z². Here the group operation ("composition of transformations") is addition: (n, m) + (p, q) = (n + p, m + q). One can verify that the sum of two translations is again a translation, and that the inverse (negative) of a translation is a translation, so this is indeed a group.

Although it may seem fancy to call 2-tuples of integers a group, this is helpful in our case because, as we will see in section 6, a useful notion of convolution can be defined for functions on any group¹, of which Z² is only one example. The important properties of the convolution, such as equivariance, arise primarily from the group structure.

¹At least, on any locally compact group.

4.2. The group p4

The group p4 consists of all compositions of translations and rotations by 90 degrees about any center of rotation in a square grid. A convenient parameterization of this group in terms of three integers r, u, v is

    g(r, u, v) = [[cos(rπ/2), −sin(rπ/2), u],
                  [sin(rπ/2),  cos(rπ/2), v],
                  [0,          0,         1]],          (2)

where 0 ≤ r < 4 and (u, v) ∈ Z². The group operation is given by matrix multiplication.

The composition and inversion operations could also be represented directly in terms of the integers (r, u, v), but the matrix representation is more convenient. Using homogeneous coordinates, a group element g acts on a pixel coordinate x = (u′, v′) by matrix-vector multiplication:

    gx ≃ [[cos(rπ/2), −sin(rπ/2), u],
          [sin(rπ/2),  cos(rπ/2), v],
          [0,          0,         1]] [u′, v′, 1]ᵀ      (3)

4.3. The group p4m

The group p4m consists of all compositions of translations, mirror reflections, and rotations by 90 degrees about any center of rotation in the grid. Like p4, we can parameterize this group by integers:

    g(m, r, u, v) = [[(−1)ᵐ cos(rπ/2), −(−1)ᵐ sin(rπ/2), u],
                     [sin(rπ/2),        cos(rπ/2),        v],
                     [0,                0,                1]],

where m ∈ {0, 1}, 0 ≤ r < 4 and (u, v) ∈ Z². The reader may verify that this is indeed a group.

Again, composition is most easily performed using the matrix representation. Computing r, u, v from a given matrix g can be done using the same method we use for p4, and for m we have m = ½(1 − det(g)).
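To make this parameterization concrete, the following NumPy sketch (our own illustration, not code from the paper) builds the matrices of eq. 2 and its p4m extension, composes two elements by matrix multiplication, and recovers the integer parameters from a matrix using m = ½(1 − det g); the function names are ours.

```python
import numpy as np

def p4m_matrix(m, r, u, v):
    """Homogeneous 3x3 matrix of the p4m element g(m, r, u, v).

    m in {0, 1} is the mirror component, r in {0, 1, 2, 3} the rotation by
    r*90 degrees, and (u, v) the integer translation; m = 0 gives eq. 2.
    """
    c, s = np.cos(r * np.pi / 2), np.sin(r * np.pi / 2)
    return np.round(np.array([
        [(-1) ** m * c, -(-1) ** m * s, u],
        [s,              c,             v],
        [0,              0,             1],
    ])).astype(int)

def p4m_params(g):
    """Recover (m, r, u, v) from a p4m matrix g."""
    m = int(round((1 - np.linalg.det(g)) / 2))                      # m = (1 - det g) / 2
    r = int(round(np.arctan2(g[1, 0], g[1, 1]) / (np.pi / 2))) % 4  # second row holds sin, cos
    return m, r, int(g[0, 2]), int(g[1, 2])

# Composition is matrix multiplication; the product is again a p4m matrix.
g = p4m_matrix(0, 1, 2, 0)   # rotate by 90 degrees, then translate by (2, 0)
h = p4m_matrix(1, 0, 0, 3)   # mirror, then translate by (0, 3)
print(p4m_params(g @ h))     # integer parameters of the composite element gh
```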
4.4. Functions on groups

We model images and stacks of feature maps in a conventional CNN as functions f : Z² → R^K supported on a bounded (typically rectangular) domain. At each pixel coordinate (p, q) ∈ Z², the stack of feature maps returns a K-dimensional vector f(p, q), where K denotes the number of channels.

Although the feature maps must always be stored in finite arrays, modeling them as functions that extend to infinity (while being non-zero on a finite region only) simplifies the mathematical analysis of CNNs.

We will be concerned with transformations of the feature maps, so we introduce the following notation for a transformation g acting on a set of feature maps:

    [L_g f](x) = [f ◦ g⁻¹](x) = f(g⁻¹x)          (4)

Computationally, this says that to get the value of the g-transformed feature map L_g f at the point x, we need to do a lookup in the original feature map f at the point g⁻¹x, which is the unique point that gets mapped to x by g.
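In code, eq. 4 is just an indexed lookup. The sketch below is our illustration (the names and the choice to store f on a square coordinate patch centered on the origin are ours): it computes L_g f for a 90-degree rotation by reading f at g⁻¹x for every output coordinate x, returning 0 wherever g⁻¹x falls outside the stored patch, in line with the convention that f vanishes outside a finite region.

```python
import numpy as np

def rotation_matrix(r):
    """Homogeneous matrix of a rotation by r*90 degrees about the origin (eq. 2 with u = v = 0)."""
    c, s = int(round(np.cos(r * np.pi / 2))), int(round(np.sin(r * np.pi / 2)))
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def transform_feature_map(f, g, N):
    """[L_g f](x) = f(g^{-1} x) for a scalar map f stored on the patch {-N,...,N}^2.

    Array index = coordinate + N; coordinates that g^{-1} sends outside the
    patch are treated as zero.
    """
    g_inv = np.round(np.linalg.inv(g)).astype(int)
    out = np.zeros_like(f)
    for u in range(-N, N + 1):
        for v in range(-N, N + 1):
            uu, vv, _ = g_inv @ np.array([u, v, 1])       # the point g^{-1} x
            if -N <= uu <= N and -N <= vv <= N:
                out[u + N, v + N] = f[uu + N, vv + N]     # lookup in the original map
    return out

N = 2
f = np.arange(25.0).reshape(5, 5)                  # toy feature map on {-2,...,2}^2
rotated = transform_feature_map(f, rotation_matrix(1), N)
```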
When we apply the 90 degree rotation r to a function on p4, each planar patch follows its red r-arrow (thus incrementing the rotation coordinate by 1 (mod 4)), and simultaneously undergoes a 90-degree rotation.

A layer in a conventional CNN takes as input a stack of feature maps f : Z² → R^{K^l} and convolves or correlates it with a set of K^{l+1} filters ψ^i : Z² → R^{K^l}:

    [f ⋆ ψ^i](x) = Σ_{y∈Z²} Σ_k f_k(y) ψ^i_k(y − x).          (7)

A translation followed by a correlation is the same as a correlation followed by a translation:

    [[L_t f] ⋆ ψ](x) = Σ_y f(y − t) ψ(y − x)
                     = Σ_y f(y) ψ(y + t − x)
                     = Σ_y f(y) ψ(y − (x − t))
                     = [L_t [f ⋆ ψ]](x).                       (8)

And so we say that "correlation is an equivariant map for the translation group", or that "correlation and translation commute". Using an analogous computation one can show that the same holds for the convolution: [L_t f] ∗ ψ = L_t[f ∗ ψ].
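The identity of eq. 8 is easy to verify numerically. The snippet below (ours, not from the paper) does so on a periodic grid, using wrap-around shifts and circular correlation so that the equality is exact on a finite array; on the infinite grid assumed in the text no wrapping is needed.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
f = rng.standard_normal((9, 9))        # a toy single-channel image
psi = rng.standard_normal((3, 3))      # a toy filter
t = (2, 3)                             # an integer translation t

def shift(x, t):
    """[L_t x](p) = x(p - t) on a periodic grid (np.roll wraps around)."""
    return np.roll(x, shift=t, axis=(0, 1))

def corr(f, psi):
    """Planar cross-correlation of eq. 7, computed circularly."""
    return correlate2d(f, psi, mode='same', boundary='wrap')

lhs = corr(shift(f, t), psi)           # translate, then correlate
rhs = shift(corr(f, psi), t)           # correlate, then translate
print(np.allclose(lhs, rhs))           # True: correlation and translation commute
```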
Although convolutions are equivariant to translation, they are not equivariant to other isometries of the sampling lattice. For instance, as shown in the supplementary material, rotating the image and then convolving with a fixed filter is not the same as first convolving and then rotating the result:

    [[L_r f] ⋆ ψ](x) = L_r[f ⋆ [L_{r⁻¹} ψ]](x)                (9)

In words, this says that the correlation of a rotated image L_r f with a filter ψ is the same as the rotation by r of the original image f convolved with the inverse-rotated filter L_{r⁻¹}ψ. Hence, if an ordinary CNN learns rotated copies of the same filter, the stack of feature maps is equivariant, although individual feature maps are not.

6. Group Equivariant Networks

In this section we will define the three layers used in a G-CNN (G-convolution, G-pooling, nonlinearity) and show that each one commutes with G-transformations of the domain of the image.

6.1. G-Equivariant correlation

The correlation (eq. 7) is computed by shifting a filter and then computing a dot product with the feature maps. By replacing the shift by a more general transformation from some group G, we get the G-correlation used in the first layer of a G-CNN:

    [f ⋆ ψ](g) = Σ_{y∈Z²} Σ_k f_k(y) ψ_k(g⁻¹y).              (10)

Notice that both the input image f and the filter ψ are functions of the plane Z², but the feature map f ⋆ ψ is a function on the discrete group G (which may contain translations as a subgroup). Hence, for all layers after the first, the filters ψ must also be functions on G, and the correlation operation becomes

    [f ⋆ ψ](g) = Σ_{h∈G} Σ_k f_k(h) ψ_k(g⁻¹h).               (11)
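For G = p4, eq. 10 can be evaluated with nothing more than filter rotations and an ordinary planar correlation, which is also the idea behind the implementation in section 7. Below is a minimal single-channel NumPy sketch of ours: it ignores the sum over input channels k, rotates the filter about its own center (a convention choice for odd filter sizes that only re-indexes the translation component), and is meant as an illustration rather than as the paper's implementation.

```python
import numpy as np
from scipy.signal import correlate2d

def p4_corr_first_layer(f, psi):
    """First-layer p4-correlation (eq. 10) for a single input channel.

    f   : (H, W) image on Z^2
    psi : (n, n) filter on Z^2, with n odd so the filter center is a pixel
    Returns a feature map on p4, stored as an array of shape (4, H, W):
    plane r holds [f * psi](g) for all g whose rotation component is r.
    """
    assert psi.shape[0] == psi.shape[1] and psi.shape[0] % 2 == 1
    out = np.empty((4,) + f.shape)
    for r in range(4):
        # L_s psi for the rotation s by r*90 degrees: rotate the filter.
        psi_r = np.rot90(psi, k=r)
        # A planar correlation with the rotated filter fills in the values
        # of the p4 feature map at every translation, for this rotation r.
        out[r] = correlate2d(f, psi_r, mode='same')
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))
filt = rng.standard_normal((3, 3))
fmap = p4_corr_first_layer(image, filt)    # shape (4, 28, 28): a function on p4
```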
The equivariance of this operation is derived in complete analogy to eq. 8, now using the substitution h → uh:

    [[L_u f] ⋆ ψ](g) = Σ_{h∈G} Σ_k f_k(u⁻¹h) ψ_k(g⁻¹h)
                     = Σ_{h∈G} Σ_k f_k(h) ψ_k(g⁻¹uh)
                     = Σ_{h∈G} Σ_k f_k(h) ψ_k((u⁻¹g)⁻¹h)
                     = [L_u[f ⋆ ψ]](g)                        (12)

The equivariance of eq. 10 is derived similarly. Note that although equivariance is expressed by the same formula [L_u f] ⋆ ψ = L_u[f ⋆ ψ] for both the first-layer G-correlation (eq. 10) and the full G-correlation (eq. 11), the meaning of the operator L_u is different: for the first-layer correlation, the inputs f and ψ are functions on Z², so L_u f denotes the transformation of such a function, while L_u[f ⋆ ψ] denotes the transformation of the feature map, which is a function on G. For the full G-correlation, both the inputs f and ψ and the output f ⋆ ψ are functions on G.

Note that if G is not commutative, neither the G-convolution nor the G-correlation is commutative. However, the feature maps ψ ⋆ f and f ⋆ ψ are related by the involution (eq. 6):

    f ⋆ ψ = (ψ ⋆ f)∗.                                        (13)

Since the involution is invertible (it is its own inverse), the information content of f ⋆ ψ and ψ ⋆ f is the same. However, f ⋆ ψ is more efficient to compute when using the method described in section 7, because transforming a small filter is faster than transforming a large feature map.

It is customary to add a bias term to each feature map in a convolution layer. This can be done for G-conv layers as well, as long as there is only one bias per G-feature map (instead of one bias per spatial feature plane within a G-feature map). Similarly, batch normalization (Ioffe & Szegedy, 2015) should be implemented with a single scale and bias parameter per G-feature map in order to preserve equivariance. The sum of two G-equivariant feature maps is also G-equivariant, so G-conv layers can be used in highway networks and residual networks (Srivastava et al., 2015; He et al., 2015).

6.2. Pointwise nonlinearities

Equation 12 shows that G-correlation preserves the transformation properties of the previous layer. What about nonlinearities and pooling?

Recall that we think of feature maps as functions on G. In this view, applying a nonlinearity ν : R → R to a feature map amounts to function composition. We introduce the
composition operator

    [C_ν f](g) = [ν ◦ f](g) = ν(f(g)),                        (14)

which acts on functions by post-composing them with ν.

Since the left transformation operator L acts by pre-composition, C and L commute:

    C_ν L_h f = ν ◦ [f ◦ h⁻¹] = [ν ◦ f] ◦ h⁻¹ = L_h C_ν f,    (15)

so the rectified feature map inherits the transformation properties of the previous layer.

6.3. Subgroup pooling and coset pooling

In order to simplify the analysis, we split the pooling operation into two steps: the pooling itself (performed without stride), and a subsampling step. The non-strided max-pooling operation applied to a feature map f : G → R can be modeled as an operator P that acts on f as

    [P f](g) = max_{k ∈ gU} f(k),                             (16)

where gU = {gu | u ∈ U} is the g-transformation of some pooling domain U ⊂ G (typically a neighborhood of the identity transformation). In a regular convnet, U is usually a 2 × 2 or 3 × 3 square including the origin (0, 0), and g is a translation.

As shown in the supplementary material, pooling commutes with L_h:

    P L_h = L_h P.                                            (17)

Since pooling tends to reduce the variation in a feature map, it makes sense to sub-sample the pooled feature map, or equivalently, to do a "pooling with stride". In a G-CNN, the notion of "stride" is generalized by subsampling on a subgroup H ⊂ G, that is, a subset of G that is itself a group (i.e. closed under multiplication and inverses). The subsampled feature map is then equivariant to H but not to G.

In a standard convnet, pooling with stride 2 is the same as pooling and then subsampling on H = {(2i, 2j) | (i, j) ∈ Z²}, which is a subgroup of G = Z². For the p4-CNN, we may subsample on the subgroup H containing all 4 rotations, as well as shifts by multiples of 2 pixels.

We can obtain full G-equivariance by choosing our pooling region U to be a subgroup H ⊂ G. The pooling domains gH that result are called cosets in group theory. The cosets partition the group into non-overlapping regions. The feature map that results from pooling over cosets is invariant to the right-action of H, because the cosets are similarly invariant (ghH = gH). Hence, we can arbitrarily choose one coset representative per coset to subsample on. The feature map that results from coset pooling may be thought of as a function on the quotient space G/H, in which two transformations are considered equivalent if they are related by a transformation in H.

As an example, in a p4 feature map, we can pool over all four rotations at each spatial position (the cosets of the subgroup R of rotations around the origin). The resulting feature map is a function on Z² ≅ p4/R, i.e. it will transform in the same way as the input image. Another example is given by a feature map on Z, where we could pool over the cosets of the subgroup nZ of shifts by multiples of n. This gives a feature map on Z/nZ, which has a cyclic transformation law under translations.
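For a p4 feature map stored as an array with a leading rotation axis of length four, both operations reduce to familiar array manipulations. The sketch below is our illustration (the array layout and names are our own choices): coset max-pooling over the rotation cosets p4/R, and a strided spatial pooling that corresponds to subsampling on the subgroup of rotations and even translations.

```python
import numpy as np

def coset_pool_rotations(fmap):
    """Coset max-pooling over the rotation subgroup R.

    fmap has shape (4, H, W): one plane per rotation of a p4 feature map.
    Pooling over the four rotations at each spatial position yields a map
    on Z^2 ~ p4/R, which transforms in the same way as the input image.
    """
    return fmap.max(axis=0)

def subgroup_pool(fmap, stride=2):
    """Spatial 2x2 max-pooling followed by subsampling.

    The result is again a p4 feature map, now equivariant only to the
    subgroup H of rotations and shifts by multiples of `stride` (H and W
    are assumed divisible by `stride`).
    """
    s = stride
    c, h, w = fmap.shape
    return fmap.reshape(c, h // s, s, w // s, s).max(axis=(2, 4))

p4_fmap = np.random.default_rng(0).standard_normal((4, 28, 28))
invariant = coset_pool_rotations(p4_fmap)   # shape (28, 28)
strided = subgroup_pool(p4_fmap)            # shape (4, 14, 14)
```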
This concludes our analysis of G-CNNs. Since all layer types are equivariant, we can freely stack them into deep networks and expect G-conv parameter sharing to be effective at arbitrary depth.

7. Efficient Implementation

Computing the G-convolution involves nothing more than indexing arithmetic and inner products, so it can be implemented straightforwardly. Here we present the details for a G-convolution implementation that can leverage recent advances in fast computation of planar convolutions (Mathieu et al., 2014; Vasilache et al., 2015; Lavin & Gray, 2015).

A plane symmetry group G is called split if any transformation g ∈ G can be decomposed into a translation t ∈ Z² and a transformation s in the stabilizer of the origin (i.e. s leaves the origin invariant). For the group p4, we can write g = ts for t a translation and s a rotation about the origin, while p4m splits into translations and rotation-flips. Using this split of G and the fact that L_g L_h = L_{gh}, we can rewrite the G-correlation (eq. 10 and 11) as follows:

    [f ⋆ ψ](ts) = Σ_{h∈X} Σ_k f_k(h) [L_t[L_s ψ_k]](h),       (18)

where X = Z² in layer one and X = G in further layers.

Thus, to compute the p4 (or p4m) correlation f ⋆ ψ we can first compute L_s ψ ("filter transformation") for all four rotations (or all eight rotation-flips) and then call a fast planar correlation routine on f and the augmented filter bank.

The computational cost of the algorithm presented here is roughly equal to that of a planar convolution with a filter bank that is the same size as the augmented filter bank used in the G-convolution, because the cost of the filter transformation is negligible.

7.1. Filter transformation

The set of filters at layer l is stored in an array F[·] of shape K^l × K^{l−1} × S^{l−1} × n × n, where K^l is the number of
channels at layer l, S^{l−1} denotes the number of transformations in G that leave the origin invariant (e.g. 1, 4 or 8 for Z², p4 or p4m filters, respectively), and n is the spatial (or translational) extent of the filter. Note that typically S^1 = 1 for 2D images, while S^l = 4 or S^l = 8 for l > 1.

The filter transformation L_s amounts to a permutation of the entries of each of the K^l × K^{l−1} scalar-valued filter channels in F. Since we are applying S^l transformations to each filter, the output of this operation is an array of shape K^l × S^l × K^{l−1} × S^{l−1} × n × n, which we call F⁺.

The permutation can be implemented efficiently by a GPU kernel that does a lookup into F for each output cell of F⁺, using a precomputed index associated with the output cell. To precompute the indices, we define an invertible map g(s, u, v) that takes an input index (valid for an array of shape S^{l−1} × n × n) and produces the associated group element g as a matrix (sections 4.2 and 4.3). For each input index (s, u, v) and each transformation s′, we compute (s̄, ū, v̄) = g⁻¹(g(s′, 0, 0)⁻¹ g(s, u, v)). This index is used to set F⁺[i, s′, j, s, u, v] = F[i, j, s̄, ū, v̄] for all i, j.

The G-convolution for a new group can be added by simply implementing a map g(·) from indices to matrices.

7.2. Planar convolution

The second part of the G-convolution algorithm is a planar convolution using the expanded filter bank F⁺. If S^{l−1} > 1, the sum over X in eq. 18 involves a sum over the stabilizer. This sum can be folded into the sum over feature channels performed by the planar convolution routine by reshaping F⁺ from K^l × S^l × K^{l−1} × S^{l−1} × n × n to S^l K^l × S^{l−1} K^{l−1} × n × n. The resulting array can be interpreted as a conventional filter bank with S^{l−1} K^{l−1} planar input channels and S^l K^l planar output channels, which can be correlated with the feature maps f (similarly reshaped).
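The index precomputation of section 7.1 and the reshape of section 7.2 can be written compactly in NumPy. The sketch below is ours: it handles the p4 → p4 case only (the first layer, with S^{l−1} = 1, needs only the spatial rotation of the filter), measures the spatial filter index from the filter center so that rotations stay inside the n × n support (a convention we adopt here), and uses plain Python loops where a real implementation would precompute the index array once and apply it in a GPU kernel.

```python
import numpy as np

def rot(s):
    """Homogeneous matrix of a rotation by s*90 degrees about the origin."""
    c, si = int(round(np.cos(s * np.pi / 2))), int(round(np.sin(s * np.pi / 2)))
    return np.array([[c, -si, 0], [si, c, 0], [0, 0, 1]])

def index_to_matrix(s, u, v, n):
    """Map a filter index (s, u, v) to a p4 element; the spatial index is
    measured from the center of the n x n filter (n odd)."""
    h = (n - 1) // 2
    t = np.array([[1, 0, u - h], [0, 1, v - h], [0, 0, 1]])
    return t @ rot(s)

def matrix_to_index(g, n):
    """Inverse of index_to_matrix."""
    h = (n - 1) // 2
    s = int(round(np.arctan2(g[1, 0], g[1, 1]) / (np.pi / 2))) % 4
    return s, int(round(g[0, 2])) + h, int(round(g[1, 2])) + h

def transform_filters(F):
    """Filter transformation (section 7.1) for a p4 -> p4 layer.

    F has shape (K_out, K_in, 4, n, n).  F_plus[i, s', j, s, u, v] is set to
    F[i, j, s_bar, u_bar, v_bar] using the precomputed index of the text, and
    the result is reshaped into an ordinary planar filter bank of shape
    (4*K_out, 4*K_in, n, n) as in section 7.2.  (The ordering inside the
    merged axes only has to match how the feature maps are reshaped.)
    """
    K_out, K_in, S, n, _ = F.shape
    h = (n - 1) // 2
    F_plus = np.empty((K_out, S, K_in, S, n, n), dtype=F.dtype)
    for s_prime in range(S):
        g_prime_inv = np.linalg.inv(index_to_matrix(s_prime, h, h, n))   # g(s', 0, 0)^-1
        for s in range(S):
            for u in range(n):
                for v in range(n):
                    sb, ub, vb = matrix_to_index(g_prime_inv @ index_to_matrix(s, u, v, n), n)
                    F_plus[:, s_prime, :, s, u, v] = F[:, :, sb, ub, vb]
    return F_plus.reshape(S * K_out, S * K_in, n, n)

F = np.random.default_rng(0).standard_normal((8, 4, 4, 3, 3))  # K_out=8, K_in=4, S=4, n=3
bank = transform_filters(F)                                    # shape (32, 16, 3, 3)
```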
8. Experiments

8.1. Rotated MNIST

The rotated MNIST dataset (Larochelle et al., 2007) contains 62000 randomly rotated handwritten digits. The dataset is split into training, validation and test sets of size 10000, 2000 and 50000, respectively.

We performed model selection using the validation set, yielding a CNN architecture (Z2CNN) with 7 layers of 3 × 3 convolutions (4 × 4 in the final layer), 20 channels in each layer, relu activation functions, batch normalization, dropout, and max-pooling after layer 2. For optimization, we used the Adam algorithm (Kingma & Ba, 2015). This baseline architecture outperforms the models tested by Larochelle et al. (2007) (when trained on 12k and evaluated on 50k), but does not match the previous state of the art, which uses prior knowledge about rotations (Schmidt & Roth, 2012) (see table 1).

Next, we replaced each convolution by a p4-convolution (eq. 10 and 11), divided the number of filters by √4 = 2 (so as to keep the number of parameters approximately fixed), and added max-pooling over rotations after the last convolution layer. This architecture (P4CNN) was found to perform better without dropout, so we removed it. The P4CNN almost halves the error rate of the previous state of the art (2.28% vs 3.98% error).

We then tested the hypothesis that premature invariance is undesirable in a deep architecture (section 2). We took the Z2CNN and replaced each convolution layer by a p4-convolution (eq. 10) followed by a coset max-pooling over rotations. The resulting feature maps consist of rotation-invariant features, and have the same transformation law as the input image. This network (P4CNNRotationPooling) outperforms the baseline and the previous state of the art, but performs significantly worse than the P4CNN, which does not pool over rotations in intermediate layers.

    Network                      Test Error (%)
    Larochelle et al. (2007)     10.38 ± 0.27
    Sohn & Lee (2012)            4.2
    Schmidt & Roth (2012)        3.98
    Z2CNN                        5.03 ± 0.0020
    P4CNNRotationPooling         3.21 ± 0.0012
    P4CNN                        2.28 ± 0.0004

Table 1. Error rates on rotated MNIST (with standard deviation under variation of the random seed).

8.2. CIFAR-10

The CIFAR-10 dataset consists of 60k images of size 32 × 32, divided into 10 classes. The dataset is split into 40k training, 10k validation and 10k test images.

We compared the p4, p4m and standard planar Z² convolutions on two kinds of baseline architectures. Our first baseline is the All-CNN-C architecture of Springenberg et al. (2015), which consists of a sequence of 9 strided and non-strided convolution layers, interspersed with rectified linear activation units, and nothing else. Our second baseline is a residual network (He et al., 2016), which consists of an initial convolution layer, followed by three stages of 2n convolution layers using k_i filters at stage i, followed by a final classification layer (6n + 2 layers in total). The first convolution in each stage i > 1 uses a stride of 2, so the feature map sizes are 32, 16, and 8 for the three stages. We use n = 7 and k_i = 32, 64, 128, yielding a wide 44-layer network called ResNet44.

To evaluate G-CNNs, we replaced all convolution layers of
the baseline architectures by p4 or p4m convolutions. For a constant number of filters, this increases the size of the feature maps 4- or 8-fold, which in turn increases the number of parameters required per filter in the next layer. Hence, we halve the number of filters in each p4-conv layer, and divide it by roughly √8 ≈ 3 in each p4m-conv layer. This way, the number of parameters is left approximately invariant, while the size of the internal representation is increased. Specifically, we used k_i = 11, 23, 45 for the p4m-ResNet44.
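A quick back-of-the-envelope check (ours) of this scaling rule: ignoring biases and the first layer, a G-conv layer between G-feature maps has about K_out · K_in · |S| · n² filter parameters, where |S| is the size of the stabilizer (1, 4 or 8 for Z², p4, p4m), so dividing the channel counts by √|S| keeps the count roughly constant.

```python
def gconv_params(k_in, k_out, s, n=3):
    """Filter parameters of a G-conv layer between G-feature maps:
    each of the k_out filters is a function on G with k_in * s * n * n entries."""
    return k_out * k_in * s * n * n

k = 32                                     # channel count of the planar baseline layer
for name, s in [("Z2", 1), ("p4", 4), ("p4m", 8)]:
    k_g = round(k / s ** 0.5)              # divide the channels by sqrt(|stabilizer|)
    print(name, k_g, gconv_params(k_g, k_g, s))
# Z2 32 9216 / p4 16 9216 / p4m 11 8712: roughly constant, consistent with
# the k_i = 11, 23, 45 used for the p4m-ResNet44 above.
```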
To evaluate the impact of data augmentation, we compare the networks on CIFAR10 and augmented CIFAR10+. The latter denotes moderate data augmentation with horizontal flips and small translations, following Goodfellow et al. (2013) and many others.

The training procedure for the All-CNN was reproduced as closely as possible from Springenberg et al. (2015). For the ResNets, we used stochastic gradient descent with an initial learning rate of 0.05 and momentum 0.9. The learning rate was divided by 10 at epochs 50, 100 and 150, and training was continued for 300 epochs.

    Network     G     CIFAR10   CIFAR10+   Param.
    All-CNN     Z²    9.44      8.86       1.37M
                p4    8.84      7.67       1.37M
                p4m   7.59      7.04       1.22M
    ResNet44    Z²    9.45      5.61       2.64M
                p4m   6.46      4.94       2.62M

Table 2. Comparison of conventional (i.e. Z²), p4 and p4m CNNs on CIFAR10 and augmented CIFAR10+. Test set error rates and number of parameters are reported.

To the best of our knowledge, the p4m-CNN outperforms all published results on plain CIFAR10 (Wan et al., 2013; Goodfellow et al., 2013; Lin et al., 2014; Lee et al., 2015b; Srivastava et al., 2015; Clevert et al., 2015; Lee et al., 2015a). However, due to radical differences in model sizes and architectures, it is difficult to infer much about the intrinsic merit of the various techniques. It is quite possible that the cited methods would yield better results when deployed in larger networks or in combination with other techniques. Extreme data augmentation and model ensembles can also further improve the numbers (Graham, 2014).

Inspired by the wide ResNets of Zagoruyko & Komodakis (2016), we trained another ResNet with 26 layers and k_i = (71, 142, 248) (for planar convolutions) or k_i = (50, 100, 150) (for p4m convolutions). When trained with moderate data augmentation, this network achieves an error rate of 5.27% using planar convolutions, and 4.19% with p4m convolutions. This result is comparable to the 4.17% error reported by Zagoruyko & Komodakis (2016), but was obtained using fewer parameters (7.2M vs 36.5M).

9. Discussion & Future work

Our results show that p4 and p4m convolution layers can be used as a drop-in replacement for standard convolutions that consistently improves the results.

G-CNNs benefit from data augmentation in the same way as convolutional networks, as long as the augmentation comes from a group larger than G. Augmenting with flips and small translations consistently improves the results for the p4 and p4m-CNNs.

The CIFAR dataset is not actually symmetric, since objects typically appear upright. Nevertheless, we see substantial increases in accuracy on this dataset, indicating that there need not be a full symmetry for G-convolutions to be beneficial.

In future work, we want to implement G-CNNs that work on hexagonal lattices, which have an increased number of symmetries relative to square grids, as well as G-CNNs for 3D space groups. All of the theory presented in this paper is directly applicable to these groups, and the G-convolution can be implemented in such a way that new groups can be added by simply specifying the group operation and a bijective map between the group and the set of indices.

One limitation of the method as presented here is that it only works for discrete groups. Convolution on continuous (locally compact) groups is mathematically well-defined, but may be hard to approximate in an equivariant manner. A further challenge, already identified by Gens & Domingos (2014), is that a full enumeration of the transformations in a group may not be feasible if the group is large.

Finally, we hope that the current work can serve as a concrete example of the general philosophy of "structured representations", outlined in section 2. We believe that adding mathematical structure to a representation (and making sure that maps between representations preserve this structure) could enhance the ability of neural nets to see abstract similarities between superficially different concepts.

10. Conclusion

We have introduced G-CNNs, a generalization of convolutional networks that substantially increases the expressive capacity of a network without increasing the number of parameters. By exploiting symmetries, G-CNNs achieve state-of-the-art results on rotated MNIST and CIFAR10. We have developed the general theory of G-CNNs for discrete groups, showing that all layer types are equivariant to the action of the chosen group G. Our experimental results show that G-convolutions can be used as a drop-in replacement for spatial convolutions in modern network architectures, improving their performance without further tuning.
References

Lee, C., Xie, S., Gallagher, P.W., Zhang, Z., and Tu, Z. Deeply-Supervised Nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 38, pp. 562-570, 2015b.

Lenc, K. and Vedaldi, A. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Lin, M., Chen, Q., and Yan, S. Network In Network. In International Conference on Learning Representations (ICLR), 2014.

Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.

Manay, S., Cremers, D., Hong, B.W., Yezzi, A.J., and Soatto, S. Integral invariants for shape matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1602-1617, 2006.

Mathieu, M., Henaff, M., and LeCun, Y. Fast Training of Convolutional Networks through FFTs. In International Conference on Learning Representations (ICLR), 2014.

Springenberg, J.T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for Simplicity: The All Convolutional Net. In International Conference on Learning Representations (ICLR), 2015.

Srivastava, R.K., Greff, K., and Schmidhuber, J. Training Very Deep Networks. In Advances in Neural Information Processing Systems (NIPS), 2015.

Vasilache, N., Johnson, J., Mathieu, M., Chintala, S., Piantino, S., and LeCun, Y. Fast convolutional nets with fbfft: A GPU performance evaluation. In International Conference on Learning Representations (ICLR), 2015.

Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. Regularization of neural networks using DropConnect. In International Conference on Machine Learning (ICML), pp. 109-111, 2013.

Zagoruyko, S. and Komodakis, N. Wide Residual Networks. arXiv:1605.07146, 2016.

Zhang, C., Voinea, S., Evangelopoulos, G., Rosasco, L., and Poggio, T. Discriminative template learning in group-convolutional networks for invariant speech representations. In InterSpeech, pp. 3229-3233, 2015.
The gradient of the loss L with respect to the filters of a G-correlation layer is obtained by the chain rule:

    ∂L/∂ψ_j^{lk}(g) = Σ_h (∂L/∂f_k^l(h)) (∂f_k^l(h)/∂ψ_j^{lk}(g))
                    = Σ_h (∂L/∂f_k^l(h)) Σ_{h′,k′} f_{k′}^{l−1}(h′) (∂ψ_{k′}^{lk}(h⁻¹h′)/∂ψ_j^{lk}(g))
                    = Σ_h (∂L/∂f_k^l(h)) f_j^{l−1}(hg)
                    = [(∂L/∂f_k^l) ∗ f_j^{l−1}](g)                               (24)
So we see that both the forward and backward passes involve convolution or correlation operations, as is the case in standard convnets.
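The planar, single-channel special case of eq. 24 is easy to check numerically: the gradient of the loss with respect to the filter is itself a correlation of the layer input with the upstream gradient. The 1D sketch below (ours; names and sizes are arbitrary) compares the analytic expression with centered finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(16)               # layer input on Z (1D for brevity)
psi = rng.standard_normal(5)              # filter
G = rng.standard_normal(12)               # upstream gradient dL/d(f * psi)

def forward(psi):
    return np.correlate(f, psi, mode='valid')       # y[x] = sum_k f[x + k] psi[k]

def loss(psi):
    return float(np.dot(G, forward(psi)))           # L = <G, f * psi>

# Analytic gradient: dL/dpsi[k] = sum_x G[x] f[x + k], a correlation of f with G.
grad_analytic = np.correlate(f, G, mode='valid')

# Centered finite differences as a check.
eps, grad_fd = 1e-6, np.zeros_like(psi)
for k in range(len(psi)):
    d = np.zeros_like(psi)
    d[k] = eps
    grad_fd[k] = (loss(psi + d) - loss(psi - d)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd, atol=1e-5))   # True
```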