Understanding Convolutional Neural Networks With A Mathematical Model
C.-C. Jay Kuo
Ming-Hsieh Department of Electrical Engineering
Abstract
This work attempts to address two fundamental questions about the structure of convolutional neural networks (CNNs): 1) why is a nonlinear activation function essential at the filter output of all intermediate layers, and 2) what is the advantage of a two-layer cascade system over a one-layer system? A mathematical model called the “REctified-COrrelations on a Sphere” (RECOS) model is proposed to answer these two questions. After the CNN training process, the converged filter weights define a set of anchor vectors in the RECOS model. Anchor vectors represent frequently occurring patterns (or spectral components). The necessity of rectification is explained using the RECOS model. Then, the behavior of a two-layer RECOS system is analyzed and compared with its one-layer counterpart. The LeNet-5 and the MNIST dataset are used to illustrate discussion points. Finally, the RECOS model is generalized to a multilayer system with the AlexNet as an example.
Keywords: Convolutional Neural Network (CNN), Nonlinear Activation,
RECOS Model, Rectified Linear Unit (ReLU), MNIST Dataset.
1. Introduction
There is a strong resurgence of interest in neural-network-based learning because of its superior performance in many speech and image/video understanding applications. The recent success of deep neural networks (DNNs) [1] is due to the availability of a large amount of labeled training data (e.g. the ImageNet) and more efficient computing hardware. It is called deep learning since we often observe performance improvement when adding more layers. The resulting networks and extracted features are called deep networks and deep features, respectively. There are two common neural network architectures: the convolutional neural networks (CNNs) [2] and the recurrent neural networks (RNNs). CNNs are used to recognize visual patterns directly from pixel images with variability. RNNs are designed to recognize patterns in time series composed of symbols or audio/speech waveforms. Both CNNs and RNNs are special types of multilayer neural networks. They are trained with the back-propagation algorithm. We will focus on CNNs in this work.
Although deep learning tends to outperform classical pattern recognition methods experimentally, its superior performance is difficult to explain. Without a good understanding of deep learning, we can only rely on a set of empirical rules and intuitions. A large amount of recent effort has been devoted to the understanding of CNNs. Examples include scattering networks [3, 4, 5], tensor analysis [6], generative modeling [7], relevance propagation [8], Taylor decomposition [9], etc. Another popular topic along this line is the visualization of filter responses at various layers [10, 11, 12].
It is worthwhile to point out that the CNN is a special form of the feed-
forward neural network (FNN), also known as the multi-layer perceptron
(MLP), trained with back-propagation. It was proved in [13] that FNNs are
capable of approximating any measurable function to any desired accuracy.
In short, FNNs are universal approximators. The success of CNNs in various
applications today is a reflection of the universal approximation capability
of FNNs. Despite this theoretical foundation, the internal operational mech-
anism of CNNs remains mysterious.
This research attempts to address two fundamental questions about CNNs: 1) Why is a nonlinear activation operation needed at the filter output of all intermediate layers? 2) What is the advantage of the cascade of two layers in comparison with a single layer? These two questions are related to each other. The convolutional operation is a linear one. If the nonlinear operation between every two convolutional layers is removed, the cascade of two linear systems is equivalent to a single linear system. Then, we can simply go with one linear system and the necessity of a multi-layer network architecture is not obvious. Although one may argue that a multi-layer network has a multi-resolution representation capability, this is a well-known fact that has been extensively studied before. Examples include the Gaussian and the wavelet pyramids. There must be something deeper than the multi-resolution property in the CNN architecture due to the adoption of the nonlinear activation unit.
The existence of nonlinear activation makes the analysis of CNNs challenging. To tackle this problem, we propose a mathematical model to understand the behavior of CNNs. We view a CNN as a network formed by basic operational units that conduct “REctified COrrelations on a Sphere (RECOS)”. Thus, it is called the RECOS model. A set of anchor vectors is selected for each RECOS unit to capture and represent frequently occurring patterns. For an input vector, we compute its correlation with each anchor vector to measure their similarity. All negative correlations are rectified to zero in the RECOS model, and the necessity of rectification is explained.
Anchor vectors are called filter weights in the CNN literature. In the network training, weights are first initialized and then adjusted by back-propagation to minimize a cost function. Here, we adopt a different name to emphasize their role in representing clustered input data in the RECOS model. After the analysis of nonlinear activation, we examine two-layer neural networks, where the first layer consists of either one or multiple RECOS units while the second layer contains only one RECOS unit. We conduct a mathematical analysis on the behavior of the cascaded RECOS systems so as to shed light on the advantage of deeper networks. The study concludes by analyzing the AlexNet, which is an exemplary multi-layer CNN.
To illustrate several discussion points, we use the LeNet-5 applied to the MNIST dataset (http://yann.lecun.com/exdb/mnist/) as an example. The MNIST dataset is formed by ten handwritten digits (0, 1, ..., 9). All digits are size-normalized and centered in an image of size 32 by 32. The dataset has a training set of 60,000 samples and a test set of 10,000 samples. The LeNet-5 was designed by LeCun et al. [14] for handwritten and machine-printed character recognition. Its architecture is shown in Fig. 1. The input image is an 8-bit image of size 32 by 32. The LeNet-5 has two pairs of convolutional/pooling layers, denoted by C1/S2 and C3/S4 in the figure, respectively. C1 has 6 filters of size 5 by 5. C3 has 16 filters of size 5 by 5. Each of them is followed by a nonlinear activation function (e.g. the sigmoid function). Furthermore, there are two fully connected layers, denoted by C5 and F6, after the two pairs of cascaded convolutional/pooling/clipping operations and before the output layer. The LeNet-5 has had a strong impact on the design of deeper networks in recent years. One example is the AlexNet proposed by Krizhevsky et al. [15].
Figure 1: The LeNet-5 architecture [14].
Figure 2: Plots of three nonlinear activation functions: the sigmoid, the ReLU and the leaky ReLU.
Three nonlinear activation functions are commonly adopted in CNNs: the sigmoid, the ReLU, and a ReLU variant also known as the leaky ReLU. All of them perform a clipping-like operation. The sigmoid clips the input into an interval between 0 and 1. The ReLU clips negative values to zero while keeping positive values unchanged. The leaky ReLU has a role similar to the ReLU, but it maps larger negative values to smaller ones by reducing the slope of the mapping function. It is observed experimentally that, if the nonlinear operation is removed, the system performance drops by a large margin.
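As a concrete reference for these three functions, a minimal NumPy sketch is given below; the slope value used for the leaky ReLU is an illustrative choice, not one taken from the text.

    import numpy as np

    def sigmoid(x):
        # Clips the input into the interval (0, 1).
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):
        # Clips negative values to zero, keeps positive values unchanged.
        return np.maximum(0.0, x)

    def leaky_relu(x, slope=0.01):
        # Like the ReLU, but maps negative values to smaller ones with a reduced slope.
        # The slope value 0.01 is a common illustrative choice, not taken from the paper.
        return np.where(x >= 0.0, x, slope * x)

    x = np.linspace(-6, 6, 7)
    print(sigmoid(x), relu(x), leaky_relu(x), sep="\n")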
Each convolutional layer is specified by its filter weights, which are determined in the training stage by an iterative update process. That is, they are first initialized and then adjusted by back-propagation to minimize a cost function. All weights are then fixed in the testing stage. These weights play the role of “system memory”. In this work, we adopt a different name for
filter weights to emphasize their role in the testing stage. We call them “an-
chor vectors” since they serve as reference signals (or visual patterns) for
each input patch of test images. It is well known that signal convolution can
also be viewed as signal correlation or projection. For an input image patch,
we compute its correlation with each anchor vector to measure their simi-
larity. Clearly, the projection onto a set of anchor vectors offers a spectral
decomposition of an input.
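The equivalence between convolution, correlation and projection mentioned above can be checked numerically. The sketch below, with a 5 × 5 patch and filter mirroring C1 of the LeNet-5 and random data for illustration, shows that correlating a patch with a filter equals convolving it with the flipped filter, which in turn equals an inner product of the flattened vectors.

    import numpy as np
    from scipy.signal import correlate2d, convolve2d

    rng = np.random.default_rng(0)
    patch = rng.standard_normal((5, 5))    # an input image patch
    anchor = rng.standard_normal((5, 5))   # one anchor vector (filter) reshaped to 5x5

    # Correlation of the patch with the anchor vector (a single inner product here).
    corr = correlate2d(patch, anchor, mode="valid")

    # Convolution with the flipped filter gives the same number.
    conv = convolve2d(patch, anchor[::-1, ::-1], mode="valid")
    assert np.allclose(corr, conv)

    # Projection view: the same value as an inner product of flattened vectors.
    assert np.isclose(corr.item(), patch.ravel() @ anchor.ravel())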
Anchor vectors are usually not orthogonal, and they form an under-complete set. Consider the LeNet-5. For the first convolutional layer (C1), the input patch is of size 5 × 5 = 25. It has 6 filters (or anchor vectors) of the same size. Thus, the dimension and the number of anchor vectors at C1 are 25 and 6, respectively. For the second convolutional layer (C3), its input is a hybrid spatial/spectral representation of dimension (5 × 5) × 6 = 150. Then, the dimension and the number of anchor vectors in C3 are 150 and 16, respectively.
Here, we interpret the compound operation of “convolution followed by nonlinear activation” as a mechanism to conduct “REctified COrrelations on a Sphere (RECOS)”. Without loss of generality, we adopt the ReLU activation function in the following discussion. All negative correlations are rectified to zero by the ReLU operation in the RECOS model. The necessity of rectification is explained below. To begin with, we consider a unit sphere centered at the origin.
Origin-Centered Unit Sphere. Let $\mathbf{x} = (x_1, \cdots, x_N)^T$ be an arbitrary vector on a unit sphere centered at the origin in the $N$-dimensional space, denoted by
$$S = \left\{ \mathbf{x} \;\Big|\; \|\mathbf{x}\| = \Big( \sum_{n=1}^{N} x_n^2 \Big)^{1/2} = 1 \right\}. \quad (1)$$
Figure 3: Three anchor vectors $\mathbf{a}_1$, $\mathbf{a}_2$, $\mathbf{a}_3$ on the unit sphere and the associated angles $\theta_1$, $\theta_2$, $\theta_3$.
We are interested in clustering x's with the geodesic distances over $S$. The geodesic distance between vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ in $S$ is proportional to the magnitude of their angle, which can be computed by
$$\theta(\mathbf{x}_i, \mathbf{x}_j) = \cos^{-1}\!\left( \mathbf{x}_i^T \mathbf{x}_j \right). \quad (2)$$
Figure 4: A gray-scale cat image and its negative image. They are negatively
correlated after mean removal. Their distance should be large (a white cat
versus a black cat).
Figure 5: Samples from the MNIST dataset: the original one (left) and the gray-scale-reversed one (right).
Consider a RECOS unit with $K$ anchor vectors $\mathbf{a}_k$, $k = 1, \cdots, K$. For an input vector $\mathbf{x}$, its output is
$$\mathbf{y} = (y_1, \cdots, y_k, \cdots, y_K)^T, \quad (3)$$
where
$$y_k(\mathbf{x}, \mathbf{a}_k) = \max(0, \mathbf{a}_k^T \mathbf{x}) \equiv \mathrm{Rec}(\mathbf{a}_k^T \mathbf{x}). \quad (4)$$
The form in Eq. (4) is the ReLU. Other variants such as the sigmoid function and the leaky ReLU are acceptable. As long as the negative correlation values remain small, the corresponding vectors are weakly correlated and will not have a major impact on the final result.
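A minimal sketch of one RECOS unit under these definitions follows: the input and the K anchor vectors are normalized onto the unit sphere, correlations are computed, and negative values are rectified to zero as in Eq. (4). The dimensions follow the C1 example (a 5 × 5 patch and 6 anchor vectors), the mean/augmentation handling is ignored, and the random data stand in for trained weights.

    import numpy as np

    def recos(x, anchors):
        # Rectified correlations on a sphere (Eqs. (3)-(4)).
        # x:       input vector of dimension N
        # anchors: K anchor vectors as rows of a (K, N) array
        # returns: y = (y_1, ..., y_K) with y_k = max(0, a_k^T x)
        x = x / np.linalg.norm(x)                                    # project onto the unit sphere
        a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
        return np.maximum(0.0, a @ x)                                # ReLU of the correlations

    rng = np.random.default_rng(1)
    x = rng.standard_normal(25)              # e.g. a 5x5 input patch, flattened
    anchors = rng.standard_normal((6, 25))   # e.g. the 6 anchor vectors of C1
    print(recos(x, anchors))                 # 6 rectified correlation values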
We can further generalize the RECOS model to a translated unit sphere
$$S_{\mu} = \left\{ \mathbf{x} \;\Big|\; \|\mathbf{x} - \mu \mathbf{1}\| = \Big[ \sum_{n=1}^{N} (x_n - \mu)^2 \Big]^{1/2} = 1 \right\}, \quad (5)$$
where $\mu = \frac{1}{N} \sum_{n=1}^{N} x_n$ is the mean of all $x_n$'s and $\mathbf{1} = (1, \cdots, 1, \cdots, 1)^T \in \mathbb{R}^N$ is a constant vector with all elements equal to one. Sphere $S_{\mu}$ is a translation of $S$ with a new center at $\mu \mathbf{1}$. This generalization is desired for the following reason.
For vision problems, elements $x_n$, $n = 1, \cdots, N$, of $\mathbf{x}$ denote $N$ pixel values of an input image, and $\mu$ is the mean of all pixels. If the input is a full image, its mean is the global mean that has no impact on image understanding. It can be removed before the processing. Thus, we set $\mu = 0$. However, if an input image is large, we often partition it into smaller patches and process all patches in parallel. In this case, the mean of each patch is a local mean. It should not be removed since an integration of local means provides a coarse view of the full image. This corresponds to the general case in Eq. (5).
Based on Eq. (4), the output with respect to $S_{\mu}$ can be written as $\mathbf{y} = (y_1, \cdots, y_K)^T$, where each $y_k$ is a rectified correlation computed with augmented input and anchor vectors $\mathbf{x}'$ and $\mathbf{a}'_k$. Although $\mathbf{x}', \mathbf{a}'_k \in \mathbb{R}^{N+1}$, they only have $N$ independent elements since their first elements are computed from the remaining $N$ elements.
Furthermore, the lengths of the input and anchor vectors may not be one. We use $\mathbf{x}''$ and $\mathbf{a}''_k$ to denote the general case. Then, we can set
$$\mathbf{x}' \equiv \frac{\mathbf{x}''}{\|\mathbf{x}''\|}, \qquad \mathbf{a}'_k \equiv \frac{\mathbf{a}''_k}{\|\mathbf{a}''_k\|}. \quad (9)$$
As discussed earlier, we can reverse the sign of all filter weights in the first layer (i.e. C1) while keeping the rest of the LeNet-5 the same to obtain a modified LeNet-5. Then, the modified LeNet-5 has the same recognition performance on the gray-scale-reversed test dataset. This observation can actually be proved mathematically. The input of a gray-scale-reversed image to the first layer can be written as
$$\mathbf{x}_r = 255 \cdot \mathbf{1} - \mathbf{x}, \quad (11)$$
where $\mathbf{x}$ is the input from the original image. The mean of the elements in $\mathbf{x}_r$, denoted by $\mu_r$, is equal to $\mu_r = 255 - \mu$, where $\mu$ is the mean of the elements in $\mathbf{x}$. Furthermore, the anchor vectors become
$$\mathbf{a}_k^r = -\mathbf{a}_k, \quad (12)$$
where $\mathbf{a}_k$ is the anchor vector of the LeNet-5. Then, by following Eq. (6), the output from the first layer of the modified LeNet-5 against the gray-scale-reversed image is equal to the output from the first layer of the LeNet-5 against the original input image, since the sign reversals of the mean-removed input and of the anchor vectors cancel each other. In other words, the two systems provide identical output vectors to be used in future layers.
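The argument can be checked numerically for a single filter: after mean removal, reversing the gray scale of the input (Eq. (11)) and the sign of the anchor vector leaves the rectified correlation unchanged. The random patch and filter below are placeholders for real LeNet-5 data.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.integers(0, 256, size=25).astype(float)   # a 5x5 patch of the original image
    a = rng.standard_normal(25)                       # one first-layer anchor vector

    x_r = 255.0 - x            # gray-scale-reversed patch, Eq. (11)
    a_r = -a                   # sign-reversed anchor vector, Eq. (12)

    # Remove the local means before correlating.
    y_original = np.maximum(0.0, a   @ (x   - x.mean()))
    y_reversed = np.maximum(0.0, a_r @ (x_r - x_r.mean()))

    assert np.isclose(y_original, y_reversed)   # identical first-layer outputs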
Figure 6: The receptive fields of the first- and the second-layer filters of the LeNet-5, where each dot denotes a pixel in the input image, the 5 × 5 window indicates the receptive field of the first-layer filter and the whole 13 × 13 window indicates the receptive field of the second-layer filter. The second-layer filter accepts the outputs from 5 × 5 = 25 first-layer filters. Only four of them are shown in the figure for simplicity.
We now analyze the behavior of the cascaded systems. This analysis sheds light on the advantage of deeper networks. In the following discussion, we begin with the cascade of one layer-1 RECOS unit and one layer-2 RECOS unit, and then generalize it to the cascade of multiple layer-1 RECOS units with one layer-2 RECOS unit. For simplicity, the means of all inputs are assumed to be zero.
One-to-One Cascade. We define two anchor matrices,
$$\mathbf{A} = [\mathbf{a}_1, \cdots, \mathbf{a}_K] \quad \text{and} \quad \mathbf{B} = [\mathbf{b}_1, \cdots, \mathbf{b}_L],$$
whose columns are the anchor vectors $\mathbf{a}_k$ and $\mathbf{b}_l$ of the two individual RECOS units. Clearly, $\mathbf{A} \in \mathbb{R}^{N \times K}$ and $\mathbf{B} \in \mathbb{R}^{K \times L}$. To make the analysis tractable, we begin with the correlation analysis and will take the nonlinear rectification effect into account at the end. For the correlation part, let $\mathbf{y} = \mathbf{A}^T \mathbf{x}$ and $\mathbf{z} = \mathbf{B}^T \mathbf{y}$. Then, we have
$$\mathbf{z} = \mathbf{B}^T \mathbf{A}^T \mathbf{x} = \mathbf{C}^T \mathbf{x}, \qquad \mathbf{C} \equiv \mathbf{A}\mathbf{B}. \quad (16)$$
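Eq. (16) states that, before rectification is taken into account, two cascaded correlation stages collapse into a single one. The sketch below verifies this and also shows that the collapse no longer holds once a ReLU is inserted between the two stages; the dimensions are arbitrary.

    import numpy as np

    rng = np.random.default_rng(3)
    N, K, L = 25, 6, 4
    A = rng.standard_normal((N, K))   # layer-1 anchor matrix
    B = rng.standard_normal((K, L))   # layer-2 anchor matrix
    x = rng.standard_normal(N)

    # Without rectification, the two-layer cascade equals one linear map (Eq. (16)).
    z_two_layer = B.T @ (A.T @ x)
    z_one_layer = (A @ B).T @ x
    assert np.allclose(z_two_layer, z_one_layer)

    # With the ReLU between the layers, the equivalence no longer holds in general.
    z_rectified = B.T @ np.maximum(0.0, A.T @ x)
    print(np.allclose(z_rectified, z_one_layer))   # typically False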
Figure 7: Illustration of the anchor-position vector $\boldsymbol{\alpha}_n$ associated with the $n$-th component of an input $\mathbf{x}$ and the anchor vectors $\mathbf{a}_1, \mathbf{a}_2, \cdots, \mathbf{a}_K$.
For the $n$-th coordinate-unit-vector $\mathbf{e}_n$, we define the anchor-position vector
$$\boldsymbol{\alpha}_n = \mathbf{A}^T \mathbf{e}_n. \quad (19)$$
Since $c_{n,l} = \boldsymbol{\alpha}_n^T \mathbf{b}_l$ is the $(n,l)$ element of $\mathbf{C} = \mathbf{A}\mathbf{B}$, the rectified counterpart $\mathbf{z}' = \mathbf{C}'^T \mathbf{x}$ is obtained from elements
$$c'_{n,l} = \mathrm{Rec}(c_{n,l}) = \mathrm{Rec}(\boldsymbol{\alpha}_n^T \mathbf{b}_l). \quad (21)$$
Rigorously speaking, $\mathbf{z}$ and $\mathbf{z}'$ are not the same. The former has no rectification operation while the latter applies the rectification operation to the matrix product. The actual system applies the rectification operations to the outputs of both layers, and its final result, denoted by $\mathbf{z}''$, can be different from $\mathbf{z}$ and $\mathbf{z}'$. Here, we are most interested in regions where $\mathbf{z} \approx \mathbf{z}' \approx \mathbf{z}''$ in the sense that they go through the same rectification processes in both layers. Under this assumption, our above analysis holds for the unrectified part of the input.
Many-to-One Cascade. It is straightforward to generalize the one-to-one cascaded case to the many-to-one cascaded case. The correlation of the first-layer RECOS units can be written as
$$\mathbf{Y} = \mathbf{A}^T \mathbf{X}, \quad (22)$$
where
$$\mathbf{Y} = [\mathbf{y}_1, \cdots, \mathbf{y}_P], \qquad \mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_P]. \quad (23)$$
There are $P$ RECOS units working in parallel in the first layer. They cover spatially adjacent regions yet share one common anchor matrix. They are used to extract common representative patterns in different regions. The correlation of the second-layer RECOS unit can be expressed as
$$\mathbf{z} = \mathbf{B}^T \tilde{\mathbf{y}}, \quad (24)$$
where $\tilde{\mathbf{y}}$ is formed by cascading the outputs of the $P$ first-layer RECOS units. As in the one-to-one case, the two correlation stages can be combined into a single matrix, denoted by $\mathbf{D}$, so that
$$\mathbf{z} = \mathbf{D}^T \mathbf{x}, \quad (25)$$
where $\mathbf{x}$ collects the inputs $\mathbf{x}_p$ of the $P$ first-layer RECOS units.
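A sketch of the many-to-one cascade is given below. It assumes, as described above, that the P first-layer RECOS units share one anchor matrix A, that their outputs are concatenated into ỹ, and that the second layer correlates the concatenation with its own anchor matrix B; the sizes and the concatenation order are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(4)
    N, K, P, L = 25, 6, 4, 16            # patch dim, layer-1 anchors, parallel units, layer-2 anchors
    A = rng.standard_normal((N, K))      # shared layer-1 anchor matrix
    B = rng.standard_normal((P * K, L))  # layer-2 anchor matrix acting on the concatenated outputs
    X = rng.standard_normal((N, P))      # P spatially adjacent patches as columns (Eq. (23))

    Y = A.T @ X                          # first-layer correlations (Eq. (22)), one column per unit
    y_tilde = Y.flatten(order="F")       # concatenate the P output vectors (ordering is an assumption)
    z = B.T @ y_tilde                    # second-layer correlation (Eq. (24))
    # The actual RECOS units additionally rectify Y and z with the ReLU.
    print(z.shape)                       # (16,)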
Figure 8: The MNIST dataset with 10 different background scenes is shown in the top two rows, while the output images in the 6 spectral channels of the first layer and the 16 spectral channels of the second layer, with respect to the input images given in the leftmost column, are shown in the bottom three rows. The structured background has an impact on the 6 channel responses at the first layer, yet its impact on the 16 channel responses at the second layer diminishes. This phenomenon can be explained by the analysis given in Section 3.
Recall that $c'_{n,l}$ is the rectified inner product of $\boldsymbol{\alpha}_n$ and $\mathbf{b}_l$ as given in Eq. (21). Anchor vectors $\mathbf{a}_k$ capture representative global patterns, but they are weak in capturing position-sensitive information. This shortcoming can be compensated by modulating $\mathbf{b}_l$ with elements of the anchor-position vector $\boldsymbol{\alpha}_n$.
We use an example to illustrate this point. First, we modify the MNIST
training and testing datasets by adding ten different background scenes ran-
domly to the original handwritten digits in the MNIST dataset [14]. They
are shown in the top two rows in Fig. 8. For the bottom three rows, we show three input digit images in the leftmost column, the six spectral output images from the convolutional layer and the ReLU layer in the middle column, and the 16 spectral output images in the right two columns. It is difficult to find a good anchor matrix for the first layer due to background diversity.
However, background scenes in these images are not consistent in the spatial
domain while foreground digits are. As a result, they can be filtered out
more easily by modulating anchor vectors bl in the second layer using the
anchor-position vector, αn , in the first layer.
Experiments are further conducted to provide supporting evidence. First, we add one of the ten complex background scenes to test images randomly and pass them to the LeNet-5 trained with the original MNIST data of clean background. The recognition rate drops from 98.94% to 90.65%. This is because this network has not yet seen any handwritten digits with complex background. Next, we modify the MNIST training data by adding one of the ten complex background scenes randomly and train the LeNet-5 using the modified MNIST data again. The newly trained network has a correct classification rate of 98.89% and 98.86% on the original and the modified MNIST test datasets, respectively. We see clearly that the addition of sufficiently diverse complex background scenes in the training data has little impact on the capability of the LeNet-5 in recognizing images with clean background. This is because the complex background is not consistent with labeled digits and, as a result, the network can focus on the foreground digits and ignore background scenes through the cascaded two-layer architecture. Our previous analysis provides a mathematical explanation of this experimental result. It is also possible to understand this phenomenon by visualizing CNN filter responses [10, 11, 12].
Role of Fully Connected Layers. A CNN can be decomposed into
two subnetworks (or subnets): the feature extraction subnet and the decision
subnet. For the LeNet-5, the feature extraction subnet consists of C1, S2, C3
and S4 while the decision subnet consists of C5, F6 and Output, as shown in Fig. 9.

Figure 9: Illustration of the functions of (a) C5 and (b) F6.

The decision subnet has the following three roles: 1) converting the
spectral-spatial feature map from the output of S4 into one feature vector
of dimension 120 in C5; 2) adjusting anchor vectors so that they are aligned
with the coordinate-unit-vector with correct feature/digit pairing in F6; and
3) making the final digit classification decision in Output.
The functions of C5 and F6 are illustrated in Fig. 9(a) and (b), respectively. C5 assigns anchor vectors to feature clusters as shown in Fig. 9(a). There are 120 anchor vectors in the 400-D space to be assigned (or trained) in the LeNet-5. Then, an anchor matrix is used in F6 to rotate and rescale the anchor vectors of C5 to their new positions. The objective is to ensure that the feature cluster of an object class is aligned with the coordinate-unit-vector of the same object class for decision making in the Output. This is illustrated in Fig. 9(b). Every coordinate-unit-vector in the Output is an anchor vector, and each of them represents a digit class. The softmax rule is widely used in the Output for final decision making.
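A minimal sketch of the decision subnet viewed through the RECOS lens is given below: the 400-D feature vector from S4 is correlated with 120 anchor vectors in C5, then with 84 anchor vectors in F6, and finally with the 10 class anchor vectors of the Output layer followed by the softmax. The random matrices stand in for trained weights, and the ReLU is used in place of the original activation for consistency with the RECOS discussion.

    import numpy as np

    rng = np.random.default_rng(5)
    feature = rng.standard_normal(400)          # flattened S4 output (16 x 5 x 5)
    A_c5  = rng.standard_normal((400, 120))     # 120 anchor vectors of C5
    A_f6  = rng.standard_normal((120, 84))      # 84 anchor vectors of F6
    A_out = rng.standard_normal((84, 10))       # 10 anchor vectors, one per digit class

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    h5 = np.maximum(0.0, A_c5.T @ feature)      # rectified correlations in C5
    h6 = np.maximum(0.0, A_f6.T @ h5)           # rectified correlations in F6
    probs = softmax(A_out.T @ h6)               # class probabilities in the Output layer
    print(probs.argmax(), probs.sum())          # predicted digit and a sanity check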
Multi-Layer CNN. We list the traditional layer names, the RECOS notations, and their input and output vector dimensions for the LeNet-5 in Table 1. The output vector dimension is the same as the number of anchor vectors in the same layer. Vector augmentation is needed in $S^1$ since the local mean of its input could be non-zero. However, it is not needed in $S^2$, $S^3$, $S^4$ and $S^5$ since the global mean is removed.
Table 1: The specification of RECOS units used in the LeNet-5, where the third column (N) shows the dimension of the input and the fourth column (K) shows the dimension of the output of the corresponding layer. Note that K is also the number of anchor vectors.

LeNet-5   RECOS   N                K
C1/S2     S^1     (5 × 5) + 1      6
C3/S4     S^2     (6 × 5 × 5)      16
C5        S^3     16 × 5 × 5       120
F6        S^4     120 × 1 × 1      84
Output    S^5     84 × 1 × 1       10
Table 2: The specification of RECOS units used in the AlexNet, where the third column (N) shows the dimension of the input and the fourth column (K) shows the dimension of the output of S^l, where l = 1, ..., 8. K is also the number of anchor vectors.

AlexNet   RECOS   N                   K
Conv 1    S^1     (3 × 11 × 11) + 1   96
Conv 2    S^2     (96 × 5 × 5) + 1    256
Conv 3    S^3     (256 × 3 × 3) + 1   384
Conv 4    S^4     (384 × 3 × 3) + 1   384
Conv 5    S^5     (384 × 3 × 3) + 1   256
FC 6      S^6     256 × 1 × 1         4096
FC 7      S^7     4096 × 1 × 1        4096
FC 8      S^8     4096 × 1 × 1        1000
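The N column of Tables 1 and 2 can be reproduced from the filter height and width, the number of input channels, and an extra element for the layers that use vector augmentation. A small helper that performs this arithmetic is sketched below.

    def recos_input_dim(channels, height, width, augmented=False):
        # Dimension of the input vector seen by one RECOS unit:
        # all pixels in the filter window across all input channels,
        # plus one extra element when vector augmentation is used.
        return channels * height * width + (1 if augmented else 0)

    # LeNet-5 (Table 1)
    print(recos_input_dim(1, 5, 5, augmented=True))    # C1/S2  -> 26
    print(recos_input_dim(6, 5, 5))                    # C3/S4  -> 150
    print(recos_input_dim(16, 5, 5))                   # C5     -> 400

    # AlexNet (Table 2)
    print(recos_input_dim(3, 11, 11, augmented=True))  # Conv 1 -> 364
    print(recos_input_dim(96, 5, 5, augmented=True))   # Conv 2 -> 2401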
In this work, the RECOS model was proposed to explain the necessity of nonlinear activation and the advantage of the two-layer RECOS model over the single-layer RECOS model. The proposed RECOS mathematical model is centered on the selection of anchor vectors. CNNs do offer a very powerful tool for image processing and understanding. There are, however, a few open problems remaining in CNN interpretability and its wider applications. Some of them are listed below.
1. Application-Specific CNN Architecture
In CNN training, the CNN architecture (including the layer number and
the filter number at each layer, etc.) has to be specified in advance.
Given a fixed architecture, the filter weights are optimized by an end-to-
end optimization framework. Generally speaking, simple tasks can be
well handled by smaller CNNs. However, there is still no clear guideline for CNN architecture design for a given class of applications. The anchor-
vector viewpoint encourages us to examine the properties of source data
carefully. A good understanding of source data distribution contributes
to the design of more efficient CNN architectures and more effective
training.
2. Robustness to Input Variations
The LeNet-5 was shown to be robust with respect to a wide range of
variations in [14]. Yet, the robustness of CNNs is challenged by recent
studies, e.g. [19]. It is an interesting topic to understand the causes of
these problems so as to design more error-resilient CNNs.
3. Weakly Supervised Learning
The training of CNNs demands a large amount of labeled data. It is expensive to collect labeled data. Furthermore, the labeling rules could be different from one dataset to another even for the same application. It is important to reduce the labeling burden and allow CNN training using partially and flexibly labeled data. In other words, we need to move from heavily supervised learning to weakly supervised learning to make CNNs widely applicable.
4. Effective Back-propagation Training
Effective back-propagation training is important as CNNs become more and more complicated nowadays. Several back-propagation speed-up schemes have been proposed. One is dropout [15]. Another one is to inject carefully chosen noise to achieve faster convergence, as presented in [20]. New methods along this direction are very much needed.
Acknowledgment
The author would like to thank Mr. Zhehang Ding for his help in running experiments and drawing figures for this article. This material is based on
research sponsored by DARPA and Air Force Research Laboratory (AFRL)
under agreement number FA8750-16-2-0173. The U.S. Government is au-
thorized to reproduce and distribute reprints for Governmental purposes
notwithstanding any copyright notation thereon. The views and conclusions
contained herein are those of the authors and should not be interpreted as
necessarily representing the official policies or endorsements, either expressed
or implied, of DARPA and Air Force Research Laboratory (AFRL) or the
U.S. Government.
References
[1] B. H. Juang, Deep neural networks–a developmental perspective, AP-
SIPA Transactions on Signal and Information Processing 5 (2016) e7.
[2] Y. LeCun, Y. Bengio, G. E. Hinton, Deep learning, Nature 521 (2015)
436–444.
[3] S. Mallat, Group invariant scattering, Communications on Pure and
Applied Mathematics 65 (10) (2012) 1331–1398.
[4] J. Bruna, S. Mallat, Invariant scattering convolution networks, IEEE
transactions on pattern analysis and machine intelligence 35 (8) (2013)
1872–1886.
[5] T. Wiatowski, H. Bölcskei, A mathematical theory of deep convolutional
neural networks for feature extraction, arXiv preprint arXiv:1512.06293.
[6] N. Cohen, O. Sharir, A. Shashua, On the expressive power of deep learn-
ing: a tensor analysis, arXiv preprint arXiv:1509.05009 556.
[7] J. Dai, Y. Lu, Y.-N. Wu, Generative modeling of convolutional neural
networks, arXiv preprint arXiv:1412.6296.
[8] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller,
W. Samek, On pixel-wise explanations for non-linear classifier decisions
by layer-wise relevance propagation, PloS one 10 (7) (2015) e0130140.
[9] G. Montavon, S. Bach, A. Binder, W. Samek, K.-R. Müller, Explaining
nonlinear classification decisions with deep Taylor decomposition, arXiv
preprint arXiv:1512.02479.
[10] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional net-
works: visualising image classification models and saliency maps, arXiv
preprint arXiv:1312.6034.
[11] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional
networks, in: European Conference on Computer Vision, Springer, 2014,
pp. 818–833.
[12] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Object detec-
tors emerge in deep scene CNNs, arXiv preprint arXiv:1412.6856.
[13] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks
are universal approximators, Neural networks 2 (5) (1989) 359–366.
[14] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning
applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[15] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with
deep convolutional neural networks, in: F. Pereira, C. J. C. Burges,
L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information
Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105.
[16] W. S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in
nervous activity, The Bulletin of Mathematical Biophysics 5 (4) (1943)
115–133.