Coupled Oscillatory Recurrent Neural Network (coRNN): An Accurate and (Gradient) Stable Architecture for Learning Long Time Dependencies
October 5, 2020
arXiv:2010.00951v1 [cs.LG] 2 Oct 2020
Abstract
Circuits of biological neurons, such as those in functional parts of the brain, can be modeled as
networks of coupled oscillators. Inspired by the ability of these systems to express a rich set of outputs
while keeping (gradients of) state variables bounded, we propose a novel architecture for recurrent
neural networks. Our proposed RNN is based on a time-discretization of a system of second-order
ordinary differential equations, modeling networks of controlled nonlinear oscillators. We prove
precise bounds on the gradients of the hidden states, leading to the mitigation of the exploding and
vanishing gradient problem for this RNN. Experiments show that the proposed RNN is comparable in
performance to the state of the art on a variety of benchmarks, demonstrating the potential of this
architecture to provide stable and accurate RNNs for processing complex sequential data.
1 Introduction
Recurrent neural networks (RNNs) have achieved tremendous success in a variety of tasks involving
sequential (time series) inputs and outputs, ranging from speech recognition to computer vision and
natural language processing, among others. However, it is well known that training RNNs to process
inputs over long time scales (input sequences) is notoriously hard on account of the so-called exploding
and vanishing gradient problem (EVGP) [33], which stems from the fact that the well-established BPTT
algorithm for training RNNs requires computing products of gradients (Jacobians) of the underlying
hidden states over very long time scales. Consequently, the overall gradient can grow (to infinity) or decay
(to zero) exponentially fast with respect to the number of recurrent interactions.
A variety of approaches have been suggested to mitigate the exploding and vanishing gradient problem.
These include adding gating mechanisms to the RNN in order to control the flow of information in the
network, leading to architectures such as long short-term memory (LSTM) [21] and gated recurrent units
(GRU) [10], which can overcome the vanishing gradient problem on account of the underlying additive
structure. However, the gradients might still explode and learning very long term dependencies remains
a challenge [30]. Another popular approach for handling the EVGP is to constrain the structure of
underlying recurrent weight matrices by requiring them to be orthogonal (unitary), leading to the so-called
orthogonal RNNs [20, 2, 42, 24] and references therein. By construction, the resulting Jacobians have eigen-
and singular-spectra with unit norm, alleviating the EVGP. However, as pointed out in [24], imposing such
constraints on the recurrent matrices may lead to a significant loss of expressivity of the RNN resulting in
inadequate performance on realistic tasks.
In this article, we adopt a different approach, based on the observation that coupled networks of controlled,
forced and damped nonlinear oscillators, which arise in many physical, engineering and biological systems
such as networks of biological neurons, seem to ensure expressive representations while constraining
the dynamics of state variables and their gradients. This motivates us to propose a novel architecture for
RNNs, based on time-discretizations of second-order systems of non-linear ordinary differential equations
∗ Seminar for Applied Mathematics (SAM), D-MATH
(ODEs) (1) that model coupled oscillators. For these RNNs, we are able to rigorously prove precise bounds
on the hidden states and their gradients, mitigating the exploding and vanishing gradient problem, while
demonstrating through benchmark numerical experiments that the resulting architecture still retains
sufficient expressivity, with a performance comparable to the state of the art on a variety of
sequential learning tasks.
y'' = σ(W y + 𝒲 y' + V u + b) − γ y − ε y'.   (1)
Here, t ∈ [0, 1] is the (continuous) time variable, u = u(t) ∈ R^d is the time-dependent input signal,
y = y(t) ∈ R^m is the hidden state of the RNN, W, 𝒲 ∈ R^{m×m} and V ∈ R^{m×d} are weight matrices, b ∈ R^m
is the bias vector and 0 < ε, γ are parameters. σ : R → R is the activation function, set to σ(u) = tanh(u)
here. By introducing the so-called velocity variable z = y'(t) ∈ R^m, we rewrite (1) as the first-order system:

y' = z,   z' = σ(W y + 𝒲 z + V u + b) − γ y − ε z.   (2)
We fix a timestep 0 < Δ𝑡 < 1 and define our proposed RNN hidden states at time 𝑡 𝑛 = 𝑛Δ𝑡 ∈ [0, 1] (while
omitting the affine output state) as the following IMEX (implicit-explicit) discretization of the first order
system (2):
y_n = y_{n−1} + Δt z_n,
z_n = z_{n−1} + Δt σ(W y_{n−1} + 𝒲 z_{n−1} + V u_n + b) − Δt γ y_{n−1} − Δt ε z_n.   (3)
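For concreteness, one step of the update (3) can be written in a few lines of code. The following PyTorch sketch is ours, not taken from the released implementation: the two recurrent matrices W and 𝒲 are packed into a single linear layer, and the Δt ε z_n term is treated implicitly by solving for z_n (SM§A notes that the explicit variant using Δt ε z_{n−1} behaves very similarly).

```python
import torch
import torch.nn as nn

class CoRNNCell(nn.Module):
    """One IMEX step of the coRNN update (3); a sketch, not the reference implementation."""
    def __init__(self, input_size, hidden_size, dt=0.01, gamma=1.0, eps=1.0):
        super().__init__()
        self.dt, self.gamma, self.eps = dt, gamma, eps
        self.i2h = nn.Linear(input_size, hidden_size)                   # V u_n + b
        self.h2h = nn.Linear(2 * hidden_size, hidden_size, bias=False)  # [W, 𝒲] applied to [y; z]

    def forward(self, u, y, z):
        # A_{n-1} = W y_{n-1} + 𝒲 z_{n-1} + V u_n + b, passed through tanh
        a = torch.tanh(self.h2h(torch.cat([y, z], dim=-1)) + self.i2h(u))
        # solve z_n = z_{n-1} + dt * (a - gamma * y_{n-1} - eps * z_n) for z_n
        z = (z + self.dt * (a - self.gamma * y)) / (1.0 + self.dt * self.eps)
        y = y + self.dt * z                                             # y_n = y_{n-1} + dt * z_n
        return y, z

# Unrolling over a sequence u of shape (T, batch, input_size):
#   y = z = torch.zeros(batch, hidden_size)
#   for u_n in u:
#       y, z = cell(u_n, y, z)
```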
Motivation and background. We term the RNN (3) the coupled oscillatory Recurrent Neural Network
(coRNN), because each neuron is a controlled, forced, damped nonlinear oscillator [18], with the diagonal entries
of W and 𝒲 controlling the frequency and amount of damping of the oscillation, respectively, whereas
the non-diagonal entries of these matrices modulate interactions between neurons in the network. The
parameters V, b modulate the effect of the driving force proportional to the input signal u(t) and the tanh
activation mediates a non-linear response. We provide heuristics for the dynamics of oscillator networks
in SM§B, where we demonstrate with simple examples that a network of (forced, driven) oscillators can
access a very rich set of output states; in particular, oscillatory input signals can yield non-oscillatory
outputs. This ability of such systems to express a variety of output states indicates the possibility of
high expressivity for the proposed RNN.
Such oscillator networks are ubiquitous in nature and in engineering systems [18, 38] with canonical
examples being pendulums (classical mechanics), business cycles (economics), heartbeat (biology) for
single oscillators and electrical circuits for networks of oscillators. Our motivating example arises in
neurobiology, where individual biological neurons can be viewed as oscillators with periodic spiking and
firing of the action potential. Moreover, functional circuits of the brain, such as cortical columns and
prefrontal-striatal-hippocampal circuits, are increasingly interpreted as networks of oscillatory
neurons, see [37] for an overview and [17] for modeling specific brain functions such as interval timing
and working memory as oscillatory neural networks. Following well-established paths in machine learning
such as for convolutional neural networks [29], our focus here is to abstract the essence of functional brain
circuits being networks of oscillators and design an RNN based on much simpler mechanistic systems such
as those modeled by (2), while ignoring the complicated biological details of neural function.
Related work. While there are many examples of ODE and dynamical systems inspired RNN ar-
chitectures, these approaches can roughly be distinguished into two branches, namely RNNs based on
discretized ODEs and continuous-time RNNs. Examples of continuous-time approaches include neural
ODEs [8] with ODE-RNNs [35] as its recurrent extension as well as [13] and references therein, to name
just a few. We focus, however, in this article on an ODE-inspired discrete-time RNN, as the proposed
coRNN is derived from a discretized ODE. A prominent example for discrete-time ODE-based RNNs is
the so-called anti-symmetric RNN of [6], where the RNN architecture is based on a stable ODE using a
skew-symmetric hidden weight matrix. We also mention hybrid methods, which use a discretization of an
ODE (in particular a Hamiltonian system) in order to learn the continuous representation of the data,
see for instance [15, 9]. Our approach here differs from these papers in our explicit use of networks of
oscillators, with the underlying biological motivation.
For the derivation of our bounds, it is convenient to solve the z_n-equation in (3) for z_n; setting γ = ε = 1 for notational simplicity, this yields the equivalent form

y_n = y_{n−1} + Δt z_n,
z_n = z_{n−1}/(1+Δt) + (Δt/(1+Δt)) σ(A_{n−1}) − (Δt/(1+Δt)) y_{n−1},   A_{n−1} := W y_{n−1} + 𝒲 z_{n−1} + V u_n + b.   (4)
Bounds on the hidden states. As the hidden states in the RNN (3) are the outputs of a network of
driven, damped oscillators, with a bounded (tanh) nonlinearity, it is straightforward to obtain,
Proposition 3.1. Let y𝑛 , z𝑛 be the hidden states of the RNN (4) for 1 ≤ 𝑛 ≤ 𝑁, then the hidden states
satisfy the following (energy) bounds:
y_n^⊤ y_n + z_n^⊤ z_n ≤ n m Δt = m t_n ≤ m.   (5)
The proof of the energy bound (5) is provided in SM§C.1 and a straightforward variant of the proof
(see SM§C.2) yields an estimate on the sensitivity of the hidden states to changing inputs. In particular,
this bound rules out chaotic behavior of hidden states.
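The energy bound (5) is also easy to check numerically: running the recurrence (4) from a zero initial state with arbitrary bounded inputs and randomly chosen weights (the weight scales below are ours, picked only for illustration), the quantity y_n^⊤ y_n + z_n^⊤ z_n indeed stays below m t_n.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, dt, N = 32, 4, 0.01, 100                        # hidden size, input size, time step, steps
W, Wc = 0.5 * rng.standard_normal((m, m)), 0.5 * rng.standard_normal((m, m))
V, b = rng.standard_normal((m, d)), np.zeros(m)

y, z = np.zeros(m), np.zeros(m)
for n in range(1, N + 1):
    u = rng.uniform(-1.0, 1.0, d)                     # arbitrary bounded input u_n
    a = np.tanh(W @ y + Wc @ z + V @ u + b)           # A_{n-1} under tanh
    z = (z + dt * a - dt * y) / (1.0 + dt)            # update (4), i.e. gamma = eps = 1
    y = y + dt * z
    assert y @ y + z @ z <= m * n * dt + 1e-12        # energy bound (5)
```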
Bounds on hidden state gradients. We train the RNN (3) to minimize the loss function,
E := (1/N) ∑_{n=1}^N E_n,   E_n = (1/2) ‖y_n − ȳ_n‖_2²,   (6)
with ȳ being the underlying ground truth (training data). During training, we compute gradients of the
loss function (6) with respect to the weights and biases Θ = [W, 𝒲, V, b], i.e.
∂E/∂θ = (1/N) ∑_{n=1}^N ∂E_n/∂θ,   ∀ θ ∈ Θ.   (7)
Proposition 3.2. Let y𝑛 , z𝑛 be the hidden states generated by the RNN (4). We assume that the time
step Δ𝑡 can be chosen such that,
max( Δt(1 + ‖W‖_∞)/(1 + Δt),  Δt ‖𝒲‖_∞/(1 + Δt) ) = η ≤ Δt^r,   1/2 ≤ r ≤ 1.   (8)

Denoting δ = 1/(1+Δt), the gradient of the loss function E (6) with respect to any parameter θ ∈ Θ is bounded as,

|∂E/∂θ| ≤ C m δ.   (9)
Sketch of proof. Writing X_n = [y_n, z_n] for the full hidden state, the gradient decomposes as

∂E_n/∂θ = ∑_{k=1}^{n} ∂E_n^{(k)}/∂θ,   ∂E_n^{(k)}/∂θ := (∂E_n/∂X_n) (∂X_n/∂X_k) (∂⁺X_k/∂θ).   (10)

Here, the notation ∂⁺X_k/∂θ refers to taking the partial derivative of X_k with respect to the parameter θ, while
keeping the other arguments constant. This quantity can be readily calculated from the structure of the
RNN (4) and is presented in the detailed proof provided in SM§C.3. From (6), we can directly compute
that ∂E_n/∂X_n = [y_n − ȳ_n, 0].
Repeated application of the chain rule and a direct calculation with (4) yields,
∂X_n/∂X_k = ∏_{k<i≤n} ∂X_i/∂X_{i−1},   ∂X_i/∂X_{i−1} = [ [ I + Δt B_{i−1},  Δt C_{i−1} ], [ B_{i−1},  C_{i−1} ] ],   (11)

B_{i−1} = δ Δt ( diag(σ'(A_{i−1})) W − I ),   C_{i−1} = δ ( I + Δt diag(σ'(A_{i−1})) 𝒲 ).   (12)
It is straightforward to calculate, using the assumption (8), that ‖B_{i−1}‖_∞ < η and ‖C_{i−1}‖_∞ ≤ η + δ. Using
the definitions of matrix norms and (8), we obtain:

‖∂X_i/∂X_{i−1}‖_∞ ≤ max( 1 + Δt(‖B_{i−1}‖_∞ + ‖C_{i−1}‖_∞),  ‖B_{i−1}‖_∞ + ‖C_{i−1}‖_∞ )
                 ≤ max( 1 + Δt(δ + 2η),  δ + 2η ) ≤ 1 + 3Δt^r,   (13)

and consequently, for the product of Jacobians in (11),

‖∂X_n/∂X_k‖_∞ ≤ ∏_{k<i≤n} ‖∂X_i/∂X_{i−1}‖_∞ ≤ (1 + 3Δt^r)^{n−k} ≈ 1 + 3(n−k)Δt^r.   (14)
Note that we have used an expansion around Δ𝑡 and neglected terms of O (Δ𝑡 2𝑟 ) as Δ𝑡 << 1. We remark
that the bound (13) is the crux of our argument about gradient control as we see from the structure of
the RNN that the recurrent matrices have close to unit norm. In order to complete the proof, one has to
substitute the bound (14) in (10) and estimate the product (and the sum) carefully to obtain the desired
bound (9). This is done in the detailed proof, presented in SM§C.3. As the entire gradient of the loss
function (6), with respect to the weights and biases of the network, is bounded above in (9), the exploding
gradient problem is mitigated for this RNN.
On the vanishing gradient problem. The vanishing gradient problem [33] arises if |∂E_n^{(k)}/∂θ|, defined
in (10), → 0 exponentially fast in k, for k << n (long-term dependencies). In that case, the RNN does
not have long-term memory, as the contribution of the k-th hidden state to the error at time step t_n is
infinitesimally small. We already see from (14) that ‖∂X_n/∂X_k‖_∞ ≈ 1 (independently of k). Thus, we should not
expect the products in (10) to decay fast. In fact, we will provide a much more precise characterization of
this gradient. To this end, we use the following order notation: for quantities α, β ≥ 0, we write β = O(α)
if there exist constants c, C > 0, independent of n, k and Δt, such that c α ≤ β ≤ C α.
For simplicity of notation, we will also set ȳ_n = u_n ≡ 0 for all n, b = 0 and r = 1 in (8), and we will only
consider θ = W_{i,j} for some 1 ≤ i, j ≤ m in the following proposition.
Proposition 3.3. Let y_n be the hidden states generated by the RNN (4). Under the assumption that
y_n^i = O(√t_n), for all 1 ≤ i ≤ m, and (8), the gradient for long-term dependencies (k << n) satisfies,

∂E_n^{(k)}/∂θ = O( ĉ δ Δt^{3/2} ) + O( ĉ δ Δt^{5/2} ) + O( Δt³ ),   with ĉ = sech²( √(kΔt) (1 + Δt) ).   (16)
This precise bound (16) on the gradient shows that although the gradient can be small, i.e. O(Δt^{3/2}), it
is in fact independent of k, ensuring that long-term dependencies contribute to gradients at much later
steps and mitigating the vanishing gradient problem.
Sketch of proof. By an induction argument, detailed in SM§C.5, we can prove the following representation
formula for products of Jacobians in (14):
∂X_n/∂X_k = ∏_{k<i≤n} ∂X_i/∂X_{i−1}
          = [ [ I,   Δt ∑_{j=k}^{n−1} ∏_{i=j}^{k} C_i ],
              [ B_{n−1} + ∑_{j=n−2}^{k} ∏_{i=n−1}^{j+1} C_i B_j,   ∏_{i=n−1}^{k} C_i ] ] + O(Δt).   (17)
Substituting (17) into (10) and computing ∂⁺X_k/∂θ directly from (4) (with ∂E_n/∂X_n = [y_n, 0], as ȳ_n ≡ 0), yields

∂E_n^{(k)}/∂θ = y_n^⊤ Δt² δ Z^{i,j}_{m,d}(A_{k−1}) y_{k−1} + y_n^⊤ Δt² δ C* Z^{i,j}_{m,d}(A_{k−1}) y_{k−1} + O(Δt³),   (18)
with the matrix C* defined as,

C* := ∑_{j=k}^{n−1} ∏_{i=j}^{k} C_i,
and Z^{i,j}_{m,d}(A_{k−1}) ∈ R^{m×d} is a matrix whose entries are all zero except for the (i, j)-th entry, which is set to
σ'(A_{k−1})_i, i.e. the i-th entry of σ'(A_{k−1}). It is straightforward to verify the bound (16) using the definitions
(12), the assumption (8) and elementary but tedious calculations, detailed in SM§C.5.
4 Experiments
We test our proposed RNN architecture on a variety of different learning tasks, ranging from purely synthetic
tasks designed to probe the learning of long-term dependencies (LTDs) to more realistic tasks which also require high
expressivity of the network. Details of the training procedure for each experiment can be found in SM§A.
We wish to clarify here that we use a straightforward hyperparameter tuning protocol and do not use
additional performance enhancing tools such as dropout [36], gradient clipping [33] or batch normalization
[22], which might further improve the performance of coRNNs. Code to replicate the experiments can be
found at https://github.com/tk-rusch/coRNN .
Figure 1: Results of the adding problem for the coRNN, FastRNN, anti.sym. RNN and tanh RNN based
on three different sequence lengths 𝑇, i.e. 𝑇 = 500, 𝑇 = 2000 and 𝑇 = 5000.
Adding problem. We start with the well-known adding problem, first proposed in [21], to test the
ability of an RNN to learn (very) long-term dependencies. The input is a two-dimensional sequence of
length T, with the first dimension consisting of random numbers drawn from U([0, 1]) and with two
non-zero entries in the second dimension, both set to 1 and placed at random positions, one in each
half of the sequence. The output is the sum of the two numbers in the first dimension at the positions
indicated by the 1-entries in the second dimension. Thus, the goal of this experiment is to beat the baseline
of always predicting 1 (the mean of the target), whose mean square error (MSE) equals the target variance of 0.167. We compare
our coRNN to two recently proposed RNNs, which were explicitly designed to learn LTDs, namely the
FastRNN [26] and the antisymmetric (anti.sym.) RNN [6]. To emphasize the challenging nature of this
experiment, we also show the results of a plain-vanilla RNN with tanh activation. All methods have 128
hidden units and the same training protocol is used in all cases. Fig. 1 shows the results for different
lengths 𝑇 of the input sequences. We can see that while the tanh RNN is not able to beat the baseline for
any sequence length, the FastRNN as well as the anti.sym. RNN successfully learn the adding task for
𝑇 = 500. However, in this case, the coRNN converges significantly faster and reaches a lower test MSE
than other tested methods. When setting the length to 𝑇 = 2000, the difficulty of solving the adding
problem increases considerably. In fact, most recent publications only consider lengths of T ≤ 1000. We
can see that in this case, i.e. T = 2000, only the coRNN solves the problem within a reasonable number of
training steps. In order to further demonstrate the superiority of coRNN over recently proposed RNN
architectures for learning LTDs, we consider the adding problem for 𝑇 = 5000. Since all other methods
failed for 𝑇 = 2000, we only train the coRNN on this task. We can see that even in this case, the coRNN
converges very quickly. We thus conclude that the coRNN mitigates the vanishing/exploding gradient
problem for this example, even for very long sequences.
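For reference, the adding-problem data described above can be generated as follows; this is a sketch with our own naming and batch layout, not the exact pipeline used in the experiments.

```python
import numpy as np

def adding_problem_batch(batch_size, T, rng=np.random.default_rng()):
    """Inputs of shape (batch, T, 2) and targets of shape (batch,)."""
    values = rng.uniform(0.0, 1.0, size=(batch_size, T))    # first input dimension: U([0, 1]) samples
    markers = np.zeros((batch_size, T))                      # second input dimension: two 1-entries
    half = T // 2
    i = rng.integers(0, half, size=batch_size)               # one marker in the first half
    j = rng.integers(half, T, size=batch_size)               # one marker in the second half
    rows = np.arange(batch_size)
    markers[rows, i] = 1.0
    markers[rows, j] = 1.0
    targets = values[rows, i] + values[rows, j]               # sum of the two marked values
    return np.stack([values, markers], axis=-1), targets
```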
Table 1: Test accuracies on sMNIST and psMNIST (we provide our own psMNIST result for the FastGRNN,
as no official result for this task has been published so far)
Sequential (permuted) MNIST. Sequential MNIST (sMNIST) [27] is a benchmark for RNNs, in
which the model is required to classify an MNIST [28] digit one pixel at a time leading to a classification
task with a sequence length of 𝑇 = 784. In permuted sequential MNIST (psMNIST), a fixed random
permutation is applied in order to increase the time-delay between interdependent pixels and to make the
problem harder. In Table 1, we compare the test accuracy of the coRNN on sMNIST and psMNIST with
recently published results for other recurrent models which were explicitly designed to solve long-term
dependencies together with baselines corresponding to gated and unitary RNNs. To the best of our
knowledge, the proposed coRNN outperforms all single-layer recurrent architectures published in the
literature, for both sMNIST and psMNIST. Moreover, in Fig. 2, we present the performance (with
respect to the number of epochs) of different RNN architectures for psMNIST with the same fixed random
permutation and the same number of hidden units, i.e. 128. As seen from this figure, coRNN clearly
outperforms the other architectures, some of which were explicitly designed to learn LTDs, for
this permutation.
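For completeness, the (p)sMNIST sequences are obtained simply by flattening each image to a length-784 pixel stream and, for psMNIST, applying one fixed permutation to every sequence; a minimal sketch with our own naming:

```python
import numpy as np

def to_psmnist_sequence(image, permutation=None):
    """Flatten a 28x28 image into a length-784 pixel sequence; permute it for psMNIST."""
    seq = image.reshape(-1).astype(np.float32)        # sMNIST: one pixel per time step
    return seq if permutation is None else seq[permutation]

# The same fixed permutation, e.g. np.random.default_rng(0).permutation(784),
# must be reused for every image in the data set.
```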
Noise-padded CIFAR10. Another challenging test problem for learning LTDs is the recently proposed
noise-padded CIFAR10 experiment [6], in which CIFAR10 data points [25] are fed to the RNN row-wise
and flattened along the channels, resulting in sequences of length 32. To test the long-term memory, entries
of uniform random noise are appended such that the resulting sequences have a length of 1000, i.e. the
last 968 entries of each sequence are only noise to distract the network. Table 2 shows the result for the
coRNN together with other recently published results. We observe that coRNN readily outperforms the
state-of-the-art on this benchmark, while requiring only 128 hidden units.
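The noise-padded sequences can be constructed as sketched below; this is our reading of the description above (row-wise, channels flattened, uniform noise appended), and the exact normalization used in [6] may differ.

```python
import numpy as np

def noise_padded_sequence(image, seq_len=1000, rng=np.random.default_rng()):
    """CIFAR10 image of shape (32, 32, 3) -> sequence of shape (seq_len, 96)."""
    rows = image.reshape(32, 32 * 3).astype(np.float32)                 # 32 rows, channels flattened
    noise = rng.uniform(size=(seq_len - 32, 32 * 3)).astype(np.float32)
    return np.concatenate([rows, noise], axis=0)                        # only the first 32 steps carry signal
```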
Our theoretical guarantees regarding the non-exploding/non-vanishing gradient depend on the weight
assumptions (8), which need to be fulfilled throughout the whole training procedure. We check this
assumption for the noise-padded CIFAR10 experiment in Fig. 3, where we plot the relevant quantities on both
sides of the inequality (8), with r = 1/2. As seen from the figure, although the norms grow slightly during
training, they are well below the needed bound, thus verifying (8) for this example. We also provide a
theoretical argument for why this assumption (8) can be satisfied during training in SM§C.4.

Figure 2: Performance on psMNIST for different models, all with 128 hidden units and the same fixed random permutation.

Figure 3: Weight assumptions (8) evaluated during training for the noise-padded CIFAR10 experiment.
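The two quantities on the left-hand side of (8) are cheap to monitor during training, as done for Fig. 3. A minimal sketch (assuming the two recurrent matrices are exposed as attributes W and Wc, which is our naming):

```python
import torch

def inf_norm(A):
    """Infinity matrix norm, i.e. the maximum absolute row sum."""
    return A.abs().sum(dim=1).max().item()

def cornn_assumption_lhs(W, Wc, dt):
    """The two left-hand-side quantities of assumption (8)."""
    return dt * (1.0 + inf_norm(W)) / (1.0 + dt), dt * inf_norm(Wc) / (1.0 + dt)

# Every few hundred training steps, with r = 1/2 as in Fig. 3:
#   ok = max(cornn_assumption_lhs(model.W, model.Wc, dt)) <= dt ** 0.5
```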
Human activity recognition. This experiment is based on the human activity recognition data set
[1]. The data set is a collection of tracked human activities, which were measured by an accelerometer
and gyroscope on a Samsung Galaxy S3 smartphone. Six activities were binarized to obtain two merged
classes {Sitting, Laying, Walking Upstairs} and {Standing, Walking, Walking Downstairs}, leading to
the HAR-2 data set, which was first proposed in [26]. Table 3 shows the result of the coRNN together
with other very recently published results on the same data set. We can see that the coRNN readily
outperforms all other methods. We also ran this experiment on a tiny coRNN with very few parameters,
i.e. only 1k. We can see that even in this case, the tiny coRNN beats all baselines. We thus conclude that
the coRNN can efficiently be used on resource-constrained IoT micro-controllers.
Table 3: Test accuracies on HAR-2

IMDB sentiment analysis. The IMDB data set [31] is a collection of 50k movie reviews, where 25k
reviews are used for training (with 7.5k of these reviews used for validation) and 25k reviews are used for
testing. The aim of this binary sentiment classification task is to decide whether a movie review is positive
or negative. We use a dictionary size of 25k words and follow the standard procedure of initializing the
word embedding with pretrained 100d GloVe [34] vectors. Table 4 shows the results for the coRNN and
other recently published models, which are trained similarly and have the same number of hidden units, i.e.
128. We can see that the coRNN compares favorably with gated baselines (which are known to perform
very well on this task), while at the same time requiring significantly fewer parameters.
5 Discussion
Inspired by many models in physics, biology and engineering, particularly by circuits of biological neurons
[37, 17], we proposed a novel RNN architecture (3) based on a model (1) of coupled controlled forced
and damped oscillators. For this RNN, we rigorously showed that the hidden states are bounded (5) and
obtained precise bounds on the gradients (Jacobians) of the hidden states, (9) and (16). Thus, by design,
this architecture provably mitigates the exploding and vanishing gradient problem (EVGP), and this
is also verified in a series of numerical experiments. Furthermore, these experiments also demonstrate
that the proposed RNN shows sufficient expressivity for performing complex tasks. In particular, the
results showed that the proposed RNN was comparable to (or better than) other state of the art RNNs
for a variety of tasks including sequential image classification, activity recognition and sentiment analysis.
Moreover, the proposed RNN was able to show comparable performance to other RNNs with significantly
fewer tuning parameters. Thus, we provide a novel and promising strategy for designing RNN architectures
that are motivated by the functioning of biological neural networks, have rigorous bounds on hidden state
gradients and are robust, accurate, straightforward to train and cheap to evaluate.
This work can be extended in many different directions. Our main theoretical focus in this paper
was to demonstrate the mitigation of the exploding and vanishing gradient problem. On the other hand,
we only provided some heuristics and numerical evidence on why the proposed RNN still has sufficient
expressivity. A priori, it is natural to think that the proposed RNN architecture will introduce a strong
bias towards oscillatory functions. However, we argue in SM§B, the proposed coRNN can be significantly
more expressive as the damping, forcing and coupling of several oscillators modulates nonlinear response
to yield a very rich and diverse set of output states. This is also evidenced by the ability of the coRNN
to be comparable to (and better than) the state of the art for all the presented numerical experiments,
which do not have an explicit oscillatory structure. To further investigate the issue of expressivity, we aim
to prove expressivity rigorously in the future by showing some sort of universality of the proposed coRNN
architecture, as in the case of echo state networks in [16]. One possible approach would be to leverage
the ability of the proposed RNN to convert general inputs into a rich set of superpositions of harmonics
(oscillatory wave forms). One might then adapt the approach of Barron [3], where expressing functions
in terms of superpositions of oscillatory functions (Fourier basis) was the key to universality results, to
the context of the proposed RNN. Results on global dynamics of networks of oscillators, reviewed in [39]
might be useful.
The proposed RNN was based on the simplest model of coupled oscillators (1). Much more detailed
models of oscillators are available, particularly those that arise in the modeling of biological neurons,
[37] and references therein. An interesting variant of our proposed RNN would be to base the RNN
architecture on these more elaborate models, resulting in analogues of the spiking neurons model [32] for
RNNs. These models might result in better expressivity than the proposed RNN, while still keeping the
gradients under control. Another avenue of extension is to add gates to the proposed coRNN architecture,
possibly improving expressivity further. Using first principles derivations of gated dynamics [40] would be
instrumental for this task.
Acknowledgements.
The research of SM and TKR was partially supported by the European Research Council Consolidator grant
ERC CoG 770880: COMANFLO.
References
[1] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz. Human activity recognition on
smartphones using a multiclass hardware-friendly support vector machine. In International Workshop
on Ambient Assisted Living, pages 216–223. Springer, 2012.
[2] M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In International
Conference on Machine Learning, pages 1120–1128, 2016.
[3] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE
Transactions on Information Theory, 39, 1993.
[4] V. Campos, B. Jou, X. Giró-i-Nieto, J. Torres, and S. Chang. Skip RNN: learning to skip state
updates in recurrent neural networks. In 6th International Conference on Learning Representations,
ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
[5] M. L. Casado. Trivializations for gradient-based optimization on manifolds. In Advances in Neural
Information Processing Systems, pages 9154–9164, 2019.
[6] B. Chang, M. Chen, E. Haber, and E. H. Chi. AntisymmetricRNN: A dynamical system view on
recurrent neural networks. In 7th International Conference on Learning Representations, ICLR 2019,
New Orleans, LA, USA, May 6-9, 2019, 2019.
[7] S. Chang, Y. Zhang, W. Han, M. Yu, X. Guo, W. Tan, X. Cui, M. Witbrock, M. A. Hasegawa-Johnson,
and T. S. Huang. Dilated recurrent neural networks. In Advances in Neural Information Processing
Systems, pages 77–87, 2017.
[8] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations.
In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
[9] Z. Chen, J. Zhang, M. Arjovsky, and L. Bottou. Symplectic recurrent neural networks. In 8th
International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April
26-30, 2020, 2020.
[10] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase
representations using rnn encoder-decoder for statistical machine translation. In Conference on
Empirical Methods in Natural Language Processing (EMNLP 2014), 2014.
[11] T. Cooijmans, N. Ballas, C. Laurent, Ç. Gülçehre, and A. C. Courville. Recurrent batch normalization.
In 5th International Conference on Learning Representations, ICLR, 2017.
[12] R. Dey and F. M. Salemt. Gate-variants of gated recurrent unit (gru) neural networks. In 2017
IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), pages 1597–1600.
IEEE, 2017.
[13] W. E. A proposal on machine learning via dynamical systems. Commun. Math. Stat, 5:1–11, 2017.
[14] N. B. Erichson, O. Azencot, A. Queiruga, and M. W. Mahoney. Lipschitz recurrent neural networks.
arXiv preprint arXiv:2006.12070, 2020.
[15] S. Greydanus, M. Dzamba, and J. Yosinski. Hamiltonian neural networks. In Advances in Neural
Information Processing Systems, pages 15379–15389, 2019.
[16] L. Grigoryeva and J.-P. Ortega. Echo state networks are universal. Neural Networks, 108:495 – 508,
2018. ISSN 0893-6080.
[17] B.-M. Gu, H. van Rijn, and W. K. Meck. Oscillatory multiplexing of neural population codes for
interval timing and working memory. Neuroscience and Biobehavioral Reviews, 48:160–185, 2015.
[18] J. Guckenheimer and P. Holmes. Nonlinear oscillations, dynamical systems, and bifurcations of vector
fields. Springer Verlag, New York, 1990.
[19] H. Sakaguchi, S. Shinomoto, and Y. Kuramoto. Local and global self-entrainment in oscillator lattices. Progress
of Theoretical Physics, 77:1005–1010, 1987.
[20] M. Henaff, A. Szlam, and Y. LeCun. Recurrent orthogonal networks and long-memory tasks. In
M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on
Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2034–2042, 2016.
[21] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780,
1997.
[22] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal
covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML,
volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org, 2015.
[23] A. Kag, Z. Zhang, and V. Saligrama. RNNs incrementally evolving on an equilibrium manifold:
A panacea for vanishing and exploding gradients? In 8th International Conference on Learning
Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
[24] G. Kerg, K. Goyette, M. P. Touzel, G. Gidel, E. Vorontsov, Y. Bengio, and G. Lajoie. Non-normal
recurrent neural network (nnrnn): learning long time dependencies while improving expressivity with
transient dynamics. In Advances in Neural Information Processing Systems, pages 13591–13601, 2019.
[25] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[26] A. Kusupati, M. Singh, K. Bhatia, A. Kumar, P. Jain, and M. Varma. FastGRNN: A fast, accurate,
stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information
Processing Systems, pages 9017–9028, 2018.
[27] Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear
units. arXiv preprint arXiv:1504.00941, 2015.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[29] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.
[30] S. Li, W. Li, C. Cook, C. Zhu, and Y. Gao. Independently recurrent neural network (IndRNN):
Building a longer and deeper rnn. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 5457–5466, 2018.
[31] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for
sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies, volume 1, pages 142–150. Association for Computational
Linguistics, 2011.
[32] W. Maass. Fast sigmoidal networks via spiking neurons. Neural Computation, 9:279–304, 2001.
[33] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In
Proceedings of the 30th International Conference on International Conference on Machine Learning,
volume 28 of ICML’13, page III–1310–III–1318. JMLR.org, 2013.
[34] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pages 1532–1543, 2014.
[35] Y. Rubanova, R. T. Q. Chen, and D. K. Duvenaud. Latent ordinary differential equations for
irregularly-sampled time series. In Advances in Neural Information Processing Systems 32, pages
5320–5330. 2019.
[36] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple
way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):
1929–1958, 2014.
[37] K. M. Stiefel and G. B. Ermentrout. Neurons as oscillators. Journal of Neurophysiology, 116:
2950–2960, 2016.
[38] S. Strogatz. Nonlinear Dynamics and Chaos. Westview, Boulder CO, 2015.
[39] S. H. Strogatz. Exploring complex networks. Nature, 410:268–276, 2001.
[40] C. Tallec and Y. Ollivier. Can recurrent networks warp time. In International Conference on Learning
Representations, ICLR, 2018.
[41] A. T. Winfree. Biological rhythms and the behavior of populations of coupled oscillators. Journal of
Theoretical Biology, 16:15–42, 1967.
[42] S. Wisdom, T. Powers, J. Hershey, J. Le Roux, and L. Atlas. Full-capacity unitary recurrent neural
networks. In Advances in Neural Information Processing Systems, pages 4880–4888, 2016.
Supplementary Material for:
Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture
for learning long time dependencies
A Training details
The IMDB task was conducted on an NVIDIA GeForce GTX 1080 Ti GPU, while all other experiments
were run on an Intel Xeon E3-1585Lv5 CPU. The weights and biases of the coRNN are randomly initialized
according to U(−1/√n_in, 1/√n_in), where n_in denotes the input dimension of each affine transformation. While
the damping term in the z_n equation of (3) can be treated implicitly or explicitly, i.e. using Δt ε z_n
or Δt ε z_{n−1}, the presented results are based on the explicit form. However, we point out that no major
difference in the results was obtained when using the implicit form instead of the explicit form. Additionally,
instead of treating the parameters Δt, γ and ε as fixed hyperparameters, we can also treat them as trainable
network parameters, constraining Δt to [0, 1] via a sigmoidal activation function and ε, γ > 0 via a
ReLU, for instance (see the sketch below). However, also in this case no major performance difference is obtained. The
hyperparameters are optimized with a random search algorithm, where the results of the best performing
coRNN (based on the validation set) are reported. The ranges of the hyperparameters for the random
search algorithm are provided in Table 5. Table 6 shows the rounded hyperparameters of the best
performing coRNN architecture resulting from the random search algorithm for each learning task. We
used 100 training epochs for the sMNIST and psMNIST problem with additional 20 epochs in which the
learning rate was reduced by a factor of 10. Additionally, we used 100 epochs for the IMDB task and 250
epochs for all other experiments.
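The reparameterization mentioned above for trainable Δt, ε and γ can be realized as in the following sketch; the parameter names are ours, not those of the released code.

```python
import torch
import torch.nn as nn

class TrainableCoRNNScales(nn.Module):
    """Unconstrained parameters mapped to dt in (0, 1) and eps, gamma >= 0."""
    def __init__(self, dt_init=0.0, eps_init=1.0, gamma_init=1.0):
        super().__init__()
        self.dt_raw = nn.Parameter(torch.tensor(float(dt_init)))      # sigmoid(0.0) = 0.5
        self.eps_raw = nn.Parameter(torch.tensor(float(eps_init)))
        self.gamma_raw = nn.Parameter(torch.tensor(float(gamma_init)))

    def forward(self):
        dt = torch.sigmoid(self.dt_raw)        # constrains dt to (0, 1)
        eps = torch.relu(self.eps_raw)         # constrains eps to be non-negative
        gamma = torch.relu(self.gamma_raw)     # constrains gamma to be non-negative
        return dt, eps, gamma
```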
Table 5: Setting for the hyperparameter optimization of the coRNN. Intervals denote ranges of the corre-
sponding hyperparameter for the random search algorithm, while fixed numbers mean that no hyperparameter
optimization was done in this case.
B Heuristics of network function
To see that the RNN (3) models a coupled network of controlled forced, damped nonlinear oscillators, we
start with the single neuron (scalar) case by setting d = m = 1 in (1) and assume an identity activation
function σ(x) = x. Setting W = 𝒲 = V = b = ε = 0 leads to the simple ODE y'' + γy = 0, which exactly
models simple harmonic motion, for instance that of a mass attached to a spring [18]. Letting ε > 0 in (1)
adds damping or friction to the system [18]. Then, by introducing non-zero V in (1), we drive the system
with a driving force proportional to the input signal u(𝑡). The parameters V, b modulate the effect of
the driving force, W controls the frequency of oscillations and 𝒲 the amount of damping in the system.
Finally, the tanh activation mediates a non-linear response in the oscillator. This picture can be readily
generalized, when the full network is considered. Then, each neuron updates its hidden state based on the
input signal as well as information from other neurons. The diagonal entries of W and 𝒲 control the
frequency and amount of damping for each neuron, respectively, whereas the non-diagonal entries of these
matrices modulate interactions between neurons.
At the level of a single neuron, the dynamics of the RNN is relatively straightforward. We start with
the scalar case i.e. 𝑚 = 𝑑 = 1 and illustrate different hidden states y as a function of time, for different
input signals, in Fig. 4. In this figure, we consider two different input signals, one oscillatory signal given
by u(t) = cos(4t) and another is a combination of step functions. First, we plot the solution y(t) of (1),
with the parameters V, b, W, 𝒲, ε = 0 and γ = 1. This simply corresponds to the case of a simple harmonic
oscillator (SHO) and the solution is described by a sine wave with the natural frequency of the oscillator.
Next, we introduce forcing by the input signal by setting V = 1 with the identity activation function
σ(x) = x, leading to a forced harmonic oscillator (FHO). As seen from Fig. 4, in the case of an oscillatory
signal, this leads to a very minor change over the SHO, whereas for the step function, the change is only
in the amplitude of the wave. Next, we add damping by setting ε = 0.25 and see that the resulting forced
damped oscillator (FDO) merely damps the amplitude of the waves, without changing their frequency.
Then, we consider the case of a controlled oscillator (CFDO) by setting W = −2, V = 2, b = 0.25, 𝒲 = 0.75.
As seen from Fig. 4, this leads to a significant change in the wave form in both cases. For the oscillatory
input, the output is now a superposition of many different forms, with different amplitudes and frequencies
(phases) whereas for the step function input, the phase is shifted. Already, we can see that for a linear
controlled oscillator, the output can be very complicated with the superposition of different waves. This
holds true when the activation function is set to 𝜎(𝑥) = tanh(𝑥) (which is our proposed coRNN). For both
inputs, the output is a modulated version of the one generated by CFDO, expressed as a superposition of
waves. On the other hand, we also plot the solution with a Duffing type oscillator (DUFF) by setting the
activation function as,
σ(x) = x − x³/3.   (19)
In this case, the solution is very different from the CFDO and coRNN solutions and is heavily damped
(either in the output or its derivative). On the other hand, given the chaotic nature of the dynamical
system in this case, a slight change in the parameters led to the output blowing up. Thus, a bounded
nonlinearity seems essential in this context.
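The single-neuron curves in Fig. 4 are easy to reproduce by integrating (1) with the same IMEX scheme as in (3). The sketch below uses the CFDO parameters quoted above (W = −2, 𝒲 = 0.75, V = 2, b = 0.25), with ε = 0.25 and γ = 1 carried over from the earlier cases (our reading of the setup), and the oscillatory input u(t) = cos(4t).

```python
import numpy as np

dt, eps, gamma = 0.01, 0.25, 1.0                      # eps from the FDO case, gamma from the SHO case
W, Wc, V, b = -2.0, 0.75, 2.0, 0.25                   # CFDO parameters quoted in the text
y, z, trajectory = 0.0, 0.0, []
for n in range(1000):                                 # integrate up to t = 10
    u = np.cos(4.0 * n * dt)                          # oscillatory input signal
    a = np.tanh(W * y + Wc * z + V * u + b)           # tanh response: the coRNN curve
    z = (z + dt * (a - gamma * y)) / (1.0 + dt * eps)
    y = y + dt * z
    trajectory.append(y)
# Replacing np.tanh by the identity gives the CFDO curve, and by x - x**3/3 the Duffing-type (DUFF) one.
```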
Coupling neurons together further accentuates this generation of superpositions of different wave-forms,
as seen even with the simplest case of a network with two neurons, shown in Fig. 4 (Bottom row). For this
figure, we consider two neurons, i.e. m = 2, and two different network topologies. For the first, we only allow
the first neuron to influence the second one and not vice versa. This is enforced with the weight matrices,
W = [ [−2, 0], [3, −2] ],   𝒲 = [ [0.75, 0], [−1, 0.75] ].
We also set V = [2, 2]^⊤, b = [0.25, 0.25]^⊤. Note that in this case (which we name ORD, for ordered
connections), the output of the first neuron should be exactly the same as in the uncoupled (UC) case,
whereas there is a distinct change in the output of the second neuron: the first neuron has modulated a
sharp change in the resulting output wave form. This is well illustrated by the emergence of an
approximation to the step function (Bottom Right of Fig. 4), even though the input signal is oscillatory.
Next, we consider the case of fully connected (FC) neurons by setting the weight matrices as,

W = [ [−2, 1], [3, −2] ],   𝒲 = [ [0.75, 0.3], [−1, 0.75] ].
The resulting outputs for the first neuron are now slightly different from the uncoupled case. On the
other hand, the approximation of a step-function output for the second neuron is further accentuated.
Even these simple examples illustrate the functioning of a network of controlled oscillators well. The
input signal is converted into a superposition of waves with different frequencies and amplitudes, with
these quantities being controlled by the weights and biases in (1). Thus, very complicated outputs can be
generated by modulating the number, frequencies and amplitudes of the waves. In practice, a network
of a large number of neurons is used and can lead to extremely rich global dynamics, along the lines of
emergence of synchronization or bistable heterogeneous behavior seen in systems of idealized oscillators
and explained by their mean field limit, see [19, 41, 39]. Thus, we argue that the ability of the network of
(forced, driven) oscillators to access a very rich set of output states can lead to high expressivity of the
system. The training process selects the weights that modulate frequencies, phases and amplitudes of
individual neurons and their interaction to guide the system to its target output.
Figure 4: Illustration of the hidden state y of the coRNN (3) for a scalar input signal u (Top Left and Middle Left), for one neuron with state y (Top Right and Middle Right) and for two neurons with states y1 (Bottom Left) and y2 (Bottom Right), corresponding to the scalar input signal shown in the Top Left. Legend: SHO (simple harmonic oscillator), FHO (forced oscillator), FDO (forced and damped oscillator), CFDO (controlled forced and damped oscillator), DUFF (Duffing type), UC (uncoupled), ORD (ordered coupling) and FC (fully coupled); the legend is explained in the text.
C.1 Proof of Proposition 3.1
We take the inner product of (3) with (y_{n−1}, z_n), i.e. we multiply the y-equation by y_{n−1}^⊤ and the z-equation by z_n^⊤, and use elementary identities to obtain,

(y_n^⊤ y_n + z_n^⊤ z_n)/2 = (y_{n−1}^⊤ y_{n−1} + z_{n−1}^⊤ z_{n−1})/2 + (y_n − y_{n−1})^⊤ (y_n − y_{n−1})/2
                           − (z_n − z_{n−1})^⊤ (z_n − z_{n−1})/2 + Δt z_n^⊤ σ(A_{n−1}) − Δt z_n^⊤ z_n
                         ≤ (y_{n−1}^⊤ y_{n−1} + z_{n−1}^⊤ z_{n−1})/2 + Δt (1/2 + Δt/2 − 1) z_n^⊤ z_n + (Δt/2) σ(A_{n−1})^⊤ σ(A_{n−1})
                         ≤ (y_{n−1}^⊤ y_{n−1} + z_{n−1}^⊤ z_{n−1})/2 + m Δt/2,   as σ² ≤ 1 and Δt < 1.

Iterating the above inequality over n, and using y_0 = z_0 = 0, leads to the energy bound (5).
Proposition C.1. Let y_n, z_n be the hidden states of the trained RNN (4) with respect to the input
u = {u_n}_{n=1}^N and let ȳ_n, z̄_n be the hidden states of the same RNN (4), but with respect to the input
ū = {ū_n}_{n=1}^N; then the differences in the hidden states are bounded, in terms of the differences in the
inputs, by the estimate (21).

The proof of this proposition is completely analogous to the proof of Proposition 3.1: we subtract

ȳ_n = ȳ_{n−1} + Δt z̄_n,
z̄_n = z̄_{n−1}/(1+Δt) + (Δt/(1+Δt)) σ(Ā_{n−1}) − (Δt/(1+Δt)) ȳ_{n−1},   Ā_{n−1} := W ȳ_{n−1} + 𝒲 z̄_{n−1} + V ū_n + b,   (22)

from (4) and take the inner product of the difference with (y_n − ȳ_n, z_n − z̄_n). The estimate (21) then
follows exactly as in the proof of (5).
Similarly, from (3) we calculate,

∂⁺X_k/∂θ =
  ( (Δt²/(1+εΔt)) Z^{i,j}_{m,m}(A_{k−1}) y_{k−1},   (Δt/(1+εΔt)) Z^{i,j}_{m,m}(A_{k−1}) y_{k−1} )   if θ = (i,j)-th entry of W,
  ( (Δt²/(1+εΔt)) Z^{i,j}_{m,m}(A_{k−1}) z_{k−1},   (Δt/(1+εΔt)) Z^{i,j}_{m,m}(A_{k−1}) z_{k−1} )   if θ = (i,j)-th entry of 𝒲,
  ( (Δt²/(1+εΔt)) Z^{i,j}_{m,d}(A_{k−1}) u_k,       (Δt/(1+εΔt)) Z^{i,j}_{m,d}(A_{k−1}) u_k )       if θ = (i,j)-th entry of V,
  ( (Δt²/(1+εΔt)) Z^{i,1}_{m,1}(A_{k−1}),           (Δt/(1+εΔt)) Z^{i,1}_{m,1}(A_{k−1}) )           if θ = i-th entry of b,
                                                                                                     (24)

where Z^{i,j}_{m,d}(A_{k−1}) ∈ R^{m×d} is a matrix whose entries are all zero except for the (i, j)-th entry, which is set
to σ'(A_{k−1})_i, i.e. the i-th entry of σ'(A_{k−1}). We easily see that ‖Z^{i,j}_{m,d}(A_{k−1})‖_∞ ≤ 1 for all i, j, d, m.
We will estimate the above term just for the case where θ is an entry of W; the remaining terms are very
similar to estimate. As ȳ is the ground truth, we assume that it is bounded and can neglect it in the
estimate. Also, for simplicity of notation, we let k − 1 ≈ k and aim to estimate the term,
|∂E_n^{(k)}/∂θ| ≤ ‖y_n‖_∞ ‖y_k‖_∞ (1 + 3(n−k)Δt^r) δ Δt
             ≤ m √(nk) Δt (1 + 3(n−k)Δt^r) δ Δt        (by (5))
             ≤ m √(nk) δ Δt² + 3 m √(nk) (n−k) δ Δt^{r+2}.   (26)
To further analyze the above estimate, we assume that 𝑛Δ𝑡 = 𝑡 𝑛 ≤ 1 and consider two different regimes.
Let us start by considering short-term dependencies by letting k ≈ n, i.e. n − k = c with a constant c ∼ O(1),
independent of n, k. In this case, a straightforward application of the above assumptions in the bound
(26) yields,
|∂E_n^{(k)}/∂θ| ≤ m √(nk) δ Δt² + 3 m √(nk) (n−k) δ Δt^{r+2}
             ≤ m t_n δ Δt + m c t_n δ Δt^{r+1}
             ≤ C t_n m δ Δt        (as r ≥ 1/2)
             ≤ C m δ Δt,   (27)
for a constant 𝐶 independent of 𝑛, 𝑘.
Next, we consider long-term dependencies by setting 𝑘 << 𝑛 and estimating,
|∂E_n^{(k)}/∂θ| ≤ m √(nk) δ Δt² + 3 m √(nk) (n−k) δ Δt^{r+2}
             ≤ m √(t_n) δ Δt^{3/2} + 3 m t_n^{3/2} δ Δt^{r+1/2}
             ≤ m δ Δt^{3/2} + 3 m δ Δt^{r+1/2}        (as t_n < 1)
             ≤ C m δ Δt^{r+1/2}        (as r ≤ 1).   (28)
Thus, in all cases, we have that,

|∂E_n^{(k)}/∂θ| ≤ C m δ Δt        (as r ≥ 1/2).   (29)
Applying the above estimate in (10) allows us to bound the gradient by,
|∂E_n/∂θ| ≤ ∑_{1≤k≤n} |∂E_n^{(k)}/∂θ| ≤ C n m δ Δt = C m δ t_n.   (30)
Therefore, the gradient of the loss function (6) can be bounded as,
|∂E/∂θ| ≤ (1/N) ∑_{n=1}^N |∂E_n/∂θ| ≤ (C m δ/N) ∑_{n=1}^N t_n = (C m δ Δt/N) ∑_{n=1}^N n ≤ C m δ N Δt = C m δ,   (31)

which is the desired bound (9).
In order to estimate the minimum number of steps L of the gradient descent method (33) after which the
condition (8) could be violated, we set ℓ = L in (33); applying it to the condition (8) leads to the straightforward
estimate,

L ≥ 1/( C ζ m² Δt^{1−r} δ² ).   (34)
Note that the constant 𝐶 ∼ O (1) and the parameter 𝛿 < 1, while in general, the learning rate 𝜁 << 1.
Thus, as long as 𝑟 ≤ 1, we see that the assumption (8) holds for a very large number of steps of the
gradient descent method. We remark that the above estimate (34) is a significant underestimate of L. In the
experiments presented in this article, we are able to take a very large number of training steps, while the
quantities in (8) remain within the required range (see Fig. 3).
We write the Jacobians in (11) as ∂X_i/∂X_{i−1} = M_{i−1} + Δt M̃_{i−1}, with

M_{i−1} = [ [ I,  Δt C_{i−1} ], [ B_{i−1},  C_{i−1} ] ],   M̃_{i−1} = [ [ B_{i−1}, 0 ], [ 0, 0 ] ],

and B, C defined in (12). By the assumption (8) (with r = 1), one can readily check that ‖M̃_{i−1}‖_∞ ≤ Δt, for all
k ≤ i ≤ n − 1.

We will use an induction argument to show the representation formula (17). We start with the outermost
product and calculate,

(∂X_n/∂X_{n−1}) (∂X_{n−1}/∂X_{n−2}) = (M_{n−1} + Δt M̃_{n−1}) (M_{n−2} + Δt M̃_{n−2})
                                    = M_{n−1} M_{n−2} + Δt (M̃_{n−1} M_{n−2} + M_{n−1} M̃_{n−2}) + O(Δt²).

Using the definitions in (12) and (8), we can easily see that

[ [ C_{n−1} B_{n−2}, 0 ], [ 0, B_{n−1} C_{n−2} ] ] = O(Δt).

Similarly, it is easy to show that M̃_{n−1} M_{n−2}, M_{n−1} M̃_{n−2} ∼ O(Δt).
Using the representation (18), we can write, for θ = W_{i,j},

∂E_n^{(k)}/∂θ = δ Δt² σ'(a^i_{k−1}) y_n^i y^j_{k−1} + δ Δt² σ'(a^i_{k−1}) ( ∑_{ℓ=1}^m C*_{ℓi} y_n^ℓ ) y^j_{k−1} + O(Δt³),   (35)

with y_n^j denoting the j-th element of the vector y_n, the matrix C* defined in (18), and

a^i_{k−1} := ∑_{ℓ=1}^m W_{iℓ} y^ℓ_{k−1} + ∑_{ℓ=1}^m 𝒲_{iℓ} z^ℓ_{k−1}.   (36)
Note that, by the assumption (8), ‖W‖_∞, ‖𝒲‖_∞ ≤ 1 + Δt. Therefore, by the fact that σ' = sech², the
assumption y^i_k = O(√t_k) and (36), we obtain,

ĉ = sech²( √(kΔt) (1 + Δt) ) ≤ σ'(a^i_{k−1}) ≤ 1.   (37)
It is easy to see from the definition of C_i in (12) that

∏_{i=j}^{k} C_i = δ^{j−k+1} I + O( δ^{j−k+1} Δt^{j−k+1} ).

Summing over j and using the fact that k << n, we obtain that

C* = ( δ/(1−δ) ) I + O(Δt) = (1/Δt) I + O(Δt).   (39)
Plugging (39) and (37) into (35) leads to,

δ Δt² σ'(a^i_{k−1}) ( ∑_{ℓ=1}^m C*_{ℓi} y_n^ℓ ) y^j_{k−1} = O( ĉ δ Δt^{3/2} ) + O(Δt³).   (40)