Handwriting Recognition With Large Multidimensional Long Short-Term Memory Recurrent Neural Networks
Abstract—Multidimensional long short-term memory recurrent neural networks achieve impressive results for handwriting recognition. However, with current CPU-based implementations, their training is very expensive and thus their capacity has so far been limited. We release an efficient GPU-based implementation which greatly reduces training times by processing the input in a diagonal-wise fashion. We use this implementation to explore deeper and wider architectures than previously used for handwriting recognition and show that especially the depth plays an important role. We outperform state of the art results on two databases with a deep multidimensional network.

Keywords-MDLSTM; LSTM; Long Short-Term Memory; Recurrent Neural Network; Handwriting Recognition;

I. INTRODUCTION

Neural networks have become a key component in modern handwriting and speech recognition systems. While feedforward neural networks only use a limited and fixed amount of context of the input, recurrent neural networks (RNNs) can in principle make use of an arbitrary amount of context by storing information in their internal state. In particular, long short-term memory recurrent neural networks (LSTM-RNNs) have been very successful [1], [2], [3]. The LSTM architecture allows the network to store information for longer amounts of time and avoids vanishing and exploding gradients [4]. While normal LSTM-RNNs only use a recurrence over one dimension (the x-axis of an image or the time-axis for audio), multidimensional long short-term memory recurrent neural networks (MDLSTM-RNNs) [5] use a recurrence over both axes of an input image, allowing them to model the writing variations on both axes and to directly work on raw input images.

Recently, handwriting recognition competitions were won by MDLSTM-RNNs (e.g. [6], [7]) and very recently MDLSTM networks have also been shown to yield promising results for speech recognition [8]. However, the MDLSTM networks used for handwriting recognition in prior work, e.g. Pham et al. [9] who also use the same databases as we do, seem to be relatively small. One reason for this might be that usually CPU implementations are used for training, which lead to high runtimes; e.g. Strauß et al. report that the training of a single network usually lasts several weeks [6]. To the best of our knowledge, so far there is no publicly available GPU implementation of MDLSTM. In this work, we fill this gap and create an efficient GPU-based implementation which is described in Section IV and made publicly available.

We show that for deeper networks, a simple weight initialization scheme with fixed standard deviation results in convergence issues which can be solved by using the initialization scheme by Glorot et al. [10]. Furthermore, we use our implementation to train much larger and deeper networks than typically used for handwriting recognition and show that the results can thereby be substantially improved.

II. MULTIDIMENSIONAL LONG SHORT-TERM MEMORY FOR HANDWRITING RECOGNITION

A multidimensional recurrent neural network (MDRNN) is a generalization of a recurrent neural network, which can deal with higher-dimensional data such as videos (3D) or images (2D). Here we restrict ourselves to the two-dimensional case which is commonly used for handwriting recognition tasks. A 2D-RNN scans the input image along both axes and produces a transformed output image of the same size. The hidden state h(u, v) for position (u, v) of an MDRNN layer is computed based on the previous hidden states h(u−1, v) and h(u, v−1) of both axes and the current input x(u, v) by

h(u, v) = σ(W x(u, v) + U h(u−1, v) + V h(u, v−1) + b),

where W, U and V are weight matrices, b a bias vector and σ a nonlinear activation function. Like in the 1D case, MDLSTM introduces an internal cell state for each spatial position which is protected by several gates. The use of LSTM allows the network to exploit more context and leads to more stable training. It is common practice to use four parallel MDLSTM layers which each process the input in one of the four possible directions, e.g. from the top left to the bottom right. The four directions are later combined so that at every spatial position the full context from all directions is available.
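As a concrete illustration, the following minimal NumPy sketch implements the plain MDRNN recurrence above for a single scan direction (top left to bottom right), with tanh as the activation σ. The LSTM gating and the combination of the four scan directions described in the text are omitted, and all names and sizes are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def mdrnn_forward(x, W, U, V, b):
    """Single-direction 2D-RNN scan:
    h(u,v) = tanh(W x(u,v) + U h(u-1,v) + V h(u,v-1) + b).

    x: input of shape (height, width, n_in); W: (n_hid, n_in);
    U, V: (n_hid, n_hid); b: (n_hid,).
    Returns hidden states h of shape (height, width, n_hid).
    """
    height, width, _ = x.shape
    n_hid = b.shape[0]
    h = np.zeros((height, width, n_hid))
    for u in range(height):
        for v in range(width):
            h_up = h[u - 1, v] if u > 0 else np.zeros(n_hid)
            h_left = h[u, v - 1] if v > 0 else np.zeros(n_hid)
            h[u, v] = np.tanh(W @ x[u, v] + U @ h_up + V @ h_left + b)
    return h

# toy usage with random weights (sizes are arbitrary)
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 12, 3))
W = rng.normal(size=(5, 3)) * 0.1
U = rng.normal(size=(5, 5)) * 0.1
V = rng.normal(size=(5, 5)) * 0.1
b = np.zeros(5)
print(mdrnn_forward(x, W, U, V, b).shape)  # (8, 12, 5)
```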
Figure 1: The basic network architecture used in this paper. The input image on the left is processed pixel-by-pixel using
a cascade of convolutional, max-pooling and MDLSTM layers, and finally transcribed by a CTC layer on the right. Figure
adapted from Pham et al. [9].
Figure 2: MDLSTM dependencies and order of computation. (a) The incoming arrows to a pixel have their origins at pixels which are needed for the computation of the current pixel. (b) Naive order of computation: the numbers indicate the order of computation which is from top to bottom, from left to right, one pixel at a time. (c) Diagonal order of computation: all pixels on a common diagonal are computed at the same time.
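The diagonal order of computation in Figure 2(c) is what enables an efficient GPU implementation: all pixels on a common anti-diagonal depend only on the previous diagonals, so an entire diagonal can be updated in one batched operation. The sketch below only illustrates this scheduling idea in NumPy for the plain 2D-RNN recurrence from Section II; it is not the paper's actual GPU kernel, and the batched matrix calls stand in for what would run in parallel on the GPU.

```python
import numpy as np

def mdrnn_forward_diagonal(x, W, U, V, b):
    """Same recurrence as the naive scan, but processed diagonal by diagonal.

    All positions (u, v) with u + v == d depend only on diagonal d - 1,
    so the whole diagonal is updated with one batched matrix operation
    (on a GPU, this batch maps to parallel threads).
    """
    height, width, _ = x.shape
    n_hid = b.shape[0]
    h = np.zeros((height, width, n_hid))
    for d in range(height + width - 1):            # anti-diagonal index
        us = np.arange(max(0, d - width + 1), min(height, d + 1))
        vs = d - us                                # all (u, v) with u + v == d
        h_up = np.where((us > 0)[:, None], h[us - 1, vs], 0.0)
        h_left = np.where((vs > 0)[:, None], h[us, vs - 1], 0.0)
        h[us, vs] = np.tanh(x[us, vs] @ W.T + h_up @ U.T + h_left @ V.T + b)
    return h
```

With this ordering, the number of sequential steps drops from one per pixel to one per diagonal, since every pixel on a diagonal is handled in the same batched step.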
Table I: Speed and memory consumption for different batch sizes (i.e. an upper limit on the number of pixels in a mini-batch) using the basic model. The runtime column gives the duration for one epoch on the IAM corpus.

Batch size    Imgs/batch    Runtime [min]    Pixels/sec    Memory [GB]
1 image       1.00          54.3             0.38M         1.06
0.5M          1.53          41.5             0.49M         1.06
1.0M          3.26          24.8             0.82M         1.66
2.5M          7.81          16.5             1.24M         3.93
3.5M          10.64         15.0             1.36M         5.64
5.0M          14.88         14.5             1.41M         9.11

Without our GPU-based implementation, the experiments in this paper would not have been possible in a reasonable amount of time.

V. TRAINING AND WEIGHT INITIALIZATION

For optimization, we use Adam [22] with incorporated Nesterov momentum [23]. We start with a learning rate of 0.0005 and decrease it to 0.0003 in epoch 25 and to 0.0001 in epoch 35. Note that for a minibatch of multiple images, we sum over the CTC losses for every image and do not normalize by the batch size. We do not use any truncation or clipping of gradients. All following experiments are done on GTX 980 GPUs. In order to stay within the 4GB GPU memory limit for large networks, we restrict the batch size to 600k pixels for all networks, to keep the results comparable. During the training, we measure the CTC objective function value and the label error rate, i.e. the lowest character error rate of the network itself without lexicon or language model, on a holdout set of 10% of the training data. Training stops when both measures do not improve for 20 epochs. For most networks, less than 80 epochs were necessary.
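The optimizer and the CTC loss themselves are not reimplemented here; the following is only a small sketch of the learning rate schedule and the stopping criterion described above (epoch counting is assumed to start at 1):

```python
def learning_rate(epoch):
    """Step schedule from the text: 5e-4, then 3e-4 from epoch 25, 1e-4 from epoch 35."""
    if epoch >= 35:
        return 0.0001
    if epoch >= 25:
        return 0.0003
    return 0.0005

def should_stop(ctc_history, label_error_history, patience=20):
    """Stop when neither the CTC objective nor the holdout label error rate
    has improved for `patience` epochs."""
    def epochs_since_best(history):
        return len(history) - 1 - history.index(min(history))
    return (epochs_since_best(ctc_history) >= patience and
            epochs_since_best(label_error_history) >= patience)
```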
Unless stated otherwise, the images are used as input without any resizing; only a padding of 15 white pixels is added on all four sides of the images. The gray values have been linearly rescaled to values between 0 and 1. Images for which the output sequence has a shorter length than the number of characters in the reference are a problem during training, as they cannot be transcribed correctly by the CTC output layer. Hence, we decided to remove the 10 affected training images from RIMES, while on IAM this problem didn't occur.

In Pham et al. [9], all network weights are initialized by a normal distribution with zero mean and standard deviation 0.01 (called normal initialization in the following). Especially when trying to train deeper networks, we often experienced slow convergence with this initialization scheme. Hence, we tried the popular initialization scheme proposed by Glorot et al. [10]. In this scheme, the weights are initialized with a uniform distribution given by

W ∼ U(−√6/√(n_in + n_out), √6/√(n_in + n_out)),

where n_in and n_out are the number of inputs and outputs of the layer.
We performed several runs of training with a relatively large network with 9 hidden layers and both weight initialization schemes to study the effect of the weight initialization for deep networks (see Fig. 3). The difference between multiple curves for the same initialization is only the random seed used for initialization. In the case of the normal initialization, the seed has a strong influence on the training progress, and convergence is often very slow. In contrast, when using the initialization by Glorot et al. [10], the different runs show little variance and converge much faster. We also experimented with an orthogonal initialization based on Saxe et al. [24], which worked substantially better than the normal initialization, but slightly worse than the Glorot initialization. Consequently, we adopted the initialization by Glorot et al. for all further experiments.
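For reference, a minimal NumPy sketch of the two initialization schemes compared above (the matrix shape in the example is illustrative; the actual layer sizes are discussed in Section VI):

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng=None):
    """Glorot/Xavier initialization [10]:
    W ~ U(-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out)))."""
    if rng is None:
        rng = np.random.default_rng(0)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def normal_init(n_in, n_out, std=0.01, rng=None):
    """Fixed-std 'normal initialization' as in Pham et al. [9]: zero mean, std 0.01."""
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.normal(0.0, std, size=(n_out, n_in))

# example: a weight matrix with 75 inputs and 120 outputs
print(glorot_uniform(75, 120).std(), normal_init(75, 120).std())
# the Glorot scale adapts to the layer size, the fixed-std scale does not
```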
Figure 3: Comparison of weight initializations on the IAM database. The green dotted curves show the training progress in terms of label error rate for networks initialized with a normal distribution with fixed standard deviation and different random seeds. The red lines show the training progress with Glorot initialization and different seeds. The Glorot initialization leads to much faster convergence and less dependence on the seed. (Plot: label error rate over epochs 0 to 100 for the "Normal init" and "Glorot init" runs.)

VI. EXPERIMENTS

In order to study the effect of preprocessing and network topology, particularly the effect of the width and depth of the network, we conduct experiments on the IAM database introduced in Section III. After identifying a well-performing setup, we also evaluate it on the RIMES database to verify that the setup performs well across different databases.

For recognition, we used the RWTH Aachen University Speech Recognition System [25], [26]. We emulated the CTC topology in a hybrid hidden Markov model fashion by expressing the state emission probabilities as rescaled posteriors. We then used a regular HMM decoder with an appropriately adjusted number of states and transition probabilities. All recognitions were performed on paragraph level to include additional language model context. We performed recognitions with the models from different epochs including the epochs with the lowest CTC objective function value and the lowest label error rate. Additionally, the language model scale, prior scale and word insertion penalty were tuned to minimize the word error rate on the development set. Performance is measured in word error rate (WER) and character error rate (CER).

A. Preprocessing

In Bluche et al. [27], the authors found that preprocessing was not helpful for their MDLSTM optical model on an Arabic handwriting recognition task, while in Strauß et al. [6], several preprocessing steps were used. Hence, we decided to perform further experiments. In a first experiment for preprocessing, we considered deslanting the line images and contrast normalization using the algorithms from Kozielski et al. [28]. We used the basic network topology to compare all four possible combinations of these preprocessing methods in Table II. Contrast normalization in isolation provides a small gain, but using only deslanting yielded better results than combining both. Consequently, for all further experiments, we used the deslanted images without contrast normalization.

Table II: Comparison of preprocessing methods. Deslanting alone without contrast normalization yields the best results.

Preprocessing method          WER[%]         CER[%]
                              dev    eval    dev    eval
Raw                           8.5    11.0    3.0    4.3
Deslanted                     8.0    10.2    2.5    3.8
Contrast norm.                8.2    10.7    2.8    4.1
Deslanted + contrast norm.    8.7    10.3    2.7    3.9

B. Topology

Compared to one-dimensional LSTM networks, which commonly have more than 10 million parameters (e.g. [2]), the MDLSTM networks used for handwriting recognition are usually relatively small, possibly also because the sizes of the networks are restricted by the high runtime of the used CPU implementations. For example, the network used in Pham et al. [9] has roughly 142k parameters, while our basic model has 766k parameters. Hence, we study the effect of width and depth on the WER.

In a first experiment, we vary the number of hidden units per layer from 5n to 30n while keeping the number of hidden layers fixed. The results in Table III show that too small a hidden layer size severely hurts the recognition performance, but when the layer sizes are large enough, further increasing them yields little difference and can even hurt.

Next, we vary the number of hidden layers and the position of the max pooling operations while keeping the hidden layer sizes fixed at 15n, where n is the layer index.
Table III: Comparison of different network widths. 5n means that the number of hidden units in the nth layer is 5n.

Network    Params    WER[%]         CER[%]
                     dev    eval    dev    eval
5n         88k       9.9    12.1    3.3    4.6
10n        342k      8.4    10.3    2.7    3.9
15n        766k      8.0    10.2    2.5    3.8
20n        1.35M     8.3    10.5    2.5    3.9
30n        3.04M     8.3    10.2    2.6    3.8

When changing the network topology, we always stack alternating convolution and MDLSTM layers as before. The combination of a convolutional layer and an MDLSTM layer can be seen as a building block; only the number of these blocks and the presence or absence of max pooling for each block is changed. We describe a network architecture by a string like LP-LP-LP-L-L, where LP is such a building block with max pooling and L is a building block without max pooling. Note that the number of parameters also strongly increases when increasing the depth, as the layer size is scaled proportionally to the layer index. Since we are mainly interested in the effect of the depth here and we showed before that simply increasing the width of the network can sometimes even hurt performance, we decided to limit the number of hidden units per layer to a maximum of 120.
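To make the LP/L notation and the size rule concrete, the sketch below expands an architecture string into its layer sequence, growing the hidden size as 15n and capping it at 120 as described above. It is only an illustration of the notation: treating the convolutional and MDLSTM layers of a block as two consecutive hidden layers that both follow the 15n rule is our reading of the text, and other hyperparameters (filter sizes, pooling factors, the four-directional MDLSTM combination) are not modeled.

```python
def expand_topology(spec, units_per_index=15, max_units=120):
    """Expand a string like "LP-LP-LP-L-L" into a layer list.

    Each block is a convolutional layer followed by an MDLSTM layer
    (two hidden layers); "LP" blocks additionally apply max pooling.
    The nth hidden layer gets 15n units, limited to at most 120.
    """
    layers = []
    n = 0  # hidden layer index
    for token in spec.split("-"):
        assert token in ("L", "LP"), f"unknown block type: {token}"
        for kind in ("conv", "mdlstm"):
            n += 1
            layers.append((kind, min(n * units_per_index, max_units)))
        if token == "LP":
            layers.append(("max_pool", None))
    return layers

# the best-performing topology from Table IV: five blocks, i.e. ten hidden layers
for layer in expand_topology("LP-LP-LP-L-L"):
    print(layer)
# hidden sizes grow as 15, 30, 45, ..., with the deepest layers capped at 120
```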
The results in Table IV indicate that the positions of the max pooling layers only play a minor role, while increasing the depth from two to a total of five blocks of convolution and MDLSTM, i.e. ten hidden layers, greatly improves the results from a WER of 10.2% to 9.3% on the evaluation set. However, the 12-layer network again performs worse.

Table IV: Comparison of different network topologies. The depth of the network and the position of max pooling is varied. LP stands for a block of convolution and MDLSTM with max pooling and L stands for such a block without pooling. The deep networks with ten hidden layers achieve the best results.

Network           Hidden layers    Params    WER[%]         CER[%]
                                             dev    eval    dev    eval
LP-LP-LP          6                766k      8.0    10.2    2.5    3.8
LP-L-LP-LP        8                1.68M     8.0    10.0    2.5    3.6
LP-LP-L-LP        8                1.68M     8.2    10.4    2.8    4.0
LP-LP-LP-L        8                1.68M     8.4    10.4    2.5    3.7
LP-LP-LP-L-L      10               2.63M     7.1    9.3     2.4    3.5
LP-L-LP-L-LP      10               2.63M     7.2    9.3     2.4    3.5
LP-L-LP-L-LP-L    12               3.81M     8.1    9.7     2.6    3.6

In addition to changing the width and the depth of the network, we also tried to replace the tanh nonlinearity for the convolutional layers with rectified linear units and the recently proposed exponential linear unit [29], but did not observe improvements.

From these experiments, it can be seen that tuning the network topology, in particular the depth of the network, is important to achieve good performance, and that just strongly increasing the width does not lead to good results. Clearly, there is still a lot of room for improvement of the architecture.

C. Final Results

Our best result on IAM is achieved by the ten-layer network of the last subsection, which yields a WER of 7.1% on the development set and 9.3% on the evaluation set. Table V compares our results to previously published results on IAM. Doetsch et al. [2] used a modified 1D LSTM network architecture. Voigtlaender et al. [30] used 1D LSTM networks and applied a sequence-discriminative training criterion which could also be applied to our model for further improvements. Pham et al. [9] used a smaller MDLSTM optical model and no open vocabulary approach for recognition. For a fairer comparison, we also performed a closed vocabulary recognition which led to WERs of 10.1% on the development set and 11.7% on the evaluation set.

Table V: Comparison of the proposed system to results reported by other groups on the IAM database.

System                      WER[%]         CER[%]
                            dev    eval    dev    eval
Our system                  7.1    9.3     2.4    3.5
Doetsch et al. [2]          8.4    12.2    2.5    4.7
Voigtlaender et al. [30]    8.7    12.7    2.6    4.8
Pham et al. [9]             11.2   13.6    3.7    5.1

For RIMES, we trained the same network which achieved the best result on IAM and used the same preprocessing (i.e. only deslanting). This setup yielded a WER of 11.3% on the RIMES evaluation set. Table VI compares this result to previously published results on RIMES. The systems of the other publications in the table are the same as for IAM. It can be seen that on both corpora our large MDLSTM optical model achieves significant improvements both over one-dimensional LSTM models and over previously used smaller MDLSTM models.

Table VI: Comparison of the proposed system to results reported by other groups on the RIMES evaluation set.

System                      WER[%]    CER[%]
Our system                  9.6       2.8
Doetsch et al. [2]          12.9      4.3
Voigtlaender et al. [30]    12.1      4.4
Pham et al. [9]             12.3      3.3

VII. CONCLUSION

We presented our efficient GPU-based implementation of MDLSTM and showed that the network depth plays an important role for good performance. We trained deep networks with up to ten hidden layers and achieved significant performance improvements, outperforming state of the art results on two databases.
However, we think that these experiments are just the starting point for further progress in handwriting recognition. With the help of our implementation, many more hyperparameters and novel architectural components like deep residual networks [13] can be quickly explored. Additionally, our software provides a general framework for training, from which also other applications like speech recognition or image segmentation can benefit in the future.

ACKNOWLEDGMENT

The authors would like to thank Mahdi Hamdani for help with the open vocabulary recognition setup.

REFERENCES

[1] A. Graves, "Supervised sequence labelling with recurrent neural networks," Ph.D. dissertation, Technical University Munich, 2008.

[2] P. Doetsch, M. Kozielski, and H. Ney, "Fast and robust training of recurrent neural networks for offline handwriting recognition," in International Conference on Frontiers in Handwriting Recognition, Sep. 2014, pp. 279–284.

[3] X. Li and X. Wu, "Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, Apr. 2015, pp. 4520–4524.

[4] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.

[5] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," in Advances in Neural Information Processing Systems 21, 2008, pp. 545–552.

[6] T. Strauß, T. Grüning, G. Leifert, and R. Labahn, "Citlab ARGUS for historical handwritten documents," arXiv preprint arXiv:1412.3949, 2014.

[7] E. Grosicki and H. ElAbed, "ICDAR 2009 handwriting recognition competition," in Proc. of the Int. Conf. on Document Analysis and Recognition, 2009.

[15] U.-V. Marti and H. Bunke, "The IAM-database: an English sentence database for offline handwriting recognition," International Journal of Document Analysis and Recognition, vol. 5, no. 1, pp. 39–46, 2002.

[16] M. Kozielski, D. Rybach, S. Hahn, R. Schlüter, and H. Ney, "Open vocabulary handwriting recognition using combined word-level and character-level language models," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, May 2013, pp. 8257–8261.

[17] E. Augustin, J.-M. Brodin, M. Carré, E. Geoffrois, E. Grosicki, and F. Prêteux, "RIMES evaluation campaign for handwritten mail processing," in Proceedings of the Workshop on Frontiers in Handwriting Recognition, no. 1, 2006.

[18] P. Doetsch, A. Zeyer, P. Voigtlaender, I. Kulikov, R. Schlüter, and H. Ney, "RETURNN: The RWTH extensible training framework for universal recurrent neural networks," arXiv preprint arXiv:1608.00895, 2016.

[19] A. Graves, "RNNLIB: A recurrent neural network library for sequence learning problems." [Online]. Available: http://sourceforge.net/projects/rnnl/

[20] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," arXiv preprint arXiv:1601.06759, 2016.

[21] M. F. Stollenga, W. Byeon, M. Liwicki, and J. Schmidhuber, "Parallel multi-dimensional LSTM, with application to fast biomedical volumetric image segmentation," in Advances in Neural Information Processing Systems 28, 2015, pp. 2998–3006.

[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[23] T. Dozat, "Incorporating Nesterov momentum into Adam," Stanford University, Tech. Rep., 2015. [Online]. Available: http://cs229.stanford.edu/proj2015/054 report.pdf

[24] A. M. Saxe, J. L. McClelland, and S. Ganguli, "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks," arXiv preprint arXiv:1312.6120, 2013.