Connectionist Temporal Classification: Labelling Unsegmented Sequence Data With Recurrent Neural Networks
HMMs, hybrid systems do not exploit the full potential of RNNs for sequence modelling.

This paper presents a novel method for labelling sequence data with RNNs that removes the need for pre-segmented training data and post-processed outputs, and models all aspects of the sequence within a single network architecture. The basic idea is to interpret the network outputs as a probability distribution over all possible label sequences, conditioned on a given input sequence. Given this distribution, an objective function can be derived that directly maximises the probabilities of the correct labellings. Since the objective function is differentiable, the network can then be trained with standard backpropagation through time (Werbos, 1990).

In what follows, we refer to the task of labelling unsegmented data sequences as temporal classification (Kadous, 2002), and to our use of RNNs for this purpose as connectionist temporal classification (CTC). By contrast, we refer to the independent labelling of each time-step, or frame, of the input sequence as framewise classification.

The next section provides the mathematical formalism for temporal classification, and defines the error measure used in this paper. Section 3 describes the output representation that allows RNNs to be used as temporal classifiers. Section 4 explains how CTC networks can be trained. Section 5 compares CTC to hybrid and HMM systems on the TIMIT speech corpus. Section 6 discusses some key differences between CTC and other temporal classifiers, giving directions for future work, and the paper concludes with section 7.

2. Temporal Classification

Let S be a set of training examples drawn from a fixed distribution D_{X×Z}. The input space X = (R^m)^* is the set of all sequences of m dimensional real valued vectors. The target space Z = L^* is the set of all sequences over the (finite) alphabet L of labels. In general, we refer to elements of L^* as label sequences or labellings. Each example in S consists of a pair of sequences (x, z). The target sequence z = (z_1, z_2, ..., z_U) is at most as long as the input sequence x = (x_1, x_2, ..., x_T), i.e. U ≤ T. Since the input and target sequences are not generally the same length, there is no a priori way of aligning them.

The aim is to use S to train a temporal classifier h : X → Z to classify previously unseen input sequences in a way that minimises some task specific error measure.

2.1. Label Error Rate

In this paper, we are interested in the following error measure: given a test set S' ⊂ D_{X×Z} disjoint from S, define the label error rate (LER) of a temporal classifier h as the mean normalised edit distance between its classifications and the targets on S', i.e.

    LER(h, S') = \frac{1}{|S'|} \sum_{(x,z) \in S'} \frac{ED(h(x), z)}{|z|}    (1)

where ED(p, q) is the edit distance between two sequences p and q, i.e. the minimum number of insertions, substitutions and deletions required to change p into q.

This is a natural measure for tasks (such as speech or handwriting recognition) where the aim is to minimise the rate of transcription mistakes.
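As a concrete illustration, the sketch below (Python, with illustrative names; targets and classifications are assumed to be plain Python sequences of labels) computes ED(p, q) by dynamic programming and uses it to evaluate the LER of equation (1) over a small test set.

def edit_distance(p, q):
    """Minimum number of insertions, substitutions and deletions turning p into q."""
    d = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        d[i][0] = i
    for j in range(len(q) + 1):
        d[0][j] = j
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(p)][len(q)]

def label_error_rate(classify, test_set):
    """Mean normalised edit distance of eq. (1); classify maps x to a labelling."""
    return sum(edit_distance(classify(x), z) / len(z) for x, z in test_set) / len(test_set)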
3. Connectionist Temporal Classification

This section describes the output representation that allows a recurrent neural network to be used for CTC. The crucial step is to transform the network outputs into a conditional probability distribution over label sequences. The network can then be used as a classifier by selecting the most probable labelling for a given input sequence.

3.1. From Network Outputs to Labellings

A CTC network has a softmax output layer (Bridle, 1990) with one more unit than there are labels in L. The activations of the first |L| units are interpreted as the probabilities of observing the corresponding labels at particular times. The activation of the extra unit is the probability of observing a 'blank', or no label. Together, these outputs define the probabilities of all possible ways of aligning all possible label sequences with the input sequence. The total probability of any one label sequence can then be found by summing the probabilities of its different alignments.

More formally, for an input sequence x of length T, define a recurrent neural network with m inputs, n outputs and weight vector w as a continuous map N_w : (R^m)^T → (R^n)^T. Let y = N_w(x) be the sequence of network outputs, and denote by y^t_k the activation of output unit k at time t. Then y^t_k is interpreted as the probability of observing label k at time t, which defines a distribution over the set L'^T of length T sequences over the alphabet L' = L ∪ {blank}:

    p(\pi|x) = \prod_{t=1}^{T} y^t_{\pi_t}, \qquad \forall \pi \in L'^T.    (2)
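A single term of (2) can be evaluated as follows (a minimal sketch, assuming the softmax outputs are held in a NumPy array y of shape T × (|L| + 1), with the blank in the last column):

import numpy as np

def path_probability(y, path):
    """p(pi|x) of eq. (2): the product of the chosen activations along one path.

    y    : array of shape (T, len(L) + 1); y[t, k] is the output for label k at
           time t, with the final column interpreted as the blank.
    path : sequence of T label indices, one per time-step (blanks allowed).
    """
    T = y.shape[0]
    assert len(path) == T, "a path assigns exactly one symbol to every time-step"
    return float(np.prod([y[t, k] for t, k in enumerate(path)]))

In practice one would accumulate log activations rather than multiply probabilities, for the underflow reasons discussed in section 4.1.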
[Figure 1 graphic: speech waveform for "the sound of", with the framewise and CTC output activations plotted as label probabilities between 0 and 1.]
Figure 1. Framewise and CTC networks classifying a speech signal. The shaded lines are the output activations,
corresponding to the probabilities of observing phonemes at particular times. The CTC network predicts only the
sequence of phonemes (typically as a series of spikes, separated by ‘blanks’, or null predictions), while the framewise
network attempts to align them with the manual segmentation (vertical lines). The framewise network receives an error
for misaligning the segment boundaries, even if it predicts the correct phoneme (e.g. ‘dh’). When one phoneme always
occurs beside another (e.g. the closure ‘dcl’ with the stop ‘d’), CTC tends to predict them together in a double spike.
The choice of labelling can be read directly from the CTC outputs (follow the spikes), whereas the predictions of the
framewise network must be post-processed before use.
From now on, we refer to the elements of L'^T as paths, and denote them π.

Implicit in (2) is the assumption that the network outputs at different times are conditionally independent, given the internal state of the network. This is ensured by requiring that no feedback connections exist from the output layer to itself or the network.

The next step is to define a many-to-one map B : L'^T → L^{≤T}, where L^{≤T} is the set of possible labellings (i.e. the set of sequences of length less than or equal to T over the original label alphabet L). We do this by simply removing all blanks and repeated labels from the paths (e.g. B(a−ab−) = B(−aa−−abb) = aab). Intuitively, this corresponds to outputting a new label when the network switches from predicting no label to predicting a label, or from predicting one label to another (cf. the CTC outputs in figure 1). Finally, we use B to define the conditional probability of a given labelling l ∈ L^{≤T} as the sum of the probabilities of all the paths corresponding to it:

    p(l|x) = \sum_{\pi \in B^{-1}(l)} p(\pi|x).    (3)
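The sketch below (illustrative names; the blank is represented by None) implements the collapsing map B and, for very short inputs only, evaluates equation (3) by brute-force enumeration of every path; the forward-backward algorithm of section 4.1 replaces this exponential sum in practice.

import itertools
import numpy as np

BLANK = None  # stand-in for the blank symbol

def collapse(path):
    """The map B: merge repeated symbols, then remove blanks."""
    merged, prev = [], object()          # sentinel that equals nothing
    for s in path:
        if s != prev:
            merged.append(s)
        prev = s
    return tuple(s for s in merged if s is not BLANK)

def labelling_probability_bruteforce(y, labels, target):
    """p(l|x) of eq. (3): sum of p(pi|x) over every path pi with B(pi) = l.

    y      : (T, len(labels) + 1) softmax outputs, blank in the last column.
    labels : the alphabet L, e.g. ['a', 'b'].
    target : the labelling l as a tuple of symbols from L.
    Exponential in T, so only usable on tiny examples.
    """
    T = y.shape[0]
    symbols = list(labels) + [BLANK]
    total = 0.0
    for path in itertools.product(range(len(symbols)), repeat=T):
        if collapse(tuple(symbols[k] for k in path)) == tuple(target):
            total += np.prod([y[t, k] for t, k in enumerate(path)])
    return float(total)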
3.2. Constructing the Classifier

Given the above formulation, the output of the classifier should be the most probable labelling for the input sequence:

    h(x) = \arg\max_{l \in L^{\leq T}} p(l|x).

Using the terminology of HMMs, we refer to the task of finding this labelling as decoding. Unfortunately, we do not know of a general, tractable decoding algorithm for our system. However the following two approximate methods give good results in practice.

The first method (best path decoding) is based on the assumption that the most probable path will correspond to the most probable labelling:

    h(x) \approx B(\pi^*)    (4)

where \pi^* = \arg\max_{\pi \in N^t} p(\pi|x).

Best path decoding is trivial to compute, since π* is just the concatenation of the most active outputs at every time-step. However it is not guaranteed to find the most probable labelling.
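A minimal sketch of best path decoding (same assumed layout of y as above): take the most active output at every time-step, then apply B by merging repeats and dropping blanks.

import numpy as np

def best_path_decode(y, labels):
    """Approximate h(x) as B(pi*) (eq. 4).

    y      : (T, len(labels) + 1) softmax outputs, blank in the last column.
    labels : the alphabet L.
    """
    blank = len(labels)                  # index of the blank unit
    best = np.argmax(y, axis=1)          # most active output at every time-step
    decoded, prev = [], None
    for k in best:
        if k != prev and k != blank:     # a new, non-blank label: emit it
            decoded.append(labels[k])
        prev = k
    return decoded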
The second method (prefix search decoding) relies on the fact that, by modifying the forward-backward algorithm of section 4.1, we can efficiently calculate the probabilities of successive extensions of labelling prefixes (figure 2).

Given enough time, prefix search decoding always finds the most probable labelling. However, the maximum number of prefixes it must expand grows exponentially with the input sequence length. If the output distribution is sufficiently peaked around the mode, it will nonetheless finish in reasonable time. For the experiment in this paper, though, a further heuristic was required to make its application feasible.
Observing that the outputs of a trained CTC network tend to form a series of spikes separated by strongly predicted blanks (figure 1), we divide the output sequence into sections at boundary points where the probability of a blank is above a certain threshold. We then calculate the most probable labelling for each section individually and concatenate these to get the final classification.

In practice, prefix search works well with this heuristic, and generally outperforms best path decoding. However it does fail in some cases, e.g. if the same label is predicted weakly on both sides of a section boundary.
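The sectioning heuristic might be sketched as follows (illustrative Python; blank_threshold is an assumed tuning parameter and decode_section stands in for prefix search run over a single section, neither of which is specified here):

import numpy as np

def split_at_blanks(y, blank, blank_threshold=0.99):
    """Split the frame range [0, T) into sections separated by frames whose
    blank probability exceeds the threshold."""
    T = y.shape[0]
    sections, start = [], 0
    for t in range(T):
        if y[t, blank] > blank_threshold:
            if t > start:
                sections.append((start, t))
            start = t + 1
    if start < T:
        sections.append((start, T))
    return sections

def heuristic_decode(y, blank, decode_section):
    """Decode each section individually and concatenate the results."""
    labelling = []
    for a, b in split_at_blanks(y, blank):
        labelling.extend(decode_section(y[a:b]))
    return labelling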
4. Training the Network

So far we have described an output representation that allows RNNs to be used for CTC. We now derive an objective function for training CTC networks with gradient descent.

The objective function is derived from the principle of maximum likelihood. That is, minimising it maximises the log likelihoods of the target labellings. Note that this is the same principle underlying the standard neural network objective functions (Bishop, 1995). Given the objective function, and its derivatives with respect to the network outputs, the weight gradients can be calculated with standard backpropagation through time. The network can then be trained with any of the gradient-based optimisation algorithms currently in use (LeCun et al., 1998; Schraudolph, 2002).

4.1. The CTC Forward-Backward Algorithm

We require an efficient way of calculating the conditional probabilities p(l|x) of individual labellings. At first sight (3) suggests this will be problematic: the sum runs over all paths corresponding to a given labelling, and in general there are very many of these. Fortunately the sum can be computed with a dynamic programming algorithm similar to the forward-backward algorithm for HMMs (Rabiner, 1989). For a labelling l, define the forward variable α_t(s) as the total probability of l_{1:s} (its first s symbols) at time t:

    \alpha_t(s) \overset{\mathrm{def}}{=} \sum_{\substack{\pi \in N^T: \\ B(\pi_{1:t}) = l_{1:s}}} \prod_{t'=1}^{t} y^{t'}_{\pi_{t'}}    (5)

As we will see, α_t(s) can be calculated recursively from α_{t−1}(s) and α_{t−1}(s − 1).

To allow for blanks in the output paths, we consider a modified label sequence l', with blanks added to the beginning and the end and inserted between every pair of labels. The length of l' is therefore 2|l| + 1. In calculating the probabilities of prefixes of l' we allow all transitions between blank and non-blank labels, and also those between any pair of distinct non-blank labels. We allow all prefixes to start with either a blank (b) or the first symbol in l (l_1).

This gives us the following rules for initialisation

    \alpha_1(1) = y^1_b
    \alpha_1(2) = y^1_{l_1}
    \alpha_1(s) = 0, \forall s > 2

and recursion

    \alpha_t(s) = \bar{\alpha}_t(s)\, y^t_{l'_s}    if l'_s = b or l'_{s-2} = l'_s
    \alpha_t(s) = \big(\bar{\alpha}_t(s) + \alpha_{t-1}(s-2)\big)\, y^t_{l'_s}    otherwise    (6)

where

    \bar{\alpha}_t(s) \overset{\mathrm{def}}{=} \alpha_{t-1}(s) + \alpha_{t-1}(s-1).    (7)
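A direct transcription of this initialisation and recursion into Python (a sketch under the same assumptions as before: y is the T × (|L| + 1) softmax output matrix with the blank in the last column, and the labelling is a list of label indices). Equation (8), given below, then recovers p(l|x) as alpha[-1, -1] + alpha[-1, -2].

import numpy as np

def extend_with_blanks(labelling, blank):
    """Build l' by surrounding every label with blanks; its length is 2|l| + 1."""
    extended = [blank]
    for label in labelling:
        extended += [label, blank]
    return extended

def ctc_forward(y, labelling, blank):
    """Forward variables alpha[t, s] of eqs. (6)-(7), with 0-based indices."""
    T = y.shape[0]
    lp = extend_with_blanks(labelling, blank)
    S = len(lp)
    alpha = np.zeros((T, S))
    # initialisation: a prefix may start with a blank or with the first label
    alpha[0, 0] = y[0, blank]
    if S > 1:
        alpha[0, 1] = y[0, lp[1]]
    # recursion
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]            # alpha-bar of eq. (7)
            if s >= 2 and lp[s] != blank and lp[s] != lp[s - 2]:
                a += alpha[t - 1, s - 2]            # skip the blank between distinct labels
            alpha[t, s] = a * y[t, lp[s]]
    return alpha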
Figure 3. Illustration of the forward-backward algorithm applied to the labelling 'CAT'. Black circles represent labels, and white circles represent blanks. Arrows signify allowed transitions. Forward variables are updated in the direction of the arrows, and backward variables are updated against them.

Note that α_t(s) = 0 ∀s < |l'| − 2(T − t) − 1, because these variables correspond to states for which there are not enough time-steps left to complete the sequence (the unconnected circles in the top right of figure 3). Also α_t(s) = 0 ∀s < 1.

The probability of l is then the sum of the total probabilities of l' with and without the final blank at time T:

    p(l|x) = \alpha_T(|l'|) + \alpha_T(|l'| - 1).    (8)

Similarly, the backward variables β_t(s) are defined as the total probability of l_{s:|l|} at time t:

    \beta_t(s) \overset{\mathrm{def}}{=} \sum_{\substack{\pi \in N^T: \\ B(\pi_{t:T}) = l_{s:|l|}}} \prod_{t'=t}^{T} y^{t'}_{\pi_{t'}}    (9)

The corresponding initialisation is

    \beta_T(|l'|) = y^T_b
    \beta_T(|l'| - 1) = y^T_{l_{|l|}}
    \beta_T(s) = 0, \forall s < |l'| - 1

and the recursion is

    \beta_t(s) = \bar{\beta}_t(s)\, y^t_{l'_s}    if l'_s = b or l'_{s+2} = l'_s
    \beta_t(s) = \big(\bar{\beta}_t(s) + \beta_{t+1}(s+2)\big)\, y^t_{l'_s}    otherwise    (10)

where

    \bar{\beta}_t(s) \overset{\mathrm{def}}{=} \beta_{t+1}(s) + \beta_{t+1}(s+1).    (11)

β_t(s) = 0 ∀s > 2t (the unconnected circles in the bottom left of figure 3) and ∀s > |l'|.
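The corresponding backward pass, transcribed from (9)-(11) in the same illustrative style (0-based indices, blank in the last column of y, non-empty labelling assumed):

import numpy as np

def ctc_backward(y, labelling, blank):
    """Backward variables beta[t, s] of eqs. (10)-(11) over the extended sequence l'."""
    T = y.shape[0]
    lp = [blank]
    for label in labelling:
        lp += [label, blank]               # build l' inline
    S = len(lp)
    beta = np.zeros((T, S))
    # initialisation at t = T: only the final blank and the final label are allowed
    beta[T - 1, S - 1] = y[T - 1, blank]
    beta[T - 1, S - 2] = y[T - 1, lp[S - 2]]
    # recursion, running backwards in time
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1, s]
            if s + 1 < S:
                b += beta[t + 1, s + 1]    # beta-bar of eq. (11)
            if s + 2 < S and lp[s] != blank and lp[s] != lp[s + 2]:
                b += beta[t + 1, s + 2]
            beta[t, s] = b * y[t, lp[s]]
    return beta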
In practice, the above recursions will soon lead to underflows on any digital computer. One way of avoiding this is to rescale the forward and backward variables (Rabiner, 1989): define C_t = Σ_s α_t(s) and D_t = Σ_s β_t(s), work with the rescaled variables α̂_t(s) = α_t(s)/C_t and β̂_t(s) = β_t(s)/D_t, and substitute α̂ for α on the RHS of (6) and (7) and β̂ for β on the RHS of (10) and (11). To evaluate the maximum likelihood error, we need the natural logs of the target labelling probabilities. With the rescaled variables these have a particularly simple form:

    \ln(p(l|x)) = \sum_{t=1}^{T} \ln(C_t)
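A sketch of the rescaled forward pass under the same assumptions as the earlier forward sketch. The final log term accounts for forward probability mass left in states that cannot complete the labelling; under the convention above that such α_t(s) are set to zero, that mass is excluded, the term vanishes, and the expression reduces to Σ_t ln(C_t).

import numpy as np

def ctc_log_likelihood(y, labelling, blank):
    """ln p(l|x) computed with per-step rescaling of the forward variables."""
    T = y.shape[0]
    lp = [blank]
    for label in labelling:
        lp += [label, blank]
    S = len(lp)
    alpha = np.zeros(S)
    alpha[0] = y[0, blank]
    if S > 1:
        alpha[1] = y[0, lp[1]]
    C = alpha.sum()                        # C_1
    alpha /= C
    log_p = np.log(C)
    for t in range(1, T):
        new = np.zeros(S)
        for s in range(S):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            if s >= 2 and lp[s] != blank and lp[s] != lp[s - 2]:
                a += alpha[s - 2]
            new[s] = a * y[t, lp[s]]
        C = new.sum()                      # C_t
        alpha = new / C
        log_p += np.log(C)
    # only the last two entries of l' correspond to complete labellings (eq. 8);
    # if the states that cannot finish were zeroed out, tail == 1 and ln(tail) == 0
    tail = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return float(log_p + np.log(tail))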
4.2. Maximum Likelihood Training

The aim of maximum likelihood training is to simultaneously maximise the log probabilities of all the correct classifications in the training set. In our case, this means minimising the following objective function:

    O^{ML}(S, N_w) = -\sum_{(x,z) \in S} \ln\big(p(z|x)\big)    (12)

To train the network with gradient descent, we need to differentiate (12) with respect to the network outputs. Since the training examples are independent we can consider them separately:

    \frac{\partial O^{ML}(\{(x,z)\}, N_w)}{\partial y^t_k} = -\frac{\partial \ln(p(z|x))}{\partial y^t_k}    (13)

We now show how the algorithm of section 4.1 can be used to calculate (13).

The key point is that, for a labelling l, the product of the forward and backward variables at a given s and t is the probability of all the paths corresponding to l that go through the symbol s at time t. More precisely, from (5) and (9) we have:

    \alpha_t(s)\beta_t(s) = y^t_{l'_s} \sum_{\substack{\pi \in B^{-1}(l): \\ \pi_t = l'_s}} \prod_{t'=1}^{T} y^{t'}_{\pi_{t'}}.

Rearranging and substituting in from (2) gives

    \frac{\alpha_t(s)\beta_t(s)}{y^t_{l'_s}} = \sum_{\substack{\pi \in B^{-1}(l): \\ \pi_t = l'_s}} p(\pi|x).
From (3) we can see that this is the portion of the total probability p(l|x) due to those paths going through l'_s at time t. Since, at any time t, every path in B^{-1}(l) passes through exactly one symbol of l', we can sum over all s to recover the total probability, for any t:

    p(l|x) = \sum_{s=1}^{|l'|} \frac{\alpha_t(s)\beta_t(s)}{y^t_{l'_s}}.    (14)
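The identity in (14) is easy to check numerically. The self-contained sketch below recomputes the forward and backward tables for a small example and verifies that the sum over s is the same for every t and equals p(l|x) from (8); these per-state products α_t(s)β_t(s) are also the quantities used to assemble the derivative in (13).

import numpy as np

def check_forward_backward_identity(y, labelling, blank):
    """Verify eq. (14): for every t, sum_s alpha_t(s) beta_t(s) / y[t, l'_s] = p(l|x)."""
    assert len(labelling) > 0
    T = y.shape[0]
    lp = [blank]
    for label in labelling:
        lp += [label, blank]
    S = len(lp)

    def step(prev, t, s, backward):
        # shared transition rule of eqs. (6) and (10)
        total = prev[s]
        nb = s + 1 if backward else s - 1          # neighbouring state
        sk = s + 2 if backward else s - 2          # skip over one blank
        if 0 <= nb < S:
            total += prev[nb]
        if 0 <= sk < S and lp[s] != blank and lp[s] != lp[sk]:
            total += prev[sk]
        return total * y[t, lp[s]]

    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0, 0], alpha[0, 1] = y[0, blank], y[0, lp[1]]
    beta[T - 1, S - 1], beta[T - 1, S - 2] = y[T - 1, blank], y[T - 1, lp[S - 2]]
    for t in range(1, T):
        alpha[t] = [step(alpha[t - 1], t, s, backward=False) for s in range(S)]
    for t in range(T - 2, -1, -1):
        beta[t] = [step(beta[t + 1], t, s, backward=True) for s in range(S)]

    p = alpha[T - 1, S - 1] + alpha[T - 1, S - 2]              # eq. (8)
    for t in range(T):
        assert np.isclose(sum(alpha[t, s] * beta[t, s] / y[t, lp[s]] for s in range(S)), p)
    return float(p)

# e.g. with random strictly positive softmax outputs:
# y = np.random.rand(6, 3); y /= y.sum(axis=1, keepdims=True)
# check_forward_backward_identity(y, labelling=[0, 1, 0], blank=2)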
tion with the corresponding part of the sequence. This should make it suitable for tasks like keyword spotting, where approximate segmentation is sufficient.

Another distinctive feature of CTC is that it does not explicitly model inter-label dependencies. This is in contrast to graphical models, where the labels are typically assumed to form a kth order Markov chain. Nonetheless, CTC implicitly models inter-label dependencies, e.g. by predicting labels that commonly occur together as a double spike (see figure 1).

One very general way of dealing with structured data would be a hierarchy of temporal classifiers, where the labellings at one level (e.g. letters) become inputs for the labellings at the next (e.g. words). Preliminary experiments with hierarchical CTC have been encouraging, and we intend to pursue this direction further.

Good generalisation is always difficult with maximum likelihood training, but appears to be particularly so for CTC. In the future, we will continue to explore methods to reduce overfitting, such as weight decay, boosting and margin maximisation.

7. Conclusions

We have introduced a novel, general method for temporal classification with RNNs. Our method fits naturally into the existing framework of neural network classifiers, and is derived from the same probabilistic principles. It obviates the need for pre-segmented data, and allows the network to be trained directly for sequence labelling. Moreover, without requiring any task-specific knowledge, it has outperformed both an HMM and an HMM-RNN hybrid on a real-world temporal classification problem.

Acknowledgements

We thank Marcus Hutter for useful mathematical discussions. This research was funded by SNF grants 200021-111968/1 and 200020-107534/1.

References

Bengio, Y. (1999). Markovian models for sequential data. Neural Computing Surveys, 2, 129–162.

Bishop, C. (1995). Neural networks for pattern recognition, chapter 6. Oxford University Press, Inc.

Bourlard, H., & Morgan, N. (1994). Connectionist speech recognition: A hybrid approach. Kluwer Academic Publishers.

Bridle, J. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Soulie and J. Herault (Eds.), Neurocomputing: Algorithms, architectures and applications, 227–236. Springer-Verlag.

Gers, F., Schraudolph, N., & Schmidhuber, J. (2002). Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3, 115–143.

Graves, A., Fernández, S., & Schmidhuber, J. (2005). Bidirectional LSTM networks for improved phoneme classification and recognition. Proceedings of the 2005 International Conference on Artificial Neural Networks. Warsaw, Poland.

Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18, 602–610.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.

Kadous, M. W. (2002). Temporal classification: Extending the classification paradigm to multivariate time series. Doctoral dissertation, School of Computer Science & Engineering, University of New South Wales.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. 18th International Conf. on Machine Learning (pp. 282–289). Morgan Kaufmann, San Francisco, CA.

LeCun, Y., Bottou, L., Orr, G., & Muller, K. (1998). Efficient backprop. Neural Networks: Tricks of the trade. Springer.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE (pp. 257–286). IEEE.

Robinson, A. J. (1991). Several improvements to a recurrent error propagation network phone recognition system (Technical Report CUED/F-INFENG/TR82). University of Cambridge.

Robinson, A. J. (1994). An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5, 298–305.

Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14, 1723–1738.

Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45, 2673–2681.

Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78, 1550–1560.