Natural Language Processing (almost) from Scratch

[email protected]
Pavel Kuksa‡ [email protected]
NEC Labs America, Princeton, NJ
Abstract
We propose a unified neural network architecture and learning algorithm that can be applied
to various natural language processing tasks including: part-of-speech tagging, chunking,
named entity recognition, and semantic role labeling. This versatility is achieved by trying
to avoid task-specific engineering and therefore disregarding a lot of prior knowledge.
Instead of exploiting man-made input features carefully optimized for each task, our system
learns internal representations on the basis of vast amounts of mostly unlabeled training
data. This work is then used as a basis for building a freely available tagging system with
good performance and minimal computational requirements.
Keywords: Natural Language Processing, Neural Networks
1. Introduction
Will a computer program ever be able to convert a piece of English text into a data structure
that unambiguously and completely describes the meaning of the natural language text?
Among numerous problems, no consensus has emerged about the form of such a data
structure. Until such fundamental Artificial Intelligence problems are resolved, computer
scientists must settle for reduced objectives: extracting simpler representations describing
restricted aspects of the textual information.
These simpler representations are often motivated by specific applications, for instance,
bag-of-words variants for information retrieval. These representations can also be motivated
by our belief that they capture something more general about natural language. They
can describe syntactic information (e.g. part-of-speech tagging, chunking, and parsing) or
semantic information (e.g. word-sense disambiguation, semantic role labeling, named entity
extraction, and anaphora resolution). Text corpora have been manually annotated with such
data structures in order to compare the performance of various systems. The availability of
standard benchmarks has stimulated research in Natural Language Processing (NLP) and
†. Koray Kavukcuoglu is also with New York University, New York, NY.
‡. Pavel Kuksa is also with Rutgers University, New Brunswick, NJ.
© 2009 Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu and Pavel Kuksa.
effective systems have been designed for all these tasks. Such systems are often viewed as
software components for constructing real-world NLP solutions.
The overwhelming majority of these state-of-the-art systems address a benchmark
task by applying linear statistical models to ad-hoc features. In other words, the
researchers themselves discover intermediate representations by engineering task-specific
features. These features are often derived from the output of preexisting systems, leading
to complex runtime dependencies. This approach is effective because researchers leverage
a large body of linguistic knowledge. On the other hand, there is a great temptation to
optimize the performance of a system for a specific benchmark. Although such performance
improvements can be very useful in practice, they teach us little about the means to progress
toward the broader goals of natural language understanding and the elusive goals of Artificial
Intelligence.
In this contribution, we try to excel on multiple benchmarks while avoiding task-specific
engineering. Instead we use a single learning system able to discover adequate internal
representations. In fact we view the benchmarks as indirect measurements of the relevance
of the internal representations discovered by the learning procedure, and we posit that these
intermediate representations are more general than any of the benchmarks. Our desire to
avoid task-specific engineered features led us to ignore a large body of linguistic knowledge.
Instead we reach good performance levels in most of the tasks by transferring intermediate
representations discovered on large unlabeled datasets. We call this approach “almost from
scratch” to emphasize the reduced (but still important) reliance on a priori NLP knowledge.
The paper is organized as follows. Section 2 describes the benchmark tasks of
interest. Section 3 describes the unified model and reports benchmark results obtained with
supervised training. Section 4 leverages large unlabeled datasets (∼ 852 million words)
to train the model on a language modeling task. Performance improvements are then
demonstrated by transferring the unsupervised internal representations into the supervised
benchmark models. Section 5 investigates multitask supervised training. Section 6 then
evaluates how much further improvement can be achieved by incorporating standard NLP
task-specific engineering into our systems. Drifting away from our initial goals gives us the
opportunity to construct an all-purpose tagger that is simultaneously accurate, practical,
and fast. We then conclude with a short discussion section.
2. The Benchmark Tasks
Table 1: Experimental setup: for each task, we report the standard benchmark we used,
the dataset it relates to, as well as training and test information.
Table 2: State-of-the-art systems on the four benchmark tasks. Performance is reported in per-word accuracy for POS and in F1 score for CHUNK, NER and SRL.

(a) POS (Accuracy)
Shen et al. (2007)            97.33%
Toutanova et al. (2003)       97.24%
Giménez and Màrquez (2004)    97.16%

(b) CHUNK (F1)
Shen and Sarkar (2005)        95.23%
Sha and Pereira (2003)        94.29%
Kudo and Matsumoto (2001)     93.91%

(c) NER (F1)
Ando and Zhang (2005)         89.31%
Florian et al. (2003)         88.76%
Kudo and Matsumoto (2001)     88.31%

(d) SRL (F1)
Koomen et al. (2005)          77.92%
Pradhan et al. (2005)         77.30%
Haghighi et al. (2005)        77.04%
2.1 Part-Of-Speech Tagging

The standard benchmark setup we use is described in detail by Toutanova et al. (2003). Sections 0–18 of Wall Street Journal (WSJ) data are used for training, while
sections 19–21 are for validation and sections 22–24 for testing.
The best POS classifiers are based on classifiers trained on windows of text, which are
then fed to a bidirectional decoding algorithm during inference. Features include preceding
and following tag context as well as multiple words (bigrams, trigrams. . . ) context, and
handcrafted features to deal with unknown words. Toutanova et al. (2003), who use
maximum entropy classifiers, and a bidirectional dependency network (Heckerman et al.,
2001) at inference, reach 97.24% per-word accuracy. Giménez and Màrquez (2004) proposed
a SVM approach also trained on text windows, with bidirectional inference achieved with
two Viterbi decoders (left-to-right and right-to-left). They obtained 97.16% per-word
accuracy. More recently, Shen et al. (2007) pushed the state-of-the-art up to 97.33%,
with a new learning algorithm they call guided learning, also for bidirectional sequence
classification.
2.2 Chunking
Also called shallow parsing, chunking aims at labeling segments of a sentence with syntactic
constituents such as noun or verb phrases (NP or VP). Each word is assigned only one unique
tag, often encoded as a begin-chunk (e.g. B-NP) or inside-chunk tag (e.g. I-NP). Chunking
is often evaluated using the CoNLL 2000 shared task1 . Sections 15–18 of WSJ data are
used for training and section 20 for testing. Validation is achieved by splitting the training
set.
Kudoh and Matsumoto (2000) won the CoNLL 2000 challenge on chunking with an F1-
score of 93.48%. Their system was based on Support Vector Machines (SVMs). Each
SVM was trained in a pairwise classification manner, and fed with a window around the
word of interest containing POS and words as features, as well as surrounding tags. They
perform dynamic programming at test time. Later, they improved their results up to
93.91% (Kudo and Matsumoto, 2001) using an ensemble of classifiers trained with different
tagging conventions (see Section 3.2.3).
Since then, a certain number of systems based on second-order random fields were
reported (Sha and Pereira, 2003; McDonald et al., 2005; Sun et al., 2008), all reporting
around 94.3% F1 score. These systems use features composed of words, POS tags, and
tags.
More recently, Shen and Sarkar (2005) obtained 95.23% using a voting classifier scheme,
where each classifier is trained on different tag representations2 (IOB, IOE, . . . ). They use
POS features coming from an external tagger, as well as carefully hand-crafted specialization
features which again change the data representation by concatenating some (carefully
chosen) chunk tags or some words with their POS representation. They then build trigrams
over these features, which are finally passed through a Viterbi decoder at test time.
corpus was 27M words taken from Reuters. Features included words, POS tags, suffixes
and prefixes or CHUNK tags, but overall were less specialized than CoNLL 2003 challengers.
2.4 Semantic Role Labeling
State-of-the-art SRL systems consist of several stages: producing a parse tree, identifying
which parse tree nodes represent the arguments of a given verb, and finally classifying these
nodes to compute the corresponding SRL tags. This entails extracting numerous base
features from the parse tree and feeding them into statistical models. Feature categories
commonly used by these systems include (Gildea and Jurafsky, 2002; Pradhan et al., 2004):
• the parts of speech and syntactic labels of words and nodes in the tree;
• the node’s position (left or right) in relation to the verb;
• the syntactic path to the verb in the parse tree;
• whether a node in the parse tree is part of a noun or verb phrase;
• the voice of the sentence: active or passive;
• the node’s head word; and
• the verb sub-categorization.
Pradhan et al. (2004) take these base features and define additional features, notably
the part-of-speech tag of the head word, the predicted named entity class of the argument,
features providing word sense disambiguation for the verb (they add 25 variants of 12 new
feature types overall). This system is close to the state-of-the-art in performance. Pradhan
et al. (2005) obtain 77.30% F1 with a system based on SVM classifiers and simultaneously
using the two parse trees provided for the SRL task. In the same spirit, Haghighi et al.
(2005) use log-linear models on each tree node, re-ranked globally with a dynamic algorithm.
Their system reaches 77.04% using the five top Charniak parse trees.
Koomen et al. (2005) hold the state-of-the-art with Winnow-like (Littlestone, 1988)
classifiers, followed by a decoding stage based on an integer program that enforces specific
constraints on SRL tags. They reach 77.92% F1 on CoNLL 2005, thanks to the five top
parse trees produced by the Charniak (2000) parser (only the first one was provided by the
contest) as well as the Collins (1999) parse tree.
4. See http://www.lsi.upc.edu/~srlconll.
2.5 Evaluation
In our experiments, we strictly followed the standard evaluation procedure of each CoNLL
challenge for NER, CHUNK and SRL. All three tasks are evaluated by computing the
F1 scores over chunks produced by our models. The POS task is evaluated by computing
the per-word accuracy, as is the case for the standard benchmark we refer to (Toutanova
et al., 2003). We picked the conlleval script5 for evaluating POS6 , NER and CHUNK.
For SRL, we used the srl-eval.pl script included in the srlconll package7 .
2.6 Discussion
When participating in an (open) challenge, it is legitimate to increase generalization by all
means. It is thus not surprising to see many top CoNLL systems using external labeled data,
like additional NER classifiers for the NER architecture of Florian et al. (2003) or additional
parse trees for SRL systems (Koomen et al., 2005). Combining multiple systems or carefully
tweaking features is also a common approach, as in the top chunking system (Shen and
Sarkar, 2005).
However, when comparing systems, we do not learn anything about the quality of each
system if they were trained with different labeled data. For that reason, we will refer to
benchmark systems, that is, top existing systems which avoid usage of external data and
have been well-established in the NLP field: (Toutanova et al., 2003) for POS and (Sha and
Pereira, 2003) for chunking. For NER we consider (Ando and Zhang, 2005) as they were
using additional unlabeled data only. We picked (Koomen et al., 2005) for SRL, keeping in
mind they use 4 additional parse trees not provided by the challenge. These benchmark
systems will serve as baseline references in our experiments. We marked them in bold
in Table 2.
We note that, for the four tasks we are considering in this work, the best systems proposed
for the more complex tasks (with correspondingly lower accuracies) have more engineered
features relative to the best systems on the simpler tasks. That is, the POS
task is one of the simplest of our four tasks, and only has relatively few engineered features,
whereas SRL is the most complex, and many kinds of features have been designed for it.
This clearly has implications for as yet unsolved NLP tasks requiring more sophisticated
semantic understanding than the ones considered here.
3. The Networks
All the NLP tasks above can be seen as tasks assigning labels to words. The traditional NLP
approach is: extract from the sentence a rich set of hand-designed features which are then
fed to a standard classification algorithm, e.g. a Support Vector Machine (SVM), often with
a linear kernel. The choice of features is a completely empirical process, mainly based first
on linguistic intuition, and then trial and error, and the feature selection is task dependent,
implying additional research for each new NLP task. Complex tasks like SRL then require
a large number of possibly complex features (e.g., extracted from a parse tree) which can
5. Available at http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt.
6. We used the “-r” option of the conlleval script to get the per-word accuracy, for POS only.
7. Available at http://www.lsi.upc.es/~srlconll/srlconll-1.1.tgz.
impact the computational cost, which might be important for large-scale applications or
applications requiring real-time response.

[Figure 1: Window approach network. A lookup table layer (LT_{W^1}, . . . , LT_{W^K}) extracts features for each word of the input window, the resulting vectors are concatenated and fed to a linear layer (M^1 × ·, n^1_{hu} hidden units), a HardTanh non-linearity, and a final linear layer (M^2 × ·) whose output size n^2_{hu} equals the number of tags.]
Instead, we advocate a radically different approach: as input we will try to pre-process
our features as little as possible and then use a multilayer neural network (NN) architecture,
trained in an end-to-end fashion. The architecture takes the input sentence and learns
several layers of feature extraction that process the inputs. The features computed by the
deep layers of the network are automatically trained by backpropagation to be relevant to
the task. We describe in this section a general multilayer architecture suitable for all our
NLP tasks, which is generalizable to other NLP tasks as well.
Our architecture is summarized in Figure 1 and Figure 2. The first layer extracts features
for each word. The second layer extracts features from a window of words or from the whole
sentence, treating it as a sequence with local and global structure (i.e., it is not treated like
a bag of words). The following layers are standard NN layers.
Notations We consider a neural network fθ (·), with parameters θ. Any feed-forward
neural network with L layers, can be seen as a composition of functions fθl (·), corresponding
to each layer l:
fθ (·) = fθL (fθL−1 (. . . fθ1 (·) . . .)) .
In the following, we will describe each layer we use in our networks shown in Figure 1
and Figure 2.

[Figure 2: Sentence approach network. The input sentence (text plus padding, e.g. “The cat sat on the mat”) is described by K discrete features per word; lookup tables LT_{W^1}, . . . , LT_{W^K} produce word feature vectors, followed by a convolution layer (M^1 × ·, n^1_{hu} outputs per window), a max-over-time layer, a linear layer (M^2 × ·), a HardTanh non-linearity, and a final linear layer (M^3 × ·) with n^3_{hu} = #tags outputs.]

We adopt a few notations. Given a matrix $A$, we denote $[A]_{i,j}$ the coefficient
at row $i$ and column $j$ in the matrix. We also denote $\langle A \rangle_i^{d_{win}}$ the vector obtained by
concatenating the $d_{win}$ column vectors around the $i$th column vector of matrix $A \in \mathbb{R}^{d_1 \times d_2}$:

$$\left[ \langle A \rangle_i^{d_{win}} \right]^T = \Big( [A]_{1,\, i-d_{win}/2} \,\ldots\, [A]_{d_1,\, i-d_{win}/2}\,,\; \ldots\,,\; [A]_{1,\, i+d_{win}/2} \,\ldots\, [A]_{d_1,\, i+d_{win}/2} \Big)\,.$$

As a special case, $\langle A \rangle_i^1$ represents the $i$th column of matrix $A$. For a vector $v$, we denote
$[v]_i$ the scalar at index $i$ in the vector. Finally, a sequence of elements $\{x_1, x_2, \ldots, x_T\}$ is
written $[x]_1^T$. The $i$th element of the sequence is $[x]_i$.
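As a concrete illustration of the window notation above, the following NumPy sketch (our own, not code from the paper) builds $\langle A \rangle_i^{d_{win}}$ by stacking the $d_{win}$ columns centered on column $i$; it assumes $d_{win}$ is odd and that border padding has already been applied:

```python
import numpy as np

def window(A, i, d_win):
    """Concatenate the d_win column vectors of A centered on column i.

    A      : d1 x d2 matrix whose columns are per-word feature vectors.
    i      : 0-based index of the center column.
    d_win  : odd window size; columns i - d_win//2 .. i + d_win//2 are stacked.
    Returns a vector of length d1 * d_win.
    """
    half = d_win // 2
    cols = [A[:, t] for t in range(i - half, i + half + 1)]
    return np.concatenate(cols)

# Example: a 3 x 7 matrix (3 features, 7 words); the window of size 3 around
# word 2 concatenates the feature vectors of words 1, 2 and 3.
A = np.arange(21).reshape(3, 7)
print(window(A, 2, 3).shape)   # (9,)
```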
$$LT_W(w) = \langle W \rangle_w^1 \,,$$

where $W \in \mathbb{R}^{d_{wrd} \times |D|}$ is a matrix of parameters to be learnt, $\langle W \rangle_w^1 \in \mathbb{R}^{d_{wrd}}$ is the $w$th
column of $W$ and $d_{wrd}$ is the word vector size (a hyper-parameter to be chosen by the user).
Given a sentence or any sequence of $T$ words $[w]_1^T$ in $D$, the lookup table layer applies the
same operation for each word in the sequence, producing the following output matrix:

$$LT_W([w]_1^T) = \left( \langle W \rangle^1_{[w]_1} \;\; \langle W \rangle^1_{[w]_2} \;\; \ldots \;\; \langle W \rangle^1_{[w]_T} \right). \qquad (1)$$

This matrix can then be fed to further neural network layers, as we will see below. When a
word $w$ is described by $K$ discrete features, one lookup table $LT_{W^k}(\cdot)$ is used for each
feature, and the outputs are concatenated:

$$LT_{W^1,\ldots,W^K}(w) = \begin{pmatrix} LT_{W^1}(w_1) \\ \vdots \\ LT_{W^K}(w_K) \end{pmatrix} = \begin{pmatrix} \langle W^1 \rangle^1_{w_1} \\ \vdots \\ \langle W^K \rangle^1_{w_K} \end{pmatrix}.$$
8. We did some pre-processing, namely lowercasing and encoding capitalization as another feature. With
enough (unlabeled) training data, presumably we could learn a model without this processing. Ideally,
an even more raw input would be to learn from letter sequences rather than words, however we felt that
this was beyond the scope of this work.
The matrix output of the lookup table layer for a sequence of words $[w]_1^T$ is then similar
to (1), but where extra rows have been added for each discrete feature:

$$LT_{W^1,\ldots,W^K}([w]_1^T) = \begin{pmatrix} \langle W^1 \rangle^1_{[w^1]_1} & \ldots & \langle W^1 \rangle^1_{[w^1]_T} \\ \vdots & & \vdots \\ \langle W^K \rangle^1_{[w^K]_1} & \ldots & \langle W^K \rangle^1_{[w^K]_T} \end{pmatrix}. \qquad (2)$$
These vector features in the lookup table effectively learn features for words in the dictionary.
Now, we want to use these trainable features as input to further layers of trainable feature
extractors, that can represent groups of words and then finally sentences.
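In code, the lookup operations (1) and (2) are plain column selection and row-wise stacking. The sketch below is our own illustration (dictionary sizes, dimensions and word indices are made up, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)

# One lookup table per discrete feature: W^k has shape (d_k, |D^k|).
W_word = rng.normal(size=(50, 100_000))   # word feature, d_wrd = 50 (illustrative)
W_caps = rng.normal(size=(5, 4))          # capitalization feature, 4 possible values

def lookup(W, indices):
    """LT_W([w]_1^T): select one column of W per word index (eq. 1)."""
    return W[:, indices]                   # shape (d, T)

def lookup_all(tables, feature_indices):
    """Stack the per-feature lookups row-wise, one column per word (eq. 2)."""
    return np.vstack([lookup(W, idx) for W, idx in zip(tables, feature_indices)])

# A 4-word sentence: word indices and caps-feature indices (hypothetical values).
words = np.array([12, 4031, 7, 99_999])
caps  = np.array([2, 0, 0, 0])
X = lookup_all([W_word, W_caps], [words, caps])
print(X.shape)   # (55, 4): d_wrd + d_caps rows, one column per word
```

Supervised training then back-propagates gradients into the selected columns, which is how the word embeddings are learned.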
Producing tags for each element in variable length sequences (here, a sentence is a sequence of words)
is a standard problem in machine learning. We consider two common approaches which tag
one word at a time: a window approach, and a (convolutional) sentence approach.
Linear Layer The fixed size vector $f_\theta^1$ can be fed to one or several standard neural
network layers which perform affine transformations over their inputs:

$$f_\theta^l = W^l \, f_\theta^{l-1} + b^l \,, \qquad (4)$$

where $W^l \in \mathbb{R}^{n^l_{hu} \times n^{l-1}_{hu}}$ and $b^l \in \mathbb{R}^{n^l_{hu}}$ are the parameters to be trained. The hyper-parameter $n^l_{hu}$ is usually called the number of hidden units of the $l$th layer.
HardTanh Layer Without a non-linearity between affine layers, our network would be a simple linear model. We chose a “hard” version of the hyperbolic tangent as non-linearity. It has the advantage of being slightly cheaper to compute (compared to the exact
hyperbolic tangent), while leaving the generalization performance unchanged (Collobert,
2004). The corresponding layer $l$ applies a HardTanh over its input vector:

$$\left[ f_\theta^l \right]_i = \mathrm{HardTanh}\!\left( \left[ f_\theta^{l-1} \right]_i \right),$$

where

$$\mathrm{HardTanh}(x) = \begin{cases} -1 & \text{if } x < -1 \\ x & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1 \end{cases}. \qquad (5)$$
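For concreteness, HardTanh as defined in (5) is just element-wise clipping; a minimal NumPy sketch (ours, not the authors' code):

```python
import numpy as np

def hard_tanh(x):
    """HardTanh (eq. 5): identity on [-1, 1], saturating at -1 and 1 elsewhere."""
    return np.clip(x, -1.0, 1.0)

print(hard_tanh(np.array([-2.5, -0.3, 0.0, 0.7, 4.0])))  # [-1.  -0.3  0.   0.7  1. ]
```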
Scoring Finally, the output size of the last layer L of our network is equal to the number
of possible tags for the task of interest. Each output can then be interpreted as a score of
the corresponding tag (given the input of the network), thanks to a carefully chosen cost
function that we will describe later in this section.
Remark 1 (Border Effects) The feature window (3) is not well defined for words near
the beginning or the end of a sentence. To circumvent this problem, we augment the sentence
with a special “PADDING” word replicated dwin /2 times at the beginning and the end. This
is akin to the use of “start” and “stop” symbols in sequence models.
Figure 3: Number of features chosen at each word position by the Max layer. We consider
a sentence approach network (Figure 2) trained for SRL. The number of “local” features
output by the convolution layer is 300 per word. By applying a Max over the sentence,
we obtain 300 features for the whole sentence. It is interesting to see that the network
catches features mostly around the verb of interest (here “report”) and word of interest
(“proposed” (left) or “often” (right)).
Convolution Layer A convolution layer applies the same affine transformation over all
successive windows in the sequence. Using the previous notations, the $t$th output column of the $l$th layer
can be computed as:

$$\langle f_\theta^l \rangle_t^1 = W^l \, \langle f_\theta^{l-1} \rangle_t^{d_{win}} + b^l \quad \forall t \,, \qquad (6)$$
where the weight matrix W l is the same across all windows t in the sequence. Convolutional
layers extract local features around each window of the given sequence. As for standard
affine layers (4), convolutional layers are often stacked to extract higher level features.
In this case, each layer must be followed by a non-linearity (5) or the network would be
equivalent to one convolutional layer.
Max Layer The size of the output (6) depends on the number of words in the sentence
fed to the network. Local feature vectors extracted by the convolutional layers have to be
combined to obtain a global feature vector, with a fixed size independent of the sentence
length, in order to apply subsequent standard affine layers. Traditional convolutional
networks often apply an average (possibly weighted) or a max operation over the “time” t
of the sequence (6). (Here, “time” just means the position in the sentence, this term stems
from the use of convolutional layers in e.g. speech data where the sequence occurs over
time.) The average operation does not make much sense in our case, as in general most
words in the sentence do not have any influence on the semantic role of a given word to tag.
Instead, we used a max approach, which forces the network to capture the most useful local
features produced by the convolutional layers (see Figure 3), for the task at hand. Given a
matrix $f_\theta^{l-1}$ output by a convolutional layer $l-1$, the Max layer $l$ outputs a vector $f_\theta^l$:

$$\left[ f_\theta^l \right]_i = \max_t \; \left[ f_\theta^{l-1} \right]_{i,\,t} \qquad 1 \le i \le n_{hu}^{l-1} \,. \qquad (7)$$
This fixed-sized global feature vector can then be fed to standard affine network layers (4).
As in the window approach, we then finally produce one score per possible tag for the given
task.
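To make equations (6) and (7) concrete, the sketch below (our illustration, not the released tagger) applies the same affine map to every d_win-column window of the lookup-table output and then takes the max over positions, yielding a fixed-size sentence vector:

```python
import numpy as np

def convolution(X, W, b, d_win):
    """Eq. (6): apply the same affine map to every d_win-column window of X.

    X : (d, T) matrix of per-word features (already padded at both ends).
    W : (n_hu, d * d_win) weight matrix, b : (n_hu,) bias.
    Returns an (n_hu, T - d_win + 1) matrix of local feature vectors.
    """
    d, T = X.shape
    cols = []
    for t in range(T - d_win + 1):
        win = X[:, t:t + d_win].reshape(-1, order="F")  # stack the d_win columns
        cols.append(W @ win + b)
    return np.stack(cols, axis=1)

def max_over_time(C):
    """Eq. (7): keep, for each feature, its maximum over all positions t."""
    return C.max(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 12))           # 12 padded word positions, 50 features each
W = rng.normal(size=(300, 50 * 5))      # 300 local features, window of 5 words
b = np.zeros(300)
sentence_vector = max_over_time(convolution(X, W, b, 5))
print(sentence_vector.shape)            # (300,)
```

The dimensions are illustrative; the point is that the sentence vector has a fixed size regardless of the sentence length.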
Table 3: Various tagging schemes. Each word in a segment labeled “X” is tagged with a
prefixed label, depending on the word position in the segment (begin, inside, end). A single-word
segment receives its own tag. Words not in a labeled segment are labeled “O”.
Variants of the IOB (and IOE) scheme exist, where the prefix B (or E) is replaced by I for
all segments not contiguous with another segment having the same label “X”.
Remark 2 The same border effects arise in the convolution operation (6) as in the window
approach (3). We again work around this problem by padding the sentences with a special
word.
3.2.3 Tagging Schemes
As explained earlier, the network output layers compute scores for all the possible tags for
the task of interest. In the window approach, these tags apply to the word located in the
center of the window. In the (convolutional) sentence approach, these tags apply to the
word designated by additional markers in the network input.
The POS task indeed consists of marking the syntactic role of each word. However, the
remaining three tasks associate labels with segments of a sentence. This is usually achieved
by using special tagging schemes to identify the segment boundaries, as shown in Table 3.
Several such schemes have been defined (IOB, IOE, IOBES, . . . ) without clear conclusion
as to which scheme is better in general. State-of-the-art performance is sometimes obtained
by combining classifiers trained with different tagging schemes (e.g. Kudo and Matsumoto,
2001).
The ground truth for the NER, CHUNK, and SRL tasks is provided using two different
tagging schemes. In order to eliminate this additional source of variations, we have decided
to use the most expressive IOBES tagging scheme for all tasks. For instance, in the CHUNK
task, we describe noun phrases using four different tags. Tag “S-NP” is used to mark a noun
phrase containing a single word. Otherwise tags “B-NP”, “I-NP”, and “E-NP” are used
to mark the first, intermediate and last words of the noun phrase. An additional tag “O”
marks words that are not members of a chunk. During testing, these tags are then converted
to the original IOB tagging scheme and fed to the standard performance evaluation scripts
mentioned in Section 2.5.
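As an illustration of the IOBES convention and of the conversion back to IOB at test time (our own sketch, not the authors' preprocessing code):

```python
def segment_to_iobes(words_with_labels):
    """Tag each word of a labeled segment with S-/B-/I-/E- prefixes; 'O' elsewhere.

    words_with_labels: list of (word, label) pairs where label is e.g. 'NP' or None.
    Contiguous words sharing the same non-None label form one segment.
    """
    tags, i, n = [], 0, len(words_with_labels)
    while i < n:
        word, label = words_with_labels[i]
        if label is None:
            tags.append("O"); i += 1; continue
        j = i
        while j + 1 < n and words_with_labels[j + 1][1] == label:
            j += 1
        if i == j:
            tags.append(f"S-{label}")
        else:
            tags.append(f"B-{label}")
            tags.extend(f"I-{label}" for _ in range(i + 1, j))
            tags.append(f"E-{label}")
        i = j + 1
    return tags

def iobes_to_iob(tags):
    """Convert IOBES tags back to IOB (S- becomes B-, E- becomes I-)."""
    return [t.replace("S-", "B-", 1).replace("E-", "I-", 1) if t != "O" else t
            for t in tags]

example = [("The", "NP"), ("cat", "NP"), ("sat", None), ("there", "NP")]
print(segment_to_iobes(example))                 # ['B-NP', 'E-NP', 'O', 'S-NP']
print(iobes_to_iob(segment_to_iobes(example)))   # ['B-NP', 'I-NP', 'O', 'B-NP']
```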
3.3 Training
All our neural networks are trained by maximizing a likelihood over the training data, using
stochastic gradient ascent. If we denote by $\theta$ all the trainable parameters of the network,
which are trained using a training set $\mathcal{T}$, we want to maximize the following log-likelihood
with respect to $\theta$:

$$\theta \;\mapsto\; \sum_{(x,\,y) \in \mathcal{T}} \log p(y \,|\, x, \theta) \,, \qquad (8)$$
where x corresponds to either a training word window or a sentence and its associated
features, and y represents the corresponding tag. The probability p(·) is computed from the
outputs of the neural network. We will see in this section two ways of interpreting neural
network outputs as probabilities.
A first way is to consider each word independently and to interpret the tag scores as
conditional tag probabilities, by applying a softmax operation over all the tags:

$$p(i \,|\, x, \theta) = \frac{e^{[f_\theta]_i}}{\sum_j e^{[f_\theta]_j}} \,. \qquad (9)$$

Defining the log-add operation as

$$\operatorname*{logadd}_i \, z_i = \log\Big( \sum_i e^{z_i} \Big) \,, \qquad (10)$$
we can express the log-likelihood for one training example $(x, y)$ as follows:

$$\log p(y \,|\, x, \theta) = [f_\theta]_y - \operatorname*{logadd}_j \, [f_\theta]_j \,. \qquad (11)$$
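Stated in code, the word-level criterion is a cross-entropy over tag scores, with the logadd of (10) computed stably as a log-sum-exp; this is our own sketch, not the authors' implementation:

```python
import numpy as np

def logadd(z):
    """Numerically stable logadd (eq. 10): log(sum_i exp(z_i))."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def word_level_log_likelihood(scores, y):
    """Eq. (11): log p(y | x) = [f_theta]_y - logadd_j [f_theta]_j.

    scores : vector of network output scores, one per tag.
    y      : index of the true tag.
    """
    return scores[y] - logadd(scores)

scores = np.array([1.2, -0.4, 3.0, 0.1])      # made-up scores for 4 tags
print(word_level_log_likelihood(scores, 2))   # close to 0 when tag 2 dominates
```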
We introduce a transition score $[A]_{i,j}$ for jumping from tag $i$ to tag $j$ in successive words, and
an initial score $[A]_{i,0}$ for starting from the $i$th tag. As the transition scores are going to be
trained (as are all network parameters $\theta$), we define $\tilde{\theta} = \theta \cup \{[A]_{i,j} \;\; \forall i, j\}$. The score of
a sentence $[x]_1^T$ along a path of tags $[i]_1^T$ is then given by the sum of transition scores and
network scores:

$$s([x]_1^T, [i]_1^T, \tilde{\theta}) = \sum_{t=1}^{T} \Big( [A]_{[i]_{t-1},\,[i]_t} + [f_\theta]_{[i]_t,\,t} \Big) \,. \qquad (12)$$
Exactly as for the word-level likelihood (11), where we were normalizing with respect to all
tags using a softmax (9), we normalize this score over all possible tag paths [j]T1 using a
softmax, and we interpret the resulting ratio as a conditional tag path probability. Taking
the log, the conditional probability of the true path [y]T1 is therefore given by:
$$\log p([y]_1^T \,|\, [x]_1^T, \tilde{\theta}) = s([x]_1^T, [y]_1^T, \tilde{\theta}) \,-\, \operatorname*{logadd}_{\forall [j]_1^T} \, s([x]_1^T, [j]_1^T, \tilde{\theta}) \,. \qquad (13)$$
While the number of terms in the logadd operation (11) was equal to the number of tags, it
grows exponentially with the length of the sentence in (13). Fortunately, one can compute
it in linear time with the following standard recursion over t, taking advantage of the
associativity and distributivity on the semi-ring9 (R ∪ {−∞}, logadd, +):
$$\delta_t(k) \;\stackrel{\Delta}{=}\; \operatorname*{logadd}_{\{[j]_1^t \,\cap\, [j]_t = k\}} s([x]_1^t, [j]_1^t, \tilde{\theta}) \;=\; [f_\theta]_{k,\,t} \,+\, \operatorname*{logadd}_i \Big( \delta_{t-1}(i) + [A]_{i,\,k} \Big) \quad \forall k \,, \qquad (14)$$

followed by the termination

$$\operatorname*{logadd}_{\forall [j]_1^T} \, s([x]_1^T, [j]_1^T, \tilde{\theta}) = \operatorname*{logadd}_k \; \delta_T(k) \,. \qquad (15)$$
We can now maximize in (8) the log-likelihood (13) over all the training pairs $([x]_1^T, [y]_1^T)$.
At inference time, given a sentence $[x]_1^T$ to tag, we have to find the best tag path, that is, the one which
maximizes the sentence score (12). In other words, we must find

$$\operatorname*{argmax}_{[j]_1^T} \; s([x]_1^T, [j]_1^T, \tilde{\theta}) \,. \qquad (16)$$
The Viterbi algorithm is the natural choice for this inference. It corresponds to performing
the recursion (14) and (15), but where the logadd is replaced by a max, and then tracking
back the optimal path through each max.
9. In other words, read logadd as ⊕ and + as ⊗.
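A compact sketch of the forward recursion (14)–(15) and of the corresponding Viterbi decoding (logadd replaced by max, plus backtracking) is given below; it is our own illustration under the paper's notation, not the released code:

```python
import numpy as np

def logadd(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def sentence_log_partition(scores, A, A0):
    """logadd over all tag paths of s([x]_1^T, [j]_1^T) via the recursion (14)-(15).

    scores : (T, K) network scores [f_theta]_{k,t}, one row per word, one column per tag.
    A      : (K, K) transition scores [A]_{i,j}; A0 : (K,) initial scores [A]_{i,0}.
    """
    delta = A0 + scores[0]                                    # delta_1(k)
    for t in range(1, len(scores)):
        delta = scores[t] + np.array(
            [logadd(delta + A[:, k]) for k in range(len(delta))])
    return logadd(delta)

def viterbi(scores, A, A0):
    """Best tag path: same recursion with logadd replaced by max, plus backtracking."""
    T, K = scores.shape
    delta = A0 + scores[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + A                            # trans[i, k] = delta(i) + A[i, k]
        back[t] = trans.argmax(axis=0)
        delta = scores[t] + trans.max(axis=0)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
f = rng.normal(size=(6, 5))        # toy scores: 6 words, 5 tags
A, A0 = rng.normal(size=(5, 5)), rng.normal(size=5)
print(viterbi(f, A, A0), float(sentence_log_partition(f, A, A0)))
```

The training criterion (13) is then simply the score of the true path minus the log-partition term returned by the forward recursion.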
2003) or SRL (Cohn and Blunsom, 2005). Compared to such CRFs, we take advantage of
the nonlinear network to learn appropriate features for each task of interest.
Stochastic Gradient Maximizing (8) with stochastic gradient ascent is achieved by iteratively selecting a random example $(x, y)$ and making the following gradient step:

$$\theta \;\longleftarrow\; \theta + \lambda \, \frac{\partial \log p(y \,|\, x, \theta)}{\partial \theta} \,, \qquad (17)$$
where λ is a chosen learning rate. Our neural networks described in Figure 1 and Figure 2
are a succession of layers that correspond to successive composition of functions. The neural
network is finally composed with the word-level log-likelihood (11), or successively composed
in the recursion (14) if using the sentence-level log-likelihood (13). Thus, an analytical
formulation of the derivative (17) can be computed, by applying the differentiation chain
rule through the network, and through the word-level log-likelihood (11) or through the
recurrence (14).
Table 5: Hyper-parameters of our networks. We report for each task the window size
(or convolution size), word feature dimension, capital feature dimension, number of hidden
units and learning rate.

Task     Window/Conv. size   Word dim.   Caps dim.   Hidden units                 Learning rate
POS      dwin = 5            d0 = 50     d1 = 5      n1_hu = 300                  λ = 0.01
CHUNK    ”                   ”           ”           ”                            ”
NER      ”                   ”           ”           ”                            ”
SRL      ”                   ”           ”           n1_hu = 300, n2_hu = 500     ”
compute derivatives with respect to its inputs and with respect to its trainable parameters,
as proposed by Bottou and Gallinari (1991). This allows us to easily build variants of our
networks. For details about gradient computations, see Appendix A.
Remark 7 (Tricks) Many tricks have been reported for training neural networks (LeCun
et al., 1998). Which ones to choose is often confusing. We employed only two of them: the
initialization and update of the parameters of each network layer were done according to
the “fan-in” of the layer, that is the number of inputs used to compute each output of this
layer (Plaut and Hinton, 1987). The fan-in for the lookup table (1), the $l$th linear layer (4)
and the convolution layer (6) are respectively 1, $n_{hu}^{l-1}$ and $d_{win} \times n_{hu}^{l-1}$. The initial parameters
of the network were drawn from a centered uniform distribution, with a variance equal to
the inverse of the square-root of the fan-in. The learning rate in (17) was divided by the
fan-in, but stays fixed during the training.
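A sketch (ours) of this fan-in trick for one linear layer: parameters drawn from a centered uniform distribution whose variance is the inverse of the square root of the fan-in, and a per-layer learning rate equal to the global rate divided by the fan-in:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_linear(n_out, n_in):
    """Centered uniform init with variance 1/sqrt(fan-in); fan-in = n_in for a linear layer."""
    var = 1.0 / np.sqrt(n_in)            # target variance
    a = np.sqrt(3.0 * var)               # Uniform(-a, a) has variance a^2 / 3
    W = rng.uniform(-a, a, size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b

def layer_learning_rate(base_lr, fan_in):
    """The global learning rate lambda is divided by the fan-in of each layer."""
    return base_lr / fan_in

W, b = init_linear(300, 250)             # e.g. d_win * d_wrd = 5 * 50 inputs (illustrative)
print(round(W.std() ** 2, 4), layer_learning_rate(0.01, 250))
```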
Table 6: Word embeddings in the word lookup table of a SRL neural network trained from
scratch, with a dictionary of size 100,000. For each column the queried word is followed by
its index in the dictionary (higher means more rare) and its 10 nearest neighbors (using the
Euclidean metric, which was chosen arbitrarily).
the other tasks. We performed experiments both with the word-level log-likelihood (WLL)
and with the sentence-level log-likelihood (SLL). The hyper-parameters of our networks are
reported in Table 5. All our networks were fed with two raw text features: lower case words,
and a capital letter feature. We chose to consider lower case words to limit the number
of words in the dictionary. However, to keep some upper case information lost by this
transformation, we added a “caps” feature which tells if each word was in low caps, was all
caps, had first letter capital, or had one capital. Additionally, all occurrences of sequences
of numbers within a word are replaced with the string “NUMBER”, so for example both the
words “PS1” and “PS2” would map to the single word “psNUMBER”. We used a dictionary
containing the 100,000 most common words in WSJ (case insensitive). Words outside this
dictionary were replaced by a single special “RARE” word.
Results show that neural networks “out-of-the-box” are behind baseline benchmark
systems. Looking at all the submitted systems reported on each CoNLL challenge website
showed us that the performance of our networks is nevertheless within the ballpark of existing
approaches. The training criterion which takes into account the sentence structure (SLL)
seems to boost the performance for the Chunking, NER and SRL tasks, with little advantage
for POS. This result is in line with existing NLP studies comparing sentence-level and word-
level likelihoods (Liang et al., 2008). The capacity of our network architectures lies mainly
in the word lookup table, which contains 50 × 100,000 parameters to train. In the WSJ data,
15% of the most common words appear about 90% of the time. Many words appear only
a few times. It is thus very difficult to properly train their corresponding 50-dimensional
feature vectors in the lookup table. Ideally, we would like semantically similar words to be
close in the embedding space represented by the word lookup table: by continuity of the
neural network function, tags produced on semantically similar sentences would be similar.
We show in Table 6 that this is not the case: neighboring words in the embedding space do
not seem to be semantically related.
Figure 4: F1 score on the validation set (y-axis) versus number of hidden units (x-axis)
for different tasks trained with the sentence-level likelihood (SLL), as in Table 4. For SRL,
we vary in this graph only the number of hidden units in the second layer. The scale is
adapted for each task. We show the standard deviation (obtained over 5 runs with different
random initialization), for the architecture we picked (300 hidden units for POS, CHUNK
and NER, 500 for SRL).
We will focus in the next section on improving these word embeddings by leveraging
unlabeled data. We will see our approach results in a performance boost for all tasks.
Remark 8 (Architectures) In all our experiments in this paper, we tuned the hyper-
parameters by trying only a few different architectures by validation. In practice, the choice
of hyperparameters such as the number of hidden units, provided they are large enough, has
a limited impact on the generalization performance. In Figure 4, we report the F1 score
for each task on the validation set, with respect to the number of hidden units. Considering
the variance related to the network initialization, we chose the smallest network achieving
“reasonable” performance, rather than picking the network achieving the top performance
obtained on a single run.
Remark 9 (Training Time) Training our network is quite computationally expensive.
Chunking and NER take about one hour to train, POS takes a few hours, and SRL takes
about three days. Training could be faster with a larger learning rate, but we preferred to
stick to a small one which works, rather than finding the optimal one for speed. Second-
order methods (LeCun et al., 1998) could be another speedup technique.
4. Lots of Unlabeled Data
4.1 Datasets
Our first English corpus is the entire English Wikipedia.11 We have removed all paragraphs
containing non-roman characters and all MediaWiki markups. The resulting text was
tokenized using the Penn Treebank tokenizer script.12 The resulting dataset contains about
631 million words. As in our previous experiments, we use a dictionary containing the
100,000 most common words in WSJ, with the same processing of capitals and numbers.
Again, words outside the dictionary were replaced by the special “RARE” word.
Our second English corpus is composed by adding an extra 221 million words extracted
from the Reuters RCV1 (Lewis et al., 2004) dataset.13 We also extended the dictionary to
130,000 words by adding the 30,000 most common words in Reuters. This is useful in order
to determine whether improvements can be achieved by further increasing the unlabeled
dataset size.
We used these unlabeled datasets to train language models that compute scores describing
the acceptability of a piece of text. These language models are again large neural networks
using the window approach described in Section 3.2.1 and in Figure 1. As in the previous
section, most of the trainable parameters are located in the lookup tables.
Similar language models were already proposed by Bengio and Ducharme (2001) and
Schwenk and Gauvain (2002). Their goal was to estimate the probability of a word given
the previous words in a sentence. Estimating conditional probabilities suggests a cross-
entropy criterion similar to those described in Section 3.3.1. Because the dictionary size is
large, computing the normalization term can be extremely demanding, and sophisticated
approximations are required. More importantly for us, neither work leads to significant
word embeddings being reported.
Shannon (1951) has estimated the entropy of the English language between 0.6 and 1.3
bits per character by asking human subjects to guess upcoming characters. Cover and King
(1978) give a lower bound of 1.25 bits per character using a subtle gambling approach.
Meanwhile, using a simple word trigram model, Brown et al. (1992b) reach 1.75 bits per
character. Teahan and Cleary (1996) obtain entropies as low as 1.46 bits per character
using variable length character n-grams. The human subjects rely of course on all their
knowledge of the language and of the world. Can we learn the grammatical structure of the
English language and the nature of the world by leveraging the 0.2 bits per character that
separate human subjects from simple n-gram models? Since such tasks certainly require
high capacity models, obtaining sufficiently small confidence intervals on the test set entropy
may require prohibitively large training sets.14 The entropy criterion lacks dynamical range
because its numerical value is largely determined by the most frequent phrases. In order to
learn syntax, rare but legal phrases are no less significant than common phrases.
11. Available at http://download.wikimedia.org. We took the November 2007 version.
12. Available at http://www.cis.upenn.edu/~treebank/tokenization.html.
13. Now available at http://trec.nist.gov/data/reuters/reuters.html.
14. However, Klein and Manning (2002) describe a rare example of realistic unsupervised grammar induction
using a cross-entropy approach on binary-branching parsing trees, that is, by forcing the system to
generate a hierarchical representation.
More precisely, we minimize the following ranking criterion with respect to $\theta$:

$$\theta \;\mapsto\; \sum_{x \in X} \sum_{w \in D} \max\Big\{ 0,\; 1 - f_\theta(x) + f_\theta(x^{(w)}) \Big\} \,, \qquad (18)$$

where $X$ is the set of all possible text windows with $d_{win}$ words coming from our training
corpus, $D$ is the dictionary of words, and $x^{(w)}$ denotes the text window obtained by replacing
the central word of text window $[w]_1^{d_{win}}$ by the word $w$.
Okanohara and Tsujii (2007) use a related approach, avoiding the entropy criterion by
means of a binary classification task (correct/incorrect phrase). Their work focuses on
using a kernel classifier, and not on learning word embeddings as we do here. Smith and
Eisner (2005) also propose a contrastive criterion which estimates the likelihood of the data
conditioned on a “negative” neighborhood. They consider various data neighborhoods,
including sentences of length dwin drawn from Ddwin . Their goal was however to perform
well on some tagging task on fully unsupervised data, rather than obtaining generic word
embeddings useful for other tasks.
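The sketch below is our own illustration of one stochastic pass over the ranking criterion (18): for each training window, a random replacement word is drawn, the central word is corrupted, and the pairwise hinge loss is accumulated (the scoring function stands in for the window network and is hypothetical here):

```python
import numpy as np

rng = np.random.default_rng(0)

def ranking_loss(score_true, score_corrupt):
    """Pairwise ranking hinge of (18): max(0, 1 - f(x) + f(x^(w)))."""
    return max(0.0, 1.0 - score_true + score_corrupt)

def stochastic_ranking_pass(windows, score_fn, dictionary_size):
    """For each training window, draw a random word w, corrupt the central
    position with it, and accumulate the hinge loss; gradients of this loss
    would then be used to update the network and the embeddings."""
    total = 0.0
    for x in windows:
        w = int(rng.integers(dictionary_size))     # random replacement word
        x_corrupt = list(x)
        x_corrupt[len(x) // 2] = w                 # replace the central word
        total += ranking_loss(score_fn(x), score_fn(x_corrupt))
    return total

# Toy usage: windows are lists of word indices; a dummy scorer stands in for f_theta.
dummy_score = lambda x: float(sum(x) % 7) / 7.0
windows = [[3, 8, 1, 4, 2], [9, 0, 5, 5, 1]]
print(stochastic_ranking_pass(windows, dummy_score, dictionary_size=100))
```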
these on the k processors. In our case, possible parameters to adjust are: the learning rate
λ, the word embedding dimensions d, number of hidden units n1hu and input window size
dwin . One then trains each of these models in an online fashion for a certain amount of
time (i.e. a few days), and then selects the best ones using the validation set error rate.
That is, breeding decisions were made on the basis of the value of the ranking criterion (18)
estimated on a validation set composed of one million words held out from the Wikipedia
corpus. In the next breeding iteration, one then chooses another set of k parameters from
the possible grid of values that permute slightly the most successful candidates from the
previous round. As many of these parameter choices can share weights, we can effectively
continue online training retaining some of the learning from the previous iterations.
Very long training times make such strategies necessary for the foreseeable future: if we
had been given computers ten times faster, we probably would have found uses for datasets
ten times bigger. However, we should say we believe that although we ended up with a
particular choice of parameters, many other choices are almost equally as good, although
perhaps there are others that are better as we could not do a full grid search.
In the following subsections, we report results obtained with two trained language
models. The results achieved by these two models are representative of those achieved
by networks trained on the full corpuses.
• Language model LM1 has a window size dwin = 11 and a hidden layer with n1hu = 100
units. The embedding layers were dimensioned like those of the supervised networks
(Table 5). Model LM1 was trained on our first English corpus (Wikipedia) using
successive dictionaries composed of the 5,000, 10,000, 30,000, 50,000 and finally
100,000 most common WSJ words. The total training time was about four weeks.
• Language model LM2 has the same dimensions. It was initialized with the embeddings
of LM1, and trained for an additional three weeks on our second English corpus
(Wikipedia+Reuters) using a dictionary size of 130,000 words.
4.4 Embeddings
Both networks produce much more appealing word embeddings than in Section 3.4. Table 7
shows the ten nearest neighbors of a few randomly chosen query words for the LM1 model.
The syntactic and semantic properties of the neighbors are clearly related to those of the
query word. These results are far more satisfactory than those reported in Table 6 for
embeddings obtained using purely supervised training of the benchmark NLP tasks.
• Ad-hoc approaches such as (Rosenfeld and Feldman, 2007) for relation extraction.
Table 7: Word embeddings in the word lookup table of the language model neural network
LM1 trained with a dictionary of size 100,000. For each column the queried word is followed
by its index in the dictionary (higher means more rare) and its 10 nearest neighbors (using
the Euclidean metric, which was chosen arbitrarily).
set with examples from the unlabeled dataset using the labels predicted by the model
itself. Transductive approaches, such as (Joachims, 1999) for text classification, can
be viewed as a refined form of self-training.
• Parameter sharing approaches such as (Ando and Zhang, 2005; Suzuki and Isozaki,
2008). Ando and Zhang propose a multi-task approach where they jointly train
models sharing certain parameters. They train POS and NER models together with a
language model (trained on 15 million words) consisting of predicting words given the
surrounding tokens. Suzuki and Isozaki embed a generative model (Hidden Markov
Model) inside a CRF for POS, Chunking and NER. The generative model is trained
on one billion words. These approaches should be seen as a linear counterpart of our
work. Using multilayer models vastly expands the parameter sharing opportunities
(see Section 5).
Our approach simply consists of initializing the word lookup tables of the supervised
networks with the embeddings computed by the language models. Supervised training is
then performed as in Section 3.4. In particular the supervised training stage is free to
modify the lookup tables. This sequential approach is computationally convenient because
it separates the lengthy training of the language models from the relatively fast training of
the supervised networks. Once the language models are trained, we can perform multiple
experiments on the supervised networks in a relatively short time. Note that our procedure
is clearly linked to the (semi-supervised) deep learning procedures of (Hinton et al., 2006;
Bengio et al., 2007; Weston et al., 2008).
Table 8 clearly shows that this simple initialization significantly boosts the generalization
performance of the supervised networks for each task. It is worth mentioning that the larger
language model led to even better performance. This suggests that we could still take
advantage of even bigger unlabeled datasets.
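In code, the transfer amounts to nothing more than copying the language-model lookup table into the supervised network before training; a schematic sketch (ours, with hypothetical variable names and a fake table):

```python
import numpy as np

# W_lm: word lookup table learned by the language model (d_wrd x |D|).
# Faked here; in practice it would come from the trained LM1 or LM2 network.
rng = np.random.default_rng(0)
W_lm = rng.normal(size=(50, 100_000))

# Supervised network: initialize its word lookup table with a copy of the LM
# embeddings (supervised training remains free to modify it), and initialize
# the remaining layers randomly as in Section 3.
W_supervised = W_lm.copy()

print(np.allclose(W_lm, W_supervised))  # True before any supervised update
```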
for other tasks.
“Starting for convenience with very short sentence forms, say ABC, we
choose a particular word choice for all the classes, say B_q C_q, except one, in
this case A; for every pair of members A_i, A_j of that word class we ask how
the sentence formed with one of the members, i.e. A_i B_q C_q, compares as to
acceptability with the sentence formed with the other member, i.e. A_j B_q C_q.”

“It now turns out that, given the graded n-tuples of words for a particular
sentence form, we can find other sentence forms of the same word classes in
which the same n-tuples of words produce the same grading of sentences.”
This is an indication that these two sentence forms exploit common words with the same
syntactic function and possibly the same meaning. This observation forms the empirical
basis for the construction of operator grammars that describe real-world natural languages
such as English.
Therefore there are solid reasons to believe that the ranking criterion (18) has the
conceptual potential to capture strong syntactic and semantic information. On the other
hand, the structure of our language models is probably too restrictive for such goals, and
our current approach only exploits the word embeddings discovered during training.
5. Multi-Task Learning
It is generally accepted that features trained for one task can be useful for related tasks. This
idea was already exploited in the previous section when certain language model features,
namely the word embeddings, were used to initialize the supervised networks.
Multi-task learning (MTL) leverages this idea in a more systematic way. Models for
all tasks of interest are jointly trained with an additional linkage between their trainable
parameters in the hope of improving the generalization error. This linkage can take the form
of a regularization term in the joint cost function that biases the models towards common
representations. A much simpler approach consists in having the models share certain
parameters defined a priori. Multi-task learning has a long history in machine learning and
neural networks. Caruana (1997) gives a good overview of these past efforts.
makes sense when the training data blocks these additional dependency paths (in the sense
of d-separation, Pearl, 1988). This implies that, without joint training, the additional
dependency paths cannot directly involve unobserved variables. Therefore, the natural idea
of discovering common internal representations across tasks requires joint training.
Joint training is relatively straightforward when the training sets for the individual
tasks contain the same patterns with different labels. It is then sufficient to train a model
that computes multiple outputs for each pattern (Suddarth and Holden, 1991). Using
this scheme, Sutton et al. (2007) demonstrate improvements on POS tagging and noun-phrase
chunking using jointly trained CRFs. However, the joint labeling requirement is a
limitation because such data is not often available. Miller et al. (2000) achieve performance
improvements by jointly training NER, parsing, and relation extraction in a statistical
parsing model. The joint labeling requirement problem was weakened using a predictor to
fill in the missing annotations.
Ando and Zhang (2005) propose a setup that works around the joint labeling
requirements. They define linear models of the form $f_i(x) = w_i^\top \Phi(x) + v_i^\top \Theta \Psi(x)$, where
$f_i$ is the classifier for the $i$-th task with parameters $w_i$ and $v_i$. Notations $\Phi(x)$ and $\Psi(x)$
represent engineered features for the pattern x. Matrix Θ maps the Ψ(x) features into a low
dimensional subspace common across all tasks. Each task is trained using its own examples
without a joint labeling requirement. The learning procedure alternates the optimization
of wi and vi for each task, and the optimization of Θ to minimize the average loss for all
examples in all tasks. The authors also consider auxiliary unsupervised tasks for predicting
substructures. They report excellent results on several tasks, including POS and NER.
Figure 5: Example of multitasking with NN. Task 1 and Task 2 are two tasks trained with
the window approach architecture presented in Figure 1. Lookup tables as well as the first
hidden layer are shared. The last layer is task specific. The principle is the same with more
than two tasks.
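A minimal sketch (ours, not the authors' code) of the sharing scheme of Figure 5: two window-approach networks that share the lookup table and the first linear layer, with task-specific output layers; only the word feature is used here for brevity and the tag counts are hypothetical. A joint training loop would alternate stochastic updates on examples drawn from each task:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(n_out, n_in):
    return rng.normal(scale=n_in ** -0.5, size=(n_out, n_in)), np.zeros(n_out)

hard_tanh = lambda x: np.clip(x, -1.0, 1.0)

# Shared parameters: word lookup table and first linear layer (Figure 5).
W_lookup = rng.normal(size=(50, 100_000))          # d_wrd x |D| (illustrative sizes)
W1, b1 = linear(300, 5 * 50)                       # window of 5 words, 300 hidden units

# Task-specific output layers (hypothetical tag counts).
W_task = {"pos": linear(45, 300), "chunk": linear(89, 300)}

def scores(task, word_window):
    """Window-approach forward pass: shared layers, then the task's output layer."""
    x = np.concatenate([W_lookup[:, w] for w in word_window])   # lookup + concat
    h = hard_tanh(W1 @ x + b1)                                   # shared hidden layer
    W2, b2 = W_task[task]
    return W2 @ h + b2

window = [12, 4031, 7, 256, 99]          # five (hypothetical) word indices
print(scores("pos", window).shape, scores("chunk", window).shape)
```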
and finally, the SRL network was trained with additional CHUNK features.
datasets. The generalization performance for each task was measured using the traditional
testing data specified in Table 1. Fortunately, none of the training and test sets overlap
across tasks.
While we find it worth mentioning that MTL can produce a single unified architecture that
performs well for all these tasks, no (or only marginal) improvements were obtained with
this approach compared to training separate architectures per task (which still use semi-supervised
learning, which is in some sense the most important MTL task). The next section
shows we can leverage known correlations between tasks in a more direct manner.
6. The Temptation
Results so far have been obtained by staying (almost15 ) true to our from scratch philosophy.
We have so far avoided specializing our architecture for any task, disregarding a lot of useful
a priori NLP knowledge. We have shown that, thanks to large unlabeled datasets, our
generic neural networks can still achieve close to state-of-the-art performance by discovering
useful features. This section explores what happens when we increase the level of task-
specific engineering in our systems by incorporating some common techniques from the
NLP literature. We often obtain further improvements. These figures are useful to quantify
how far we went by leveraging large datasets instead of relying on a priori knowledge.
6.1 Suffix Features
uses inputs representing word suffixes and prefixes up to four characters. We achieve this
in the POS task by adding discrete word features (Section 3.1.1) representing the last two
characters of every word. The size of the suffix dictionary was 455. This led to a small
improvement of the POS performance (Table 10, row NN+SLL+LM2+Suffix2). We also tried
suffixes obtained with the Porter (1980) stemmer and obtained the same performance as
when using two character suffixes.
6.2 Gazetteers
State-of-the-art NER systems often use a large dictionary containing well known named
entities (e.g. Florian et al., 2003). We restricted ourselves to the gazetteer provided
by the CoNLL challenge, containing 8,000 locations, person names, organizations, and
miscellaneous entities. We trained a NER network with 4 additional word features indicating
(feature “on” or “off”) whether the word is found in the gazetteer under one of these four
categories. The gazetteer includes not only words, but also chunks of words. If a sentence
chunk is found in the gazetteer, then all words in the chunk have their corresponding
gazetteer feature turned to “on”. The resulting system displays a clear performance
improvement (Table 10, row NN+SLL+LM2+Gazetteer), slightly outperforming the baseline.
A plausible explanation of this large boost over the network using only the language model
is that gazetteers include word chunks, while we use only the word representation of our
language model. For example, “united” and “bicycle” seen separately are likely to be non-
entities, while “united bicycle” might be an entity, but catching it would require higher
level representations of our language model.
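As an illustration of the chunk-level gazetteer lookup described above (our sketch, with a toy gazetteer), every word belonging to a sentence chunk found in the gazetteer gets its feature turned “on” for the matching category:

```python
def gazetteer_features(words, gazetteer, max_len=5):
    """For each word, one on/off flag per gazetteer category.

    words     : list of (lower-cased) tokens.
    gazetteer : dict mapping a category to a set of word chunks (tuples of tokens).
    A flag is turned on for every word of a sentence chunk found in the gazetteer.
    """
    categories = list(gazetteer)
    flags = [{c: False for c in categories} for _ in words]
    for i in range(len(words)):
        for j in range(i + 1, min(i + 1 + max_len, len(words) + 1)):
            chunk = tuple(words[i:j])
            for c in categories:
                if chunk in gazetteer[c]:
                    for k in range(i, j):
                        flags[k][c] = True
    return flags

gaz = {"ORG": {("united", "bicycle")}, "LOC": {("princeton",)}}
sentence = ["united", "bicycle", "opened", "in", "princeton"]
for word, f in zip(sentence, gazetteer_features(sentence, gaz)):
    print(word, [c for c, on in f.items() if on])
```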
6.3 Cascading
When one considers related tasks, it is reasonable to assume that tags obtained for one task
can be useful for taking decisions in the other tasks. Conventional NLP systems often use
features obtained from the output of other preexisting NLP systems. For instance, Shen
and Sarkar (2005) describe a chunking system that uses POS tags as input; Florian et al.
(2003) describes a NER system whose inputs include POS and CHUNK tags, as well as
the output of two other NER classifiers. State-of-the-art SRL systems exploit parse trees
(Gildea and Palmer, 2002; Punyakanok et al., 2005), related to CHUNK tags, and built
using POS tags (Charniak, 2000; Collins, 1999).
Table 10 reports results obtained for the CHUNK and NER tasks by adding discrete
word features (Section 3.1.1) representing the POS tags. In order to facilitate comparisons,
instead of using the more accurate tags from our POS network, we use for each task the
POS tags provided by the corresponding CoNLL challenge. We also report results obtained
for the SRL task by adding word features representing the CHUNK tags (also provided by
the CoNLL challenge). We consistently obtain moderate improvements.
6.4 Ensembles
Constructing ensembles of classifiers is a proven way to trade computational efficiency for
generalization performance (Bell et al., 2007). Therefore it is not surprising that many
NLP systems achieve state-of-the-art performance by combining the outputs of multiple
Table 11: Comparison in generalization performance for POS, CHUNK and NER tasks of
the networks obtained by combining ten training runs with different initialization.
classifiers. For instance, Kudo and Matsumoto (2001) use an ensemble of classifiers trained
with different tagging conventions (see Section 3.2.3). Winning a challenge is of course a
legitimate objective. Yet it is often difficult to figure out which ideas are most responsible
for the state-of-the-art performance of a large ensemble.
Because neural networks are nonconvex, training runs with different initial parameters
usually give different solutions. Table 11 reports results obtained for the CHUNK and
NER task after ten training runs with random initial parameters. Voting the ten network
outputs on a per tag basis (“voting ensemble”) leads to a small improvement over the average
network performance. We have also tried a more sophisticated ensemble approach: the ten
network output scores (before sentence-level likelihood) were combined with an additional
linear layer (4) and then fed to a new sentence-level likelihood (13). The parameters of
the combining layers were then trained on the existing training set, while keeping the ten
networks fixed (“joined ensemble”). This approach did not improve on simple voting.
These ensembles come of course at the expense of a tenfold increase of the running
time. On the other hand, multiple training times could be improved using smart sampling
strategies (Neal, 1996).
We can also observe that the performance variability among the ten networks is not very
large. The local minima found by the training algorithm are usually good local minima,
thanks to the oversized parameter space and to the noise induced by the stochastic gradient
procedure (LeCun et al., 1998). In order to reduce the variance in our experimental results,
we always use the same initial parameters for networks trained on the same task (except of
course for the results reported in Table 11.)
6.5 Parsing
Gildea and Palmer (2002) offer several arguments suggesting that syntactic parsing is a
necessary prerequisite for the SRL task. The CoNLL 2005 SRL benchmark task provides
parse trees computed using both the Charniak (2000) and Collins (1999) parsers. State-of-
the-art systems often exploit additional parse trees such as the k top ranking parse trees
(Koomen et al., 2005; Haghighi et al., 2005).
In contrast our SRL networks so far do not use parse trees at all. They rely instead
on internal representations transferred from a language model trained with an objective
Figure 6: Charniak parse tree for the sentence “The luxury auto maker last year sold 1,214
cars in the U.S.”. Level 0 is the original tree. Levels 1 to 4 are obtained by successively
collapsing terminal tree branches. For each level, words receive tags describing the segment
associated with the corresponding leaf. All words receive tag “O” at level 3 in this example.
function that captures a lot of syntactic information (see Section 4.6). It is therefore
legitimate to question whether this approach is an acceptable lightweight replacement for
parse trees.
We answer this question by providing parse tree information as additional input features
to our system. We have limited ourselves to the Charniak parse tree provided with the
CoNLL 2005 data. Considering that a node in a syntactic parse tree assigns a label
to a segment of the parsed sentence, we propose a way to feed (partially) this labeled
segmentation to our network, through additional lookup tables. Each of these lookup
tables encodes the labeled segments of one parse tree level (up to a certain depth). The labeled
segments are fed to the network following an IOBES tagging scheme (see Sections 3.2.3
and 3.1.1). As there are 40 different phrase labels in WSJ, each additional tree-related
lookup table has 161 entries (40 × 4 + 1), corresponding to the IBES segment tags for each label, plus the
extra O tag.
We call level 0 the information associated with the leaves of the original Charniak parse
tree. The lookup table for level 0 encodes the corresponding IOBES phrase tags for each
word. We obtain levels 1 to 4 by repeatedly trimming the leaves, as shown in Figure 6. We
label with "O" the words belonging to the root node "S", or all words of the sentence if the root
itself has been trimmed.
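For illustration, the short sketch below (a hypothetical helper, not part of our implementation) shows how the labeled segmentation of one parse tree level can be turned into the per-word IOBES tags fed to these additional lookup tables; segment boundaries and labels are assumed to be given by the parse tree.

```python
def iobes_tags(num_words, segments):
    """Convert labeled segments into per-word IOBES tags.

    segments: list of (label, start, end) with inclusive word indices,
    e.g. [("NP", 0, 3), ("NP", 4, 5), ("VP", 6, 6)] for one tree level.
    Words not covered by any segment receive the extra "O" tag.
    """
    tags = ["O"] * num_words
    for label, start, end in segments:
        if start == end:                      # single-word segment
            tags[start] = "S-" + label
        else:
            tags[start] = "B-" + label        # beginning
            for i in range(start + 1, end):   # inside
                tags[i] = "I-" + label
            tags[end] = "E-" + label          # end
    return tags

# Level 0 of Figure 6: "The luxury auto maker | last year | sold"
print(iobes_tags(7, [("NP", 0, 3), ("NP", 4, 5), ("VP", 6, 6)]))
# ['B-NP', 'I-NP', 'I-NP', 'E-NP', 'B-NP', 'E-NP', 'S-VP']
```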
Experiments were performed with the LM2 language model, using the same network
architectures (see Table 5) and additional lookup tables of dimension 5 for each
Approach                                          SRL (valid)   SRL (test)
Benchmark System (six parse trees)                   77.35         77.92
Benchmark System (top Charniak parse tree only)      74.76         –
NN+SLL+LM2                                           72.29         74.15
NN+SLL+LM2+Charniak (level 0 only)                   74.44         75.65
NN+SLL+LM2+Charniak (levels 0 & 1)                   74.50         75.81
NN+SLL+LM2+Charniak (levels 0 to 2)                  75.09         76.05
NN+SLL+LM2+Charniak (levels 0 to 3)                  75.12         75.89
NN+SLL+LM2+Charniak (levels 0 to 4)                  75.42         76.06
NN+SLL+LM2+CHUNK                                     –             74.72
NN+SLL+LM2+PT0                                       –             75.49
Table 12: Generalization performance on the SRL task of our NN architecture compared
with the benchmark system. We show performance of our system fed with different levels
of depth of the Charniak parse tree. We report previous results of our architecture with no
parse tree as a baseline. Koomen et al. (2005) report test and validation performance using
six parse trees, as well as validation performance using only the top Charniak parse tree.
For comparison purposes, we hence also report validation performance. Finally, we report
our performance with the CHUNK feature, and compare it against a level 0 feature PT0
obtained by our network.
parse tree level. Table 12 reports the performance improvements obtained by providing
increasing levels of parse tree information. Level 0 alone increases the F1 score by almost
1.5%. Additional levels yield diminishing returns. The top performance reaches 76.06% F1
score. This is not too far from the state-of-the-art system which, we note, uses six parse
trees instead of one. Koomen et al. (2005) also report a 74.76% F1 score on the validation
set using only the Charniak parse tree. Using the first three parse tree levels, we reach this
performance on the validation set.
We also reported in Table 12 our previous performance obtained with the CHUNK
feature (see Table 10). It is surprising to observe that adding chunking features into the
semantic role labeling network performs significantly worse than adding features describing
the level 0 of the Charniak parse tree (Table 12). Indeed, if we ignore the label prefixes
“BIES” defining the segmentation, the parse tree leaves (at level 0) and the chunking
have identical labeling. However, the parse trees identify leaf sentence segments that are
often smaller than those identified by the chunking tags, as shown by Hollingshead et al.
(2005).16 Instead of relying on the Charniak parser, we chose to train a second chunking
network to identify the segments delimited by the leaves of the Penn Treebank parse trees
(level 0). Our network achieved 92.25% F1 score on this task (we call it PT0), while we
16. As in (Hollingshead et al., 2005), consider the sentence and chunk labels “(NP They) (VP are starting
to buy) (NP growth stocks)”. The parse tree can be written as “(S (NP They) (VP are (VP starting (S
(VP to (VP buy (NP growth stocks)))))))”. The tree leaves segmentation is thus given by “(NP They)
(VP are) (VP starting) (VP to) (VP buy) (NP growth stocks)”.
evaluated Charniak performance as 91.94% on the same task. As shown in Table 12, feeding
our own PT0 prediction into the SRL system gives similar performance to using Charniak
predictions, and is consistently better than the CHUNK feature.
6.6 Word Representations
Word representations derived from large unlabeled corpora, such as Brown clusters, have been used with success in several NLP systems, for instance
NER (Miller et al., 2004; Ratinov and Roth, 2009), or parsing (Koo et al., 2008). Other
related approaches exist, like phrase clustering (Lin and Wu, 2009) which has been shown
to work well for NER. Finally, Huang and Yates (2009) have recently proposed a smoothed
language modelling approach based on a Hidden Markov Model, with success on POS and
Chunking tasks.
While a comparison of all these word representations is beyond the scope of this paper,
it is rather fair to question the quality of our word embeddings compared to a popular NLP
approach. In this section, we report a comparison of our word embeddings against Brown
clusters, when used as features into our neural network architecture. We report as baseline
previous results where our word embeddings are fine-tuned for each task. We also report
performance when our embeddings are kept fixed during task-specific training. Since convex
machine learning algorithms are common practice in NLP, we finally report performances
for the convex version of our architecture.
For the convex experiments, we considered the linear version of our neural networks
(instead of having several linear layers interleaved with a non-linearity). While we always
picked the sentence approach for SRL, we had to consider the window approach in this
particular convex setup, as the sentence approach network (see Figure 2) includes a Max
layer. Having only one linear layer in our neural network is not enough to make our
architecture convex: all lookup-tables (for each discrete feature) must also be fixed. The
word-lookup table is simply fixed to the embeddings obtained from our language model
LM2. All other discrete feature lookup-tables (caps, POS, Brown Clusters...) are fixed to a
standard sparse representation. Using the notation introduced in Section 3.1.1, if LT_{W^k} is
the lookup table of the k-th discrete feature, we have W^k ∈ R^{|D^k|×|D^k|} and the representation
of the discrete input w is obtained with:
$$LT_{W^k}(w) = \langle W^k \rangle_w^1 = \big(\,0,\,\cdots,\,0,\; \underbrace{1}_{\text{at index } w},\; 0,\,\cdots,\,0\,\big)^T\,. \qquad (19)$$
Training our architecture in this convex setup with the sentence-level likelihood (13)
corresponds to training a CRF. In that respect, these convex experiments show the
performance of our word embeddings in a classical NLP framework.
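To make the distinction concrete, here is a small numpy sketch (illustrative only; the dictionary size and embedding dimension are made up) contrasting the fixed sparse representation of (19) with a dense embedding lookup:

```python
import numpy as np

vocab_size, emb_dim = 10, 4

# Convex setup: the lookup table is a fixed identity matrix, so a word is
# represented by the one-hot vector of equation (19).
W_sparse = np.eye(vocab_size)

# Non-convex setup: the lookup table holds dense embeddings (e.g. from LM2)
# that may be fine-tuned during task-specific training.
W_dense = 0.01 * np.random.randn(vocab_size, emb_dim)

w = 3                       # index of a word in the dictionary
one_hot = W_sparse[w]       # (0, 0, 0, 1, 0, ..., 0)
embedding = W_dense[w]      # a learned d-dimensional vector

print(one_hot, embedding)
```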
Table 13: Generalization performance of our neural network architecture trained with
our language model (LM2) word embeddings, and with the word representations derived
from the Brown Clusters. As before, all networks are fed with a capitalization feature.
Additionally, POS is using a word suffix of size 2 feature, CHUNK is fed with POS, NER
uses the CoNLL 2003 gazetteer, and SRL is fed with levels 1–5 of the Charniak parse tree,
as well as a verb position feature. We report performance with both convex and non-convex
architectures (300 hidden units for all tasks, with an additional 500 hidden units layer for
SRL). We also provide results for Brown Clusters induced with a 130K word dictionary, as
well as Brown Clusters induced with all words of the given tasks.
Following the Ratinov and Roth (2009) and Koo et al. (2008) setups, we generated 1,000
Brown clusters using the implementation17 from Liang (2005). To make the comparison
fair, the clusters were first induced on the concatenation of Wikipedia and Reuters datasets,
as we did in Section 4 for training our largest language model LM2, using a 130K word
dictionary. This dictionary covers about 99% of the words in the training set of each task.
To cover the last 1%, we augmented the dictionary with the missing words (reaching about
140K words) and induced Brown Clusters using the concatenation of WSJ, Wikipedia, and
Reuters.
The Brown clustering approach is hierarchical and generates a binary tree of clusters.
Each word in the vocabulary is assigned to a node in the tree. Features are extracted from
this tree by considering the path from the root to the node containing the word of interest.
Following Ratinov & Roth, we picked as features the path prefixes of size 4, 6, 10 and 20. In
the non-convex experiments, we fed these four Brown Cluster features to our architecture
using four different lookup tables, replacing our word lookup table. The size of the lookup
tables was chosen to be 12 by validation. In the convex case, we used the classical sparse
representation (19), as for any other discrete feature.
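The following sketch (a hypothetical helper, not the Liang (2005) implementation) illustrates how such path-prefix features could be read off a Brown clustering output, where each word is mapped to the bit string describing its root-to-node path:

```python
# Brown clustering output maps each word to a binary root-to-node path.
# (Toy paths for illustration; real paths come from the clustering tool.)
brown_paths = {
    "maker": "110100101110100110",
    "cars": "1110010110010",
}

PREFIX_LENGTHS = (4, 6, 10, 20)  # as in Ratinov and Roth (2009)

def brown_features(word):
    """Return the four path-prefix features used as discrete inputs."""
    path = brown_paths.get(word, "")           # unknown words get an empty path
    return [path[:k] for k in PREFIX_LENGTHS]  # paths shorter than k are kept in full

print(brown_features("cars"))
# ['1110', '111001', '1110010110', '1110010110010']
```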
We first report in Table 13 generalization performance of our best non-convex networks
trained with our LM2 language model and with Brown Cluster features. Our embeddings
perform at least as well as Brown Clusters. Results are more mixed in the convex setup.
For most tasks, going non-convex is better for both word representation types. In general,
17. Available at http://www.eecs.berkeley.edu/~pliang/software.
Task Features
POS Suffix of size 2
CHUNK POS
NER CoNLL 2003 gazetteer
PT0 POS
SRL PT0, verb position
Table 14: Features used by SENNA implementation, for each task. In addition, all tasks
use “low caps word” and “caps” features.
“fine-tuning” our embeddings for each task also gives an extra boost. Finally, using a better
word coverage with Brown Clusters (“all words” instead of “130K words” in Table 13) did
not help.
More complex features could possibly be combined instead of using a non-linear
model. For instance, Turian et al. (2010) performed a comparison of Brown Clusters and
embeddings trained in the same spirit as ours,18 with additional features combining labels
and tokens. We believe this type of comparison should be taken with care, as combining
a given feature with different word representations might not have the same effect on each
word representation.
6.7 Engineering a Sweet Spot
We implemented a standalone version of our architecture, written in the C language.
We gave the name “SENNA” (Semantic/syntactic Extraction using a Neural Network
Architecture) to the resulting system. The parameters of each architecture are the ones
described in Table 5. All the networks were trained separately on each task using the
sentence-level likelihood (SLL). The word embeddings were initialized to LM2 embeddings,
and then fine-tuned for each task. We summarize features used by our implementation
in Table 14, and we report performance achieved on each task in Table 15. The runtime
version19 contains about 2500 lines of C code, runs in less than 150MB of memory, and needs
less than a millisecond per word to compute all the tags. Table 16 compares the tagging
speeds for our system and for the few available state-of-the-art systems: the Toutanova et al.
(2003) POS tagger20 , the Shen et al. (2007) POS tagger21 and the Koomen et al. (2005) SRL
system.22 All programs were run on a single 3GHz Intel core. The POS taggers were run
with Sun Java 1.6 with a large enough memory allocation to reach their top tagging speed.
18. However they did not reach our embedding performance. There are several differences in how they
trained their models that might explain this. Firstly, they may have experienced difficulties because
they train 50-dimensional embeddings for 269K distinct words using a comparatively small training set
(RCV1, 37M words), unlikely to contain enough instances of the rare words. Secondly, they predict the
correctness of the final word of each window instead of the center word (Turian et al., 2010), effectively
restricting the model to unidirectional prediction. Finally, they do not fine tune their embeddings after
unsupervised training.
Table 15: Performance of the engineered sweet spot (SENNA) on various tagging tasks. The
PT0 task replicates the sentence segmentation of the parse tree leaves. The corresponding
benchmark score measures the quality of the Charniak parse tree leaves relative to the Penn
Treebank gold parse trees.
Table 16: Runtime speed and memory consumption comparison between state-of-the-art
systems and our approach (SENNA). We give the runtime in seconds for running both
the POS and SRL taggers on their respective testing sets. Memory usage is reported in
megabytes.
The beam size of the Shen tagger was set to 3 as recommended in the paper. Regardless
of implementation differences, it is clear that our neural networks run considerably faster.
They also require much less memory. Our POS and SRL taggers run in 32MB and 120MB
of RAM, respectively. The Shen and Toutanova taggers slow down significantly when the
of RAM respectively. The Shen and Toutanova taggers slow down significantly when the
Java machine is given less than 2.2GB and 800MB of RAM respectively, while the Koomen
tagger requires at least 3GB of RAM.
We believe that a number of reasons explain the speed advantage of our system. First,
our system only uses rather simple input features and therefore avoids the nonnegligible
computation time associated with complex handcrafted features. Secondly, most network
computations are dense matrix-vector operations. In contrast, systems that rely on a great
number of sparse features experience memory latencies when traversing the sparse data
structures. Finally, our compact implementation is self-contained. Since it does not rely on
the outputs of disparate NLP systems, it does not suffer from communication latency issues.
7. Critical Discussion
Although we believe that this contribution represents a step towards the “NLP from scratch”
objective, we are keenly aware that both our goal and our means can be criticized.
The main criticism of our goal can be summarized as follows. Over the years, the NLP
community has developed a considerable expertise in engineering effective NLP features.
Why should they forget this painfully acquired expertise and instead painfully acquire
the skills required to train large neural networks? As mentioned in our introduction, we
observe that no single NLP task really covers the goals of NLP. Therefore we believe that
task-specific engineering (i.e. that does not generalize to other tasks) is not desirable. But
we also recognize how much our neural networks owe to previous NLP task-specific research.
The main criticism of our means is easier to address. Why did we choose to rely on a
twenty year old technology, namely multilayer neural networks? We were simply attracted
by their ability to discover hidden representations using a stochastic learning algorithm
that scales linearly with the number of examples. Most of the neural network technology
necessary for our work was described ten years ago (e.g. Le Cun et al., 1998). However,
if we had decided ten years ago to train the language model network LM2 using a vintage
computer, training would only be nearing completion today. Training algorithms that scale
linearly are most able to benefit from such tremendous progress in computer hardware.
8. Conclusion
We have presented a multilayer neural network architecture that can handle a number of
NLP tasks with both speed and accuracy. The design of this system was determined by
our desire to avoid task-specific engineering as much as possible. Instead we rely on large
unlabeled datasets and let the training algorithm discover internal representations that
prove useful for all the tasks of interest. Using this strong basis, we have engineered a fast
and efficient “all purpose” NLP tagger that we hope will prove useful to the community.
Acknowledgments
We acknowledge the persistent support of NEC for this research effort. We thank Yoshua
Bengio, Samy Bengio, Eric Cosatto, Vincent Etter, Hans-Peter Graf, Ralph Grishman, and
Vladimir Vapnik for their useful feedback and comments.
Appendix A. Neural Network Gradients

Partitioning the parameters of the network with respect to each layer 1 ≤ l ≤ L, we write:
$$\theta = (\theta_1, \ldots, \theta_l, \ldots, \theta_L)\,.$$
We are now interested in computing the gradients of the cost with respect to each θ_l.
Applying the chain rule (generalized to vectors) we obtain the classical backpropagation
recursion:
$$\frac{\partial C}{\partial \theta_l} = \frac{\partial f_{\theta_l}}{\partial \theta_l}\,\frac{\partial C}{\partial f_{\theta_l}} \qquad (20)$$
$$\frac{\partial C}{\partial f_{\theta_{l-1}}} = \frac{\partial f_{\theta_l}}{\partial f_{\theta_{l-1}}}\,\frac{\partial C}{\partial f_{\theta_l}}\,. \qquad (21)$$
In other words, we first initialize the recursion by computing the gradient of the cost with
respect to the last layer output, ∂C/∂f_{θ_L}. Then each layer l computes the gradient with respect
to its own parameters with (20), given the gradient coming from its output, ∂C/∂f_{θ_l}. To
perform the backpropagation, it also computes the gradient with respect to its own inputs,
as shown in (21). We now derive the gradients for each layer we used in this paper.
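To make the recursion concrete, here is a schematic Python sketch (our own illustration, not the actual SENNA C code) in which each layer implements the two computations (20) and (21); an arbitrary squared-error cost is used at the end purely for illustration.

```python
import numpy as np

class Linear:
    """f_{theta_l} = W x + b, with the gradient computations of (20) and (21)."""

    def __init__(self, n_in, n_out):
        self.W = 0.01 * np.random.randn(n_out, n_in)
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                                # keep the input for the backward pass
        return self.W @ x + self.b

    def backward(self, dC_dout):
        self.dC_dW = np.outer(dC_dout, self.x)    # (20): gradient w.r.t. parameters
        self.dC_db = dC_dout
        return self.W.T @ dC_dout                 # (21): gradient w.r.t. inputs

class HardTanh:
    """Parameter-free layer; only propagates the gradient to its inputs."""

    def forward(self, x):
        self.x = x
        return np.clip(x, -1.0, 1.0)

    def backward(self, dC_dout):
        return dC_dout * ((self.x >= -1.0) & (self.x <= 1.0))

layers = [Linear(5, 3), HardTanh(), Linear(3, 2)]
out = np.random.randn(5)
for layer in layers:                              # forward pass
    out = layer.forward(out)
grad = 2 * (out - np.array([1.0, 0.0]))           # gradient of a squared-error cost
for layer in reversed(layers):                    # backpropagation recursion
    grad = layer.backward(grad)
```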
Lookup Table Layer   Given a matrix of parameters θ_1 = W^1 and word (or discrete
feature) indices [w]_1^T, the layer outputs the matrix:
$$f_{\theta_l}([w]_1^T) = \left( \langle W \rangle^1_{[w]_1} \;\; \langle W \rangle^1_{[w]_2} \;\; \ldots \;\; \langle W \rangle^1_{[w]_T} \right)\,.$$
The gradients with respect to the columns of W accumulate over the positions where each
word occurs:
$$\frac{\partial C}{\partial \langle W \rangle^1_i} = \sum_{\{1 \le t \le T \,/\, [w]_t = i\}} \left\langle \frac{\partial C}{\partial f_{\theta_l}} \right\rangle^1_t\,.$$
This sum equals zero if the index i in the lookup table does not correspond to a word in
the sequence. In this case, the i-th column of W does not need to be updated. As a Lookup
Table Layer is always the first layer, we do not need to compute its gradients with respect
to the inputs.
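A minimal numpy sketch of this sparse update (illustrative only, with made-up dimensions) could look as follows:

```python
import numpy as np

d_wrd, vocab = 4, 10
W = 0.01 * np.random.randn(d_wrd, vocab)   # one column per dictionary word
w_idx = [3, 7, 3]                          # word indices of a short sequence

# Forward: the output matrix stacks the columns selected by the indices.
out = W[:, w_idx]                          # shape (d_wrd, T)

# Backward: given dC/dout, only the columns of W that were actually used
# receive a gradient; repeated words accumulate their contributions.
dC_dout = np.random.randn(*out.shape)
dC_dW = np.zeros_like(W)
for t, i in enumerate(w_idx):
    dC_dW[:, i] += dC_dout[:, t]
```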
Linear Layer   Given parameters θ_l = (W^l, b^l) and an input vector f_{θ_{l-1}}, the output is
given by:
$$f_{\theta_l} = W^l f_{\theta_{l-1}} + b^l\,. \qquad (22)$$
The gradients with respect to the parameters are then obtained with:
$$\frac{\partial C}{\partial W^l} = \frac{\partial C}{\partial f_{\theta_l}} \left[ f_{\theta_{l-1}} \right]^T \quad \text{and} \quad \frac{\partial C}{\partial b^l} = \frac{\partial C}{\partial f_{\theta_l}}\,, \qquad (23)$$
and the gradients with respect to the inputs are computed with:
$$\frac{\partial C}{\partial f_{\theta_{l-1}}} = \left[ W^l \right]^T \frac{\partial C}{\partial f_{\theta_l}}\,. \qquad (24)$$
Convolution Layer   Given an input matrix f_{θ_{l-1}}, a Convolution Layer f_{θ_l}(·) applies a
Linear Layer operation (22) successively on each window ⟨f_{θ_{l-1}}⟩_t^{d_win} (1 ≤ t ≤ T) of size
d_win. Using (23), the gradients of the parameters are thus given by summing over all
windows:
$$\frac{\partial C}{\partial W^l} = \sum_{t=1}^T \left\langle \frac{\partial C}{\partial f_{\theta_l}} \right\rangle_t^1 \left[ \langle f_{\theta_{l-1}} \rangle_t^{d_{win}} \right]^T \quad \text{and} \quad \frac{\partial C}{\partial b^l} = \sum_{t=1}^T \left\langle \frac{\partial C}{\partial f_{\theta_l}} \right\rangle_t^1\,.$$
After initializing the input gradients ∂C/∂f_{θ_{l-1}} to zero, we iterate (24) over all windows for
1 ≤ t ≤ T, leading to the accumulation:
$$\left\langle \frac{\partial C}{\partial f_{\theta_{l-1}}} \right\rangle_t^{d_{win}} \mathrel{+}= \left[ W^l \right]^T \left\langle \frac{\partial C}{\partial f_{\theta_l}} \right\rangle_t^1\,.$$
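A small numpy rendering of this window-wise accumulation (again only a sketch with invented sizes; padding of the input is ignored for brevity) follows:

```python
import numpy as np

d_in, d_out, d_win, T = 3, 2, 3, 6
x = np.random.randn(d_in, T)                 # input matrix f_{theta_{l-1}}
W = 0.01 * np.random.randn(d_out, d_in * d_win)
b = np.zeros(d_out)

# Forward: apply the same linear operation to every window of size d_win.
windows = [x[:, t:t + d_win].reshape(-1) for t in range(T - d_win + 1)]
out = np.stack([W @ w + b for w in windows], axis=1)

# Backward: sum parameter gradients over windows, accumulate input gradients.
dC_dout = np.random.randn(*out.shape)
dC_dW, dC_db, dC_dx = np.zeros_like(W), np.zeros_like(b), np.zeros_like(x)
for t, w in enumerate(windows):
    g = dC_dout[:, t]
    dC_dW += np.outer(g, w)                  # equation (23), summed over t
    dC_db += g
    dC_dx[:, t:t + d_win] += (W.T @ g).reshape(d_in, d_win)  # equation (24)
```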
Max Layer   Given a matrix f_{θ_{l-1}}, this layer outputs a vector whose i-th component is
the largest element of the i-th row, [f_{θ_l}]_i = max_t [⟨f_{θ_{l-1}}⟩_t^1]_i, where a_i stores the index of
the largest value. We only need to compute the gradient with respect to the inputs, as this
layer has no parameters. The gradient is given by
$$\left[ \left\langle \frac{\partial C}{\partial f_{\theta_{l-1}}} \right\rangle_t^1 \right]_i = \begin{cases} \left[ \dfrac{\partial C}{\partial f_{\theta_l}} \right]_i & \text{if } t = a_i \\ 0 & \text{otherwise} \end{cases}\,.$$
HardTanh Layer   Given a vector f_{θ_{l-1}}, and the definition of the HardTanh (5), we get
$$\left[ \frac{\partial C}{\partial f_{\theta_{l-1}}} \right]_i = \begin{cases} 0 & \text{if } [f_{\theta_{l-1}}]_i < -1 \\ \left[ \dfrac{\partial C}{\partial f_{\theta_l}} \right]_i & \text{if } -1 \le [f_{\theta_{l-1}}]_i \le 1 \\ 0 & \text{if } [f_{\theta_{l-1}}]_i > 1 \end{cases}\,.$$
Word-Level Log-Likelihood The network outputs a score [fθ ]i for each tag indexed by
i. Following (11), if y is the true tag for a given example, the stochastic score to minimize
can be written as
$$C(f_\theta) = \operatorname{logadd}_j\, [f_\theta]_j - [f_\theta]_y\,.$$
Considering the definition of the logadd (10), the gradient with respect to f_θ is given by
$$\frac{\partial C}{\partial [f_\theta]_i} = \frac{e^{[f_\theta]_i}}{\sum_k e^{[f_\theta]_k}} - \mathbf{1}_{i=y} \qquad \forall i\,.$$
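Numerically, this gradient is just the softmax of the scores minus a one-hot indicator of the true tag, as in this small sketch:

```python
import numpy as np

scores = np.array([1.2, -0.3, 0.7])   # network scores [f_theta]_i, one per tag
y = 2                                  # index of the true tag

softmax = np.exp(scores - scores.max())
softmax /= softmax.sum()

grad = softmax.copy()                  # e^{[f]_i} / sum_k e^{[f]_k}
grad[y] -= 1.0                         # minus the indicator 1_{i=y}
```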
Sentence-Level Log-Likelihood   Following (13), the cost to minimize for a sentence [x]_1^T
with true tag path [y]_1^T can be written as
$$C(f_\theta, A) = \underbrace{\operatorname{logadd}_{\forall [j]_1^T}\, s([x]_1^T, [j]_1^T, \tilde\theta)}_{C_{\text{logadd}}} - s([x]_1^T, [y]_1^T, \tilde\theta)\,,$$
with
$$s([x]_1^T, [y]_1^T, \tilde\theta) = \sum_{t=1}^T \left( [A]_{[y]_{t-1},\,[y]_t} + [f_\theta]_{[y]_t,\,t} \right)\,.$$
We first initialize all gradients to zero:
$$\frac{\partial C}{\partial [f_\theta]_{i,t}} = 0 \;\; \forall i, t \quad \text{and} \quad \frac{\partial C}{\partial [A]_{i,j}} = 0 \;\; \forall i, j\,.$$
We then accumulate gradients over the second part of the cost, −s([x]_1^T, [y]_1^T, θ̃), which
gives:
$$\frac{\partial C}{\partial [f_\theta]_{[y]_t,\,t}} \mathrel{-}= 1 \quad \text{and} \quad \frac{\partial C}{\partial [A]_{[y]_{t-1},\,[y]_t}} \mathrel{-}= 1 \qquad \forall t\,.$$
We now need to accumulate the gradients over the first part of the cost, that is C_logadd.
We differentiate C_logadd by applying the chain rule through the recursion (14). First we
initialize our recursion with
$$\frac{\partial C_{\text{logadd}}}{\partial \delta_T(i)} = \frac{e^{\delta_T(i)}}{\sum_k e^{\delta_T(k)}} \qquad \forall i\,.$$
We then go down the recursion, computing at each step
$$\frac{\partial C_{\text{logadd}}}{\partial \delta_{t-1}(i)} = \sum_j \frac{\partial C_{\text{logadd}}}{\partial \delta_t(j)}\, \frac{e^{\delta_{t-1}(i) + [A]_{i,j}}}{\sum_k e^{\delta_{t-1}(k) + [A]_{k,j}}}\,,$$
where at each step t of the recursion we accumulate the gradients with respect to the
inputs f_θ and the transition scores [A]_{i,j}:
$$\frac{\partial C}{\partial [f_\theta]_{i,t}} \mathrel{+}= \frac{\partial C_{\text{logadd}}}{\partial \delta_t(i)}\, \frac{\partial \delta_t(i)}{\partial [f_\theta]_{i,t}} = \frac{\partial C_{\text{logadd}}}{\partial \delta_t(i)}\,,$$
$$\frac{\partial C}{\partial [A]_{i,j}} \mathrel{+}= \frac{\partial C_{\text{logadd}}}{\partial \delta_t(j)}\, \frac{\partial \delta_t(j)}{\partial [A]_{i,j}} = \frac{\partial C_{\text{logadd}}}{\partial \delta_t(j)}\, \frac{e^{\delta_{t-1}(i) + [A]_{i,j}}}{\sum_k e^{\delta_{t-1}(k) + [A]_{k,j}}}\,.$$
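The sketch below is our own compact numpy rendering of this procedure (initial transition scores are omitted for brevity; this is not the SENNA implementation): it runs the logadd recursion forward and then accumulates the gradients backward exactly as above.

```python
import numpy as np

def logsumexp0(a):            # log-sum-exp along axis 0
    m = a.max(axis=0)
    return m + np.log(np.exp(a - m).sum(axis=0))

def softmax0(a):              # softmax along axis 0 (also works for 1-D vectors)
    e = np.exp(a - a.max(axis=0))
    return e / e.sum(axis=0)

def sll_gradients(f, A, y):
    """Gradients of the sentence-level log-likelihood cost C = Clogadd - s(y).

    f: (K, T) tag scores [f_theta]_{i,t};  A: (K, K) transition scores [A]_{i,j};
    y: gold tag path of length T.
    """
    K, T = f.shape
    dC_df, dC_dA = np.zeros_like(f), np.zeros_like(A)

    # Second part of the cost, -s([x], [y]): subtract 1 along the gold path.
    for t in range(T):
        dC_df[y[t], t] -= 1.0
        if t > 0:
            dC_dA[y[t - 1], y[t]] -= 1.0

    # Forward recursion (14): delta_t(j) = f[j, t] + logadd_i(delta_{t-1}(i) + A[i, j]).
    delta = np.zeros((T, K))
    delta[0] = f[:, 0]
    for t in range(1, T):
        delta[t] = f[:, t] + logsumexp0(delta[t - 1][:, None] + A)

    # Backward accumulation of the Clogadd part.
    g = softmax0(delta[T - 1])                   # dClogadd / ddelta_T
    for t in range(T - 1, 0, -1):
        dC_df[:, t] += g                         # ddelta_t(i)/df[i,t] = 1
        q = softmax0(delta[t - 1][:, None] + A)  # q[i,j] = ddelta_t(j)/ddelta_{t-1}(i)
        dC_dA += q * g                           # accumulate transition gradients
        g = q @ g                                # dClogadd / ddelta_{t-1}
    dC_df[:, 0] += g
    return dC_df, dC_dA
```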
Ranking Criterion   We use the ranking criterion (18) for training our language model.
In this case, given a "positive" example x and a "negative" example x^(w), we want to
minimize:
$$C\big(f_\theta(x), f_\theta(x^{(w)})\big) = \max\left\{ 0,\; 1 - f_\theta(x) + f_\theta(x^{(w)}) \right\}\,. \qquad (26)$$
Ignoring the non-differentiability of max(0, ·) at zero, the gradient is simply given by:
$$\begin{pmatrix} \dfrac{\partial C}{\partial f_\theta(x)} \\[2mm] \dfrac{\partial C}{\partial f_\theta(x^{(w)})} \end{pmatrix} = \begin{cases} \begin{pmatrix} -1 \\ 1 \end{pmatrix} & \text{if } 1 - f_\theta(x) + f_\theta(x^{(w)}) > 0 \\[2mm] \begin{pmatrix} 0 \\ 0 \end{pmatrix} & \text{otherwise} \end{cases}\,.$$
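In code, the corresponding stochastic update is a few lines (a sketch only; the scoring network itself is abstracted away):

```python
def ranking_gradients(score_pos, score_neg):
    """Gradients of the pairwise ranking hinge (26).

    score_pos: f_theta(x) for the observed text window.
    score_neg: f_theta(x^(w)) for the window with its center word replaced.
    """
    if 1.0 - score_pos + score_neg > 0.0:   # margin violated
        return -1.0, 1.0                    # push the positive score up, the negative down
    return 0.0, 0.0                         # margin satisfied, no update
```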
References
R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple
tasks and unlabeled data. JMLR, 6:1817–1853, November 2005.
R. M. Bell, Y. Koren, and C. Volinsky. The BellKor solution to the Netflix Prize. Technical
report, AT&T Labs, 2007. http://www.research.att.com/~volinsky/netflix.
Y. Bengio and R. Ducharme. A neural probabilistic language model. In NIPS 13, 2001.
ar
Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep
networks. In Advances in Neural Information Processing Systems, NIPS 19, 2007.
Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In International
Conference on Machine Learning, ICML, 2009.
L. Bottou. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nı̂mes
91, Nimes, France, 1991. EC2.
L. Bottou. Online algorithms and stochastic approximations. In David Saad, editor, Online
Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.
L. Bottou and P. Gallinari. A framework for the cooperation of learning algorithms. In
D. Touretzky and R. Lippmann, editors, Advances in Neural Information Processing
Systems, volume 3. Morgan Kaufmann, Denver, 1991.
L. Bottou, Y. LeCun, and Yoshua Bengio. Global training of document processing systems
using graph transformer networks. In Proc. of Computer Vision and Pattern Recognition,
pages 489–493, Puerto-Rico, 1997. IEEE.
C. J. C. Burges, R. Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost
functions. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural
Information Processing Systems 19, pages 193–200. MIT Press, Cambridge, MA, 2007.
R. Caruana. Multitask Learning. Machine Learning, 28(1):41–75, 1997.
T. Cohn and P. Blunsom. Semantic role labelling with tree conditional random fields. In
Ninth Conference on Computational Natural Language (CoNLL), 2005.
M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis,
University of Pennsylvania, 1999.
R. Collobert. Large Scale Machine Learning. PhD thesis, Université Paris VI, 2004.
T. Cover and R. King. A convergent gambling estimate of the entropy of English. IEEE
Transactions on Information Theory, 24(4):413–421, July 1978.
D. Gildea and M. Palmer. The necessity of parsing for predicate argument recognition.
Proceedings of the 40th Annual Meeting of the ACL, pages 239–246, 2002.
A. Haghighi, K. Toutanova, and C. D. Manning. A joint model for semantic role labeling.
In Proceedings of the Ninth Conference on Computational Natural Language Learning
(CoNLL-2005). Association for Computational Linguistics, June 2005.
Z. S. Harris. Mathematical Structures of Language. John Wiley & Sons Inc., 1968.
G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets.
Neural Comp., 18(7):1527–1554, July 2006.
T. Joachims. Transductive inference for text classification using support vector machines.
In ICML, 1999.
T. Kudoh and Y. Matsumoto. Use of support vector learning for chunk identification. In
Proceedings of CoNLL-2000 and LLL-2000, pages 142–144, 2000.
Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient based learning applied to
document recognition. Proceedings of IEEE, 86(11):2278–2324, 1998.
D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text
categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
A. McCallum and Wei Li. Early results for named entity recognition with conditional
random fields, feature induction and web-enhanced lexicons. In Proceedings of the
seventh conference on Natural language learning at HLT-NAACL 2003, pages 188–191.
Association for Computational Linguistics, 2003.
S. Miller, J. Guinness, and A. Zamanian. Name tagging with word clusters and
discriminative training. In Proceedings of HLT-NAACL, pages 337–342, 2004.
A. Mnih and G. E. Hinton. Three new graphical models for statistical language modelling.
In International Conference on Machine Learning, ICML, pages 641–648, 2007.
G. Musillo and P. Merlo. Robust Parsing of the Proposition Bank. ROMAND 2006: Robust
Methods in Analysis of Natural language Data, 2006.
R. M. Neal. Bayesian Learning for Neural Networks. Number 118 in Lecture Notes in
Statistics. Springer-Verlag, New York, 1996.
V. Punyakanok, D. Roth, and W. Yih. The necessity of syntactic parsing for semantic role
labeling. In IJCAI, pages 1117–1123, 2005.
L. Ratinov and D. Roth. Design challenges and misconceptions in named entity recognition.
In Proceedings of the Thirteenth Conference on Computational Natural Language Learning
(CoNLL), pages 147–155. Association for Computational Linguistics, 2009.
A. Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Eric Brill and
Kenneth Church, editors, Proceedings of the Conference on Empirical Methods in Natural
Language Processing, pages 133–142. Association for Computational Linguistics, 1996.
H. Schwenk and J. L. Gauvain. Connectionist language modeling for large vocabulary
continuous speech recognition. In IEEE International Conference on Acoustics, Speech,
and Signal Processing, pages 765–768, 2002.
F. Sha and F. Pereira. Shallow parsing with conditional random fields. In NAACL ’03:
Proceedings of the 2003 Conference of the North American Chapter of the Association for
Computational Linguistics on Human Language Technology, pages 134–141. Association
for Computational Linguistics, 2003.
C. E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal,
30:50–64, 1951.
H. Shen and A. Sarkar. Voting between multiple data representations for text chunking.
Advances in Artificial Intelligence, pages 389–400, 2005.
L. Shen, G. Satta, and A. K. Joshi. Guided learning for bidirectional sequence classification.
In Proceedings of the 45th Annual Meeting of the Association for Computational
Linguistics (ACL), 2007.
S. C. Suddarth and A. D. C. Holden. Symbolic-neural systems and the use of hints for
developing complex systems. International Journal of Man-Machine Studies, 35(3):291–
311, 1991.
C. Sutton and A. McCallum. Joint parsing and semantic role labeling. In Proceedings of
CoNLL-2005, pages 225–228, 2005a.
C. Sutton and A. McCallum. Composition of conditional random fields for transfer learning.
Proceedings of the conference on Human Language Technology and Empirical Methods in
Natural Language Processing, pages 748–754, 2005b.
J. Suzuki and H. Isozaki. Semi-supervised sequential labeling and segmentation using giga-
word scale unlabeled data. In Proceedings of ACL-08: HLT, pages 665–673, Columbus,
Ohio, June 2008. Association for Computational Linguistics.
W. J. Teahan and J. G. Cleary. The entropy of English using PPM-based models. In Data
Compression Conference (DCC'96), pages 53–62. IEEE Computer Society Press, 1996.
Review of differential calculus theory
Author: Guillaume Genthial
Winter 2017
1 Introduction
We use derivatives all the time, but we forget what they mean. In general, we have in mind that for a function f : R → R, we have something like
$$f(x + h) - f(x) \approx f'(x)\, h\,.$$
Several notations are commonly used for this object: f'(x), df/dx, ∂f/∂x, ∇_x f. However, these notations refer to different mathematical objects, and the confusion can lead to mistakes. This paper recalls some notions about these objects.
(Note: given two vectors a and b, the scalar-product is ⟨a|b⟩ = Σ_{i=1}^n a_i b_i and the dot-product is a^T · b = ⟨a|b⟩ = Σ_{i=1}^n a_i b_i.)
2 Theory for f : R^n → R

2.1 Differential

Formal definition
Let's consider a function f : R^n → R defined on R^n with the scalar product ⟨·|·⟩. We suppose that this function is differentiable, which means that for x ∈ R^n (fixed) and a small variation h (which can change) we can write:
$$f(x + h) = f(x) + d_x f(h) + o_{h \to 0}(h) \qquad (1)$$
(Note: d_x f is a linear form R^n → R, called the differential of f in x; it is the best linear approximation of the function f.)
Example
Let f : R² → R such that f(x₁, x₂) = 3x₁ + x₂². Let's pick (a, b) ∈ R² and h = (h₁, h₂) ∈ R². We have
$$f(a + h_1,\, b + h_2) = 3(a + h_1) + (b + h_2)^2 = 3a + 3h_1 + b^2 + 2bh_2 + h_2^2 = f(a, b) + 3h_1 + 2bh_2 + o(h)\,,$$
since h₂² = h₂ · h₂ = o_{h→0}(h). Then
$$d_{(a,b)} f(h_1, h_2) = 3h_1 + 2bh_2\,.$$
2.2 Gradient

One can show that there exists a vector u ∈ R^n such that the linear form d_x f can be written as a scalar product with u:
$$d_x f(h) = \langle u | h \rangle\,.$$
This vector is called the gradient of f in x, and we write ∇_x f := u. Then, as a conclusion, we can rewrite equation (1) as
$$f(x + h) = f(x) + d_x f(h) + o_{h \to 0}(h) \qquad (2)$$
$$f(x + h) = f(x) + \langle \nabla_x f | h \rangle + o_{h \to 0}(h)\,. \qquad (3)$$
(Note: gradients and differentials of a function are conceptually very different. The gradient is a vector, while the differential is a function.)
Example
Same example as before, f : R² → R such that f(x₁, x₂) = 3x₁ + x₂². We showed that
$$d_{(a,b)} f(h_1, h_2) = 3h_1 + 2bh_2\,.$$
We can rewrite this as
$$d_{(a,b)} f(h) = \left\langle \begin{pmatrix} 3 \\ 2b \end{pmatrix} \,\Big|\, \begin{pmatrix} h_1 \\ h_2 \end{pmatrix} \right\rangle\,,$$
and thus our gradient is
$$\nabla_{(a,b)} f = \begin{pmatrix} 3 \\ 2b \end{pmatrix}\,.$$
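A quick numerical sanity check (in Python, at an arbitrarily chosen point) confirms this gradient by comparing it with finite differences:

```python
import numpy as np

def f(x):
    return 3 * x[0] + x[1] ** 2

def grad_f(x):
    return np.array([3.0, 2 * x[1]])        # the gradient (3, 2b) derived above

x = np.array([1.0, 2.0])                     # the point (a, b)
eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(2)                       # central differences along each axis
])

print(numeric, grad_f(x))                    # both close to [3., 4.]
```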
2.3 Partial derivatives

Another way of looking at the variation of f is to move along one coordinate axis at a time. The partial derivative of f with respect to the i-th component, evaluated in x, is defined as
$$\frac{\partial f}{\partial x_i}(x) = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_n)}{h}\,.$$
(Note: depending on the context, most people omit the (x) evaluation and just write ∂f/∂x ∈ R instead of ∂f/∂x (x).)
Example
Same example as before, f : R² → R such that f(x₁, x₂) = 3x₁ + x₂². Let's write
$$\frac{\partial f}{\partial x_1}(a, b) = \lim_{h \to 0} \frac{f(a + h,\, b) - f(a,\, b)}{h} = \lim_{h \to 0} \frac{3(a + h) + b^2 - (3a + b^2)}{h} = \lim_{h \to 0} \frac{3h}{h} = 3\,,$$
where ∂f/∂x_i (x) denotes the partial derivative of f with respect to the i-th component, evaluated in x.
Example
We showed that
$$\frac{\partial f}{\partial x_1}(a, b) = 3 \qquad \text{and} \qquad \frac{\partial f}{\partial x_2}(a, b) = 2b\,,$$
and that
$$\nabla_{(a,b)} f = \begin{pmatrix} 3 \\ 2b \end{pmatrix}\,,$$
and then we verify that
$$\nabla_{(a,b)} f = \begin{pmatrix} \frac{\partial f}{\partial x_1}(a, b) \\[1mm] \frac{\partial f}{\partial x_2}(a, b) \end{pmatrix}\,.$$
3 Summary

Formal definition
For a function f : R^n → R, we have defined the following objects, which can be summarized in the following equation (recall that a^T · b = ⟨a|b⟩ = Σ_{i=1}^n a_i b_i):
$$f(x + h) = f(x) + d_x f(h) + o_{h \to 0}(h) \qquad \text{(differential)}$$
$$f(x + h) = f(x) + \langle \nabla_x f \,|\, h \rangle + o_{h \to 0}(h) \qquad \text{(gradient)}$$
$$f(x + h) = f(x) + \Big\langle \frac{\partial f}{\partial x}(x) \,\Big|\, h \Big\rangle + o_{h \to 0}(h)$$
$$f(x + h) = f(x) + \Big\langle \big( \tfrac{\partial f}{\partial x_1}(x), \ldots, \tfrac{\partial f}{\partial x_n}(x) \big)^T \,\Big|\, h \Big\rangle + o_{h \to 0}(h) \qquad \text{(partial derivatives)}$$
Remark
Let's consider x : R → R such that x(u) = u for all u. Then we can easily check that d_u x(h) = h. As this differential does not depend on u, we may simply write dx. That's why the following expression has some meaning,
$$d_x f(\cdot) = \frac{\partial f}{\partial x}(x)\, dx(\cdot)\,,$$
because
$$d_x f(h) = \frac{\partial f}{\partial x}(x)\, dx(h) = \frac{\partial f}{\partial x}(x)\, h\,.$$
In higher dimension, we write
$$d_x f = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(x)\, dx_i\,.$$
(Note: the dx that we use refers to the differential of u ↦ u, the identity mapping!)
4 Jacobian: Generalization to f : R^n → R^m

For a function
$$f : \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \mapsto \begin{pmatrix} f_1(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{pmatrix}\,,$$
we can apply the previous section to each f_i(x):
$$f_i(x + h) = f_i(x) + d_x f_i(h) + o_{h \to 0}(h) = f_i(x) + \langle \nabla_x f_i | h \rangle + o_{h \to 0}(h) = f_i(x) + \Big\langle \big( \tfrac{\partial f_i}{\partial x_1}(x), \ldots, \tfrac{\partial f_i}{\partial x_n}(x) \big)^T \,\Big|\, h \Big\rangle + o_{h \to 0}(h)\,.$$
Stacking these equations for i = 1, ..., m gives
$$f\begin{pmatrix} x_1 + h_1 \\ \vdots \\ x_n + h_n \end{pmatrix} = f\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(x) & \cdots & \frac{\partial f_1}{\partial x_n}(x) \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1}(x) & \cdots & \frac{\partial f_m}{\partial x_n}(x) \end{pmatrix} \cdot h + o(h) = f(x) + J(x) \cdot h + o(h)\,,$$
where J(x) is the Jacobian matrix of f in x.
Example
Let g : R³ → R² such that
$$g\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} y_1 + 2y_2 + 3y_3 \\ y_1 y_2 y_3 \end{pmatrix}\,.$$
Then the Jacobian of g is
$$J_g(y) = \begin{pmatrix} \frac{\partial (y_1 + 2y_2 + 3y_3)}{\partial y_1}(y) & \frac{\partial (y_1 + 2y_2 + 3y_3)}{\partial y_2}(y) & \frac{\partial (y_1 + 2y_2 + 3y_3)}{\partial y_3}(y) \\[1mm] \frac{\partial (y_1 y_2 y_3)}{\partial y_1}(y) & \frac{\partial (y_1 y_2 y_3)}{\partial y_2}(y) & \frac{\partial (y_1 y_2 y_3)}{\partial y_3}(y) \end{pmatrix} = \begin{pmatrix} 1 & 2 & 3 \\ y_2 y_3 & y_1 y_3 & y_1 y_2 \end{pmatrix}\,.$$
5 Generalization to f : R^{n×p} → R

A function f that takes a matrix X ∈ R^{n×p} as input can always be seen as a function of the flattened vector x ∈ R^{np}, so that the previous sections apply and give
$$f(x + h) = f(x) + \langle \nabla_x f | h \rangle + o(h)\,, \qquad \text{where } \nabla_x f = \Big( \frac{\partial f}{\partial x_1}(x), \ldots, \frac{\partial f}{\partial x_{np}}(x) \Big)^T\,.$$
Now, we would like to give some meaning to the following equation:
$$f(X + H) = f(X) + \langle \nabla_X f | H \rangle + o(H)\,.$$
(Note: the gradient of f with respect to a matrix X is a matrix of the same shape as X.)
With the definition
$$(\nabla_X f)_{ij} = \frac{\partial f}{\partial X_{ij}}(X)\,,$$
one can check that these two terms are equivalent:
$$\langle \nabla_x f | h \rangle = \langle \nabla_X f | H \rangle\,, \qquad \text{that is} \qquad \sum_{i=1}^{np} \frac{\partial f}{\partial x_i}(x)\, h_i = \sum_{i,j} \frac{\partial f}{\partial X_{ij}}(X)\, H_{ij}\,.$$
6 Generalization to f : R^{n×p} → R^m

Applying the same idea as before (let's generalize the generalization of the previous section), we can write
$$f(x + h) = f(x) + J(x) \cdot h + o(h)\,,$$
where J is a tensor with
$$J_{ijk}(x) = \frac{\partial f_i}{\partial X_{jk}}(x)\,.$$
Writing the 2d-dot product δ = J(x) · h ∈ R^m means that the i-th component of δ is
$$\delta_i = \sum_{j=1}^n \sum_{k=1}^p \frac{\partial f_i}{\partial X_{jk}}(x)\, h_{jk}\,.$$
(You can apply the same idea to any dimensions!)
7 Chain-rule

Formal definition
Now let's consider f : R^n → R^m and g : R^p → R^n. We want to compute the differential of the composition h = f ∘ g such that h : x ↦ u = g(x) ↦ f(g(x)) = f(u), or d_x(f ∘ g). It can be shown that the differential is the composition of the differentials:
$$d_x(f \circ g) = d_{g(x)} f \circ d_x g\,.$$
As the composition of linear maps corresponds to the product of their matrices, the Jacobian of h is the product of the Jacobians:
$$J_h(x)_{ij} = \sum_{k=1}^n J_f(g(x))_{ik} \cdot J_g(x)_{kj}\,.$$
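The following Python snippet (using the example functions of this note, at an arbitrary test point) checks the Jacobian product against finite differences:

```python
import numpy as np

def g(y):   # R^3 -> R^2
    return np.array([y[0] + 2 * y[1] + 3 * y[2], y[0] * y[1] * y[2]])

def f(x):   # R^2 -> R
    return 3 * x[0] + x[1] ** 2

def J_g(y):                      # analytic Jacobian of g
    return np.array([[1.0, 2.0, 3.0],
                     [y[1] * y[2], y[0] * y[2], y[0] * y[1]]])

def J_f(x):                      # Jacobian (here a row vector) of f
    return np.array([[3.0, 2 * x[1]]])

y = np.array([1.0, 2.0, 3.0])
chain = J_f(g(y)) @ J_g(y)       # chain rule: J_h(y) = J_f(g(y)) . J_g(y)

eps = 1e-6
numeric = np.array([(f(g(y + eps * e)) - f(g(y - eps * e))) / (2 * eps)
                    for e in np.eye(3)])

print(chain.ravel(), numeric)    # both close to [75., 42., 33.]
```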
Example
Let's keep our example function f : (x₁, x₂) ↦ 3x₁ + x₂² and our function g : (y₁, y₂, y₃) ↦ (y₁ + 2y₂ + 3y₃, y₁y₂y₃). The composition of f and g is h = f ∘ g : R³ → R, with
$$h\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = f\begin{pmatrix} y_1 + 2y_2 + 3y_3 \\ y_1 y_2 y_3 \end{pmatrix} = 3(y_1 + 2y_2 + 3y_3) + (y_1 y_2 y_3)^2\,,$$
whose partial derivatives are
$$\frac{\partial h}{\partial y_1}(y) = 3 + 2 y_1 y_2^2 y_3^2 \qquad \frac{\partial h}{\partial y_2}(y) = 6 + 2 y_2 y_1^2 y_3^2 \qquad \frac{\partial h}{\partial y_3}(y) = 9 + 2 y_3 y_1^2 y_2^2\,.$$
Let's now use the chain rule to compute these derivatives. For a function into R, the Jacobian is the transpose of the gradient, ∇_x f^T = J_f(x), so
$$J_f(x) = \nabla_x f^T = \begin{pmatrix} 3 & 2x_2 \end{pmatrix} \qquad \text{and} \qquad J_g(y) = \begin{pmatrix} 1 & 2 & 3 \\ y_2 y_3 & y_1 y_3 & y_1 y_2 \end{pmatrix}\,.$$
The chain rule gives
$$J_h(y) = J_f(g(y)) \cdot J_g(y) = \begin{pmatrix} 3 & 2 y_1 y_2 y_3 \end{pmatrix} \begin{pmatrix} 1 & 2 & 3 \\ y_2 y_3 & y_1 y_3 & y_1 y_2 \end{pmatrix} = \begin{pmatrix} 3 + 2 y_1 y_2^2 y_3^2 & 6 + 2 y_2 y_1^2 y_3^2 & 9 + 2 y_3 y_1^2 y_2^2 \end{pmatrix}\,,$$
and taking the transpose we find the same gradient that we computed before!
Important remark
• Note that the chain rule gives us a way to compute the Jacobian and not the gradient. However, we showed that in the case of a function f : R^n → R, the Jacobian and the gradient are directly identifiable, because ∇_x f^T = J(x). Thus, if we want to compute the gradient of a function by using the chain-rule, the best way to do it is to compute the Jacobian.
• As the gradient must have the same shape as the variable against which we derive, and the notation ∂·/∂· is often ambiguous and can refer to either the gradient or the Jacobian, it is safer to keep track of which object (and which shape) is being computed at each step.