where $f_{rnn}$ is the RNN (GRU or LSTM) function. In the first layer, $h_i^0 = f_{rnn}(E_{x_i}, h_{i-1}^0)$.
In addition to the connection between the encoder and decoder via attention, the initial state of the decoder is usually initialized with the average of the hidden states or the last hidden state of the encoder.

3.1.2 CNN-based NMT
CNNs are hierarchical networks, in that convolution layers capture local correlations. The local context size depends on the size of the kernel and the number of layers. In order to keep the output the same length as the input, CNN models add padding symbols to input sequences. Given an L-layer CNN with a kernel size k, the largest context size is $L(k-1)$. For any two tokens in a local context with a distance of n, the path between them is only $\lceil n/(k-1) \rceil$.
As Figure 1 (b) shows, a 2-layer CNN with kernel size 3 “sees” an effective local context of 5 tokens. The path between the first token and the fifth token is only 2 convolutions.1 Since CNNs do not have a means to infer the position of elements in a sequence, positional embeddings are introduced.

3.1.3 Transformer-based NMT
Transformers rely heavily on self-attention networks. Each token is connected to any other token in the same sentence directly via self-attention. Moreover, Transformers feature attention networks with multiple attention heads. Multi-head attention is more fine-grained, compared to conventional 1-head attention mechanisms. Figure 1 (c) illustrates that any two tokens are connected directly: the path length between the first and the fifth tokens is 1. Similar to CNNs, positional information is also preserved in positional embeddings.
The hidden state in the Transformer encoder is calculated from all hidden states of the previous layer. The hidden state $h_i^l$ in a self-attention network is computed as in Equation 3.

$h_i^l = h_i^{l-1} + f(\text{self-attention}(h_i^{l-1}))$ (3)

where f represents a feedforward network with ReLU as the activation function and layer normalization. In the input layer, $h_i^0 = E_{x_i} + e_{pos,i}$. The decoder additionally has a multi-head attention over the encoder hidden states.
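The context-size and path-length formulas from Sections 3.1.2 and 3.1.3 can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical and not taken from any toolkit, and $L(k-1)$ is read here as the largest distance spanned, so that a 2-layer CNN with kernel size 3 covers 4 + 1 = 5 tokens, as in Figure 1 (b).

```python
def cnn_max_context(num_layers: int, kernel_size: int) -> int:
    """Largest context size L(k - 1) of an L-layer CNN with kernel size k."""
    return num_layers * (kernel_size - 1)


def cnn_path_length(distance: int, kernel_size: int) -> int:
    """Convolutions needed to connect two tokens `distance` apart: ceil(n / (k - 1))."""
    if kernel_size < 2:
        raise ValueError("kernel_size must be at least 2")
    # Integer ceiling division avoids floating-point rounding.
    return -(-distance // (kernel_size - 1))


def self_attention_path_length(distance: int) -> int:
    """Self-attention links any two distinct tokens directly, so the path is 1."""
    return 1 if distance > 0 else 0


if __name__ == "__main__":
    # The first and the fifth token are 4 positions apart.
    print(cnn_max_context(2, 3))          # 4
    print(cnn_path_length(4, 3))          # 2
    print(self_attention_path_length(4))  # 1
```

For the example in the text, the CNN needs 2 convolutions to connect the first and fifth tokens, while self-attention connects them in a single step.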
2016). We employ the model that has the best per-

4.2 Overall Results
Table 2 reports the BLEU scores on newstest2014 and newstest2017, the perplexity on the validation set, and the accuracy on long-range dependencies.3 Transformer achieves the highest accuracy on this task and the highest BLEU scores on both newstest2014 and newstest2017. Compared to RNNS2S, ConvS2S has slightly better results regarding BLEU scores, but a much lower accuracy on long-range dependencies. The RNN-bideep model achieves distinctly better BLEU scores and a higher accuracy on long-range dependencies.

http://www.statmt.org/wmt17/
3 We report average accuracy on instances where the distance between subject and verb is longer than 10 words.

Figure 2: Accuracy of different NMT models on the subject-verb agreement task.

4.2.1 CNNs
Theoretically, the performance of CNNs will drop when the distance between the subject and the verb exceeds the local context size. However, ConvS2S is also clearly worse than RNNS2S for subject-verb agreement within the local context size.
In order to explore how the ability of ConvS2S to capture long-range dependencies depends on the local context size, we train additional systems,
varying the number of layers and kernel size. Table 3 shows the performance of different ConvS2S models. Figure 3 displays the performance of two 8-layer CNNs with kernel size 3 and 7, a 6-layer CNN with kernel size 3, and RNNS2S. The results indicate that the accuracy increases when the local context size becomes larger, but the BLEU score does not. Moreover, ConvS2S is still not as good as RNNS2S for subject-verb agreement.

Layer  K  Ctx  2014  2017  Acc(%)
4      3    4  22.9  24.2  81.1
6      3    6  23.6  25.0  82.5
8      3    8  23.9  25.2  84.9
8      5   16  23.5  24.7  89.7
8      7   24  23.3  24.6  91.3

Table 3: The performance of ConvS2S with different settings. K means the kernel size. The Ctx column is the theoretical largest local context size in the masked decoder.

Figure 3: Results of ConvS2S models and the RNNS2S model at different distances. (Plotted systems: RNNS2S, ConvS2S-d8k7, ConvS2S-d8k3, ConvS2S-d6k3.)

Regarding the explanation for the poor perfor-

of BLEU, which only measures on the level of n-grams, but it may also indicate that there are other trade-offs between the modeling of different phenomena depending on hyperparameters. If we aim to get better performance on long-range dependencies, we can take this into account when optimizing hyperparameters.

4.2.2 RNNs vs. Transformer
Even though Transformer achieves much better BLEU scores than RNNS2S and RNN-bideep, the accuracies of these architectures on long-range dependencies are close to each other in Figure 2.
Our experimental result contrasts with the result from Tran et al. (2018). They find that Transformers perform worse than LSTMs on the subject-verb agreement task, especially when the distance between the subject and the verb becomes longer. We perform several experiments to analyze this discrepancy with Tran et al. (2018).
A first hypothesis is that this is caused by the amount of training data, since we used much larger datasets than Tran et al. (2018). We retrain all the models with a small amount of training data similar to the amount used by Tran et al. (2018), about 135K sentence pairs. The other training settings are the same as in Section 4.1. We do not see the expected degradation of Transformer-s, compared to RNNS2S-s (see Figure 4). In Table 4, the performance of RNNS2S-s and Transformer-s is similar, including the BLEU scores on newstest2014, newstest2017, the perplexity on the validation set, and the accuracy on the long-range dependencies.
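The Ctx column of Table 3 appears consistent with $L(k-1)/2$, that is, half of the $L(k-1)$ context size given in Section 3.1.2, presumably because a masked decoder convolution only sees preceding tokens. This reading is an inference from the table values rather than something the text states, and the helper name below is hypothetical. A minimal sketch that checks it against the five rows:

```python
# (layers, kernel size, Ctx) triples copied from Table 3.
TABLE_3_ROWS = [
    (4, 3, 4),
    (6, 3, 6),
    (8, 3, 8),
    (8, 5, 16),
    (8, 7, 24),
]


def masked_decoder_context(num_layers: int, kernel_size: int) -> int:
    """Assumed masked-decoder context size: half of L(k - 1)."""
    return num_layers * (kernel_size - 1) // 2


def check_table_3() -> bool:
    """Return True if the assumed formula reproduces every Ctx value."""
    return all(
        masked_decoder_context(layers, k) == ctx
        for layers, k, ctx in TABLE_3_ROWS
    )


if __name__ == "__main__":
    print(check_table_3())  # True
```

Under this assumption, only the two 8-layer models with kernel sizes 5 and 7 have a decoder context longer than 10 tokens, which is the distance threshold used for the long-range accuracy figures.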
Table 5: The results of different architectures on newstest sets and ContraWSD. PPL is the perplexity on the
validation set. Acc means accuracy on the test set.
In addition, we also compare to the best result reported for DE→EN, achieved by uedin-wmt17 (Sennrich et al., 2017), which is an ensemble of 4 different models and reranked with right-to-left models.6 uedin-wmt17 is based on the bi-deep RNNs (Miceli Barone et al., 2017) that we mentioned before. To the original 5.9 million sentence pairs in the training set, they add 10 million synthetic pairs with back-translation.

5.2 Overall Results
Table 5 gives the performance of all the architectures, including the perplexity on validation sets, the BLEU scores on newstest, and the accuracy on ContraWSD. Transformers distinctly outperform RNNS2S and ConvS2S models on DE→EN and DE→FR. Moreover, the Transformer model on DE→EN also achieves higher accuracy than uedin-wmt17, although the BLEU score on newstest2017 is 1.4 lower than uedin-wmt17. We attribute this discrepancy between BLEU and WSD performance to the use of synthetic news training data in uedin-wmt17, which causes a large boost in BLEU due to better domain adaptation to newstest, but which is less helpful for ContraWSD, whose test set is drawn from a variety of domains.
For DE→EN, RNNS2S and ConvS2S have the same BLEU score on newstest2014, while ConvS2S has a higher score on newstest2017. However, the WSD accuracy of ConvS2S is 1.7% lower than RNNS2S. For DE→FR, ConvS2S achieves slightly better results on both BLEU scores and accuracy than RNNS2S.
The Transformer model strongly outperforms the other architectures on this WSD task, with a gap of 4–8 percentage points. This affirms our hypothesis that Transformers are strong semantic feature extractors.

6 https://github.com/a-rios/ContraWSD/tree/master/baselines

5.3 Hybrid Encoder-Decoder Model
In recent work, Chen et al. (2018) find that hybrid architectures with a Transformer encoder and an RNN decoder can outperform a pure Transformer model. They speculate that the Transformer encoder is better at encoding or extracting features than the RNN encoder, whereas the RNN is better at conditional language modeling.
For WSD, it is unclear whether the most important component is the encoder, the decoder, or both. Following the hypothesis that Transformer encoders excel as semantic feature extractors, we train a hybrid encoder-decoder model (TransRNN) with a Transformer encoder and an RNN decoder.
The results (in Table 5) show that TransRNN performs better than RNNS2S, but worse than the pure Transformer, both in terms of BLEU and WSD accuracy. This indicates that WSD is not only done in the encoder, but that the decoder also affects WSD performance. We note that Chen et al. (2018) and Domhan (2018) introduce the techniques in Transformers into RNN-based models, with reportedly higher BLEU. Thus, it would be interesting to see if the same result holds true with their architectures.

6 Conclusion
In this paper, we evaluate three popular NMT architectures, RNNS2S, ConvS2S, and Transformers, on subject-verb agreement and WSD by scoring contrastive translation pairs.
We test the theoretical claims that shorter path lengths make models better capture long-range dependencies. Our experimental results show that:

• There is no evidence that CNNs and Transformers, which have shorter paths through networks, are empirically superior to RNNs in modeling subject-verb agreement over long distances.
