where f_rnn is the RNN (GRU or LSTM) function. In the first layer, h^0_i = f_rnn(E x_i, h^0_{i-1}).

In addition to the connection between the encoder and decoder via attention, the initial state of the decoder is usually initialized with the average of the hidden states or the last hidden state of the encoder.

3.1.2 CNN-based NMT

CNNs are hierarchical networks, in that convolution layers capture local correlations. The local context size depends on the size of the kernel and the number of layers. In order to keep the output the same length as the input, CNN models add padding symbols to input sequences. Given an L-layer CNN with a kernel size k, the largest context size is L(k - 1). For any two tokens in a local context with a distance of n, the path between them is only ⌈n/(k - 1)⌉.

As Figure 1 (b) shows, a 2-layer CNN with kernel size 3 "sees" an effective local context of 5 tokens. The path between the first token and the fifth token is only 2 convolutions.1 Since CNNs do not have a means to infer the position of elements in a sequence, positional embeddings are introduced.

3.1.3 Transformer-based NMT

Transformers rely heavily on self-attention networks. Each token is connected to any other token in the same sentence directly via self-attention. Moreover, Transformers feature attention networks with multiple attention heads. Multi-head attention is more fine-grained, compared to conventional 1-head attention mechanisms. Figure 1 (c) illustrates that any two tokens are connected directly: the path length between the first and the fifth tokens is 1. Similar to CNNs, positional information is also preserved in positional embeddings.

The hidden state in the Transformer encoder is calculated from all hidden states of the previous layer. The hidden state h^l_i in a self-attention network is computed as in Equation 3:

    h^l_i = h^{l-1}_i + f(self-attention(h^{l-1}_i))    (3)

where f represents a feedforward network with ReLU as the activation function and layer normalization. In the input layer, h^0_i = E x_i + e_{pos,i}. The decoder additionally has a multi-head attention over the encoder hidden states.
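To make Equation 3 concrete, the following NumPy sketch computes one encoder layer for a toy sequence. It illustrates the formula only and is not the implementation used in the experiments: it uses a single attention head, random untrained weights, and placeholder parameter names (w_q, w_k, w_v, w_1, w_2).

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(h, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over all positions.
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ v

def feed_forward(x, w_1, w_2):
    # Position-wise feedforward network with ReLU, followed by layer
    # normalization; this plays the role of f in Equation 3.
    return layer_norm(np.maximum(0.0, x @ w_1) @ w_2)

def encoder_layer(h, p):
    # Equation 3: h^l = h^{l-1} + f(self-attention(h^{l-1}))
    attended = self_attention(h, p["w_q"], p["w_k"], p["w_v"])
    return h + feed_forward(attended, p["w_1"], p["w_2"])

# Toy example: 5 tokens, model dimension 8, random (untrained) parameters.
rng = np.random.default_rng(0)
d = 8
h0 = rng.normal(size=(5, d))   # stands in for h^0_i = E x_i + e_pos,i
params = {name: rng.normal(size=(d, d)) for name in ("w_q", "w_k", "w_v", "w_1", "w_2")}
print(encoder_layer(h0, params).shape)   # (5, 8)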
2016). We employ the model that has the best per- [...]

4.2 Overall Results

Table 2 reports the BLEU scores on newstest2014 and newstest2017, the perplexity on the validation set, and the accuracy on long-range dependencies.3 Transformer achieves the highest accuracy on this task and the highest BLEU scores on both newstest2014 and newstest2017. Compared to RNNS2S, ConvS2S has slightly better results regarding BLEU scores, but a much lower accuracy on long-range dependencies. The RNN-bideep model achieves distinctly better BLEU scores and a higher accuracy on long-range dependencies.

2 http://www.statmt.org/wmt17/translation-task.html
3 We report average accuracy on instances where the distance between subject and verb is longer than 10 words.

[Figure 2: Accuracy of different NMT models on the subject-verb agreement task, plotted against the distance between subject and verb (1 to >15).]

4.2.1 CNNs

Theoretically, the performance of CNNs will drop when the distance between the subject and the verb exceeds the local context size. However, ConvS2S is also clearly worse than RNNS2S for subject-verb agreement within the local context size.
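As a back-of-the-envelope illustration of the formulas from Section 3.1.2, the helper below computes the largest context size L(k - 1) and the path length ⌈n/(k - 1)⌉; the function names are ours, and note that the Ctx column in Table 3 below refers to the masked decoder, which has a smaller theoretical context than the encoder-side value computed here.

import math

def largest_context(num_layers: int, kernel_size: int) -> int:
    # Largest local context size of an L-layer CNN with kernel size k: L * (k - 1).
    return num_layers * (kernel_size - 1)

def cnn_path_length(distance: int, kernel_size: int) -> int:
    # Path length between two tokens at distance n within the local context:
    # ceil(n / (k - 1)).
    return math.ceil(distance / (kernel_size - 1))

# Example from Figure 1 (b): a 2-layer CNN with kernel size 3 covers
# 2 * (3 - 1) = 4 neighbouring tokens, i.e. an effective window of 5 tokens,
# and the path between the first and fifth token (distance 4) is
# ceil(4 / 2) = 2 convolutions, matching the text.
print(largest_context(2, 3))     # 4
print(cnn_path_length(4, 3))     # 2

# For comparison: in an RNN the path between two tokens at distance n is n steps,
# while self-attention connects any two tokens with a path of length 1.
rnn_path = lambda distance: distance
transformer_path = lambda distance: 1
print(rnn_path(4), transformer_path(4))   # 4 1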
In order to explore how the ability of ConvS2S to capture long-range dependencies depends on the local context size, we train additional systems, varying the number of layers and kernel size. Table 3 shows the performance of different ConvS2S models. Figure 3 displays the performance of two 8-layer CNNs with kernel size 3 and 7, a 6-layer CNN with kernel size 3, and RNNS2S. The results indicate that the accuracy increases when the local context size becomes larger, but the BLEU score does not. Moreover, ConvS2S is still not as good as RNNS2S for subject-verb agreement.

Layers  K  Ctx  2014  2017  Acc (%)
4       3    4  22.9  24.2  81.1
6       3    6  23.6  25.0  82.5
8       3    8  23.9  25.2  84.9
8       5   16  23.5  24.7  89.7
8       7   24  23.3  24.6  91.3

Table 3: The performance of ConvS2S with different settings. K means the kernel size. The Ctx column is the theoretical largest local context size in the masked decoder.

[Figure 3: Results of ConvS2S models (ConvS2S-d8k7, ConvS2S-d8k3, ConvS2S-d6k3) and the RNNS2S model at different distances (accuracy plotted against subject-verb distance, 1 to >15).]

Regarding the explanation for the poor performance [...] of BLEU, which only measures on the level of n-grams, but it may also indicate that there are other trade-offs between the modeling of different phenomena depending on hyperparameters. If we aim to get better performance on long-range dependencies, we can take this into account when optimizing hyperparameters.

4.2.2 RNNs vs. Transformer

Even though Transformer achieves much better BLEU scores than RNNS2S and RNN-bideep, the accuracies of these architectures on long-range dependencies are close to each other in Figure 2. Our experimental result contrasts with the result from Tran et al. (2018). They find that Transformers perform worse than LSTMs on the subject-verb agreement task, especially when the distance between the subject and the verb becomes longer. We perform several experiments to analyze this discrepancy with Tran et al. (2018).

A first hypothesis is that this is caused by the amount of training data, since we used much larger datasets than Tran et al. (2018). We retrain all the models with a small amount of training data similar to the amount used by Tran et al. (2018), about 135K sentence pairs. The other training settings are the same as in Section 4.1. We do not see the expected degradation of Transformer-s, compared to RNNS2S-s (see Figure 4). In Table 4, the performance of RNNS2S-s and Transformer-s is similar, including the BLEU scores on newstest2014, newstest2017, the perplexity on the validation set, and the accuracy on the long-range dependencies.
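For readers unfamiliar with the evaluation protocol, the accuracies above come from scoring contrastive translation pairs (cf. Sennrich, 2017): a model is counted as correct if it assigns a higher score to the reference translation than to every contrastive variant, and results are binned by the distance between subject and verb as in Figures 2 and 3. The sketch below shows this bookkeeping under assumed placeholder names (score_fn, and a simple dict format for instances); it is not the evaluation code used in the paper.

from collections import defaultdict

def contrastive_accuracy_by_distance(instances, score_fn):
    # Count an instance as correct if the reference translation outscores every
    # contrastive variant; aggregate per subject-verb distance bin (">15" pools
    # all longer distances, as in the figures).
    correct, total = defaultdict(int), defaultdict(int)
    for inst in instances:
        bin_key = inst["distance"] if inst["distance"] <= 15 else ">15"
        ref_score = score_fn(inst["source"], inst["reference"])
        if all(ref_score > score_fn(inst["source"], c) for c in inst["contrastive"]):
            correct[bin_key] += 1
        total[bin_key] += 1
    return {k: correct[k] / total[k] for k in total}

def long_range_accuracy(instances, score_fn):
    # Average accuracy over instances whose subject-verb distance exceeds 10 words,
    # matching the aggregate long-range accuracy reported in the overall results.
    subset = [i for i in instances if i["distance"] > 10]
    hits = sum(
        all(score_fn(i["source"], i["reference"]) > score_fn(i["source"], c)
            for c in i["contrastive"])
        for i in subset
    )
    return hits / len(subset)

# Toy usage with a dummy scorer that simply prefers hypotheses containing "are".
toy = [{"source": "die Katzen, die ..., schlafen",
        "reference": "the cats, which ..., are asleep",
        "contrastive": ["the cats, which ..., is asleep"],
        "distance": 12}]
dummy_score = lambda src, hyp: float("are" in hyp)
print(contrastive_accuracy_by_distance(toy, dummy_score))   # {12: 1.0}
print(long_range_accuracy(toy, dummy_score))                # 1.0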
Table 5: The results of different architectures on newstest sets and ContraWSD. PPL is the perplexity on the validation set. Acc means accuracy on the test set.
In addition, we also compare to the best result reported for DE→EN, achieved by uedin-wmt17 (Sennrich et al., 2017), which is an ensemble of four different models, reranked with right-to-left models.6 uedin-wmt17 is based on the bi-deep RNNs (Miceli Barone et al., 2017) that we mentioned before. To the original 5.9 million sentence pairs in the training set, they add 10 million synthetic pairs with back-translation.

6 https://github.com/a-rios/ContraWSD/tree/master/baselines

5.2 Overall Results

Table 5 gives the performance of all the architectures, including the perplexity on validation sets, the BLEU scores on newstest, and the accuracy on ContraWSD. Transformers distinctly outperform RNNS2S and ConvS2S models on DE→EN and DE→FR. Moreover, the Transformer model on DE→EN also achieves higher accuracy than uedin-wmt17, although its BLEU score on newstest2017 is 1.4 lower than that of uedin-wmt17. We attribute this discrepancy between BLEU and WSD performance to the use of synthetic news training data in uedin-wmt17, which causes a large boost in BLEU due to better domain adaptation to newstest, but which is less helpful for ContraWSD, whose test set is drawn from a variety of domains.

For DE→EN, RNNS2S and ConvS2S have the same BLEU score on newstest2014, while ConvS2S has a higher score on newstest2017. However, the WSD accuracy of ConvS2S is 1.7% lower than that of RNNS2S. For DE→FR, ConvS2S achieves slightly better results than RNNS2S on both BLEU scores and accuracy.

The Transformer model strongly outperforms the other architectures on this WSD task, with a gap of 4–8 percentage points. This affirms our hypothesis that Transformers are strong semantic feature extractors.

5.3 Hybrid Encoder-Decoder Model

In recent work, Chen et al. (2018) find that hybrid architectures with a Transformer encoder and an RNN decoder can outperform a pure Transformer model. They speculate that the Transformer encoder is better at encoding or extracting features than the RNN encoder, whereas the RNN is better at conditional language modeling.

For WSD, it is unclear whether the most important component is the encoder, the decoder, or both. Following the hypothesis that Transformer encoders excel as semantic feature extractors, we train a hybrid encoder-decoder model (TransRNN) with a Transformer encoder and an RNN decoder.

The results (in Table 5) show that TransRNN performs better than RNNS2S, but worse than the pure Transformer, both in terms of BLEU and WSD accuracy. This indicates that WSD is not only done in the encoder, but that the decoder also affects WSD performance. We note that Chen et al. (2018) and Domhan (2018) introduce the techniques in Transformers into RNN-based models, with reportedly higher BLEU. Thus, it would be interesting to see whether the same result holds true with their architectures.
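To picture the TransRNN configuration, here is a minimal PyTorch sketch of a Transformer encoder feeding a GRU decoder that attends over the encoder states. It is a schematic of the idea only, not the architecture trained above: layer counts, dimensions, the choice of GRU, and the single attention-plus-residual output layer are all assumptions, and positional embeddings are omitted for brevity.

import torch
import torch.nn as nn

class TransRNN(nn.Module):
    """Hybrid NMT sketch: Transformer encoder + GRU decoder with attention."""

    def __init__(self, src_vocab, tgt_vocab, d_model=512, nhead=8,
                 enc_layers=6, dec_hidden=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_layers)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.decoder_rnn = nn.GRU(d_model, dec_hidden, batch_first=True)
        # Attention of decoder states over encoder states.
        self.attention = nn.MultiheadAttention(dec_hidden, nhead, batch_first=True)
        self.out = nn.Linear(dec_hidden, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        # Encode the source sentence with self-attention layers.
        # (Positional embeddings for the encoder input are omitted for brevity.)
        memory = self.encoder(self.src_embed(src_tokens))
        # Run the target prefix through the recurrent decoder.
        dec_states, _ = self.decoder_rnn(self.tgt_embed(tgt_tokens))
        # Attend over the encoder states and predict the next tokens.
        context, _ = self.attention(dec_states, memory, memory)
        return self.out(dec_states + context)

# Toy forward pass with random token ids (batch of 2, source length 7, target length 5).
model = TransRNN(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))
tgt = torch.randint(0, 1000, (2, 5))
print(model(src, tgt).shape)   # torch.Size([2, 5, 1000])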
6 Conclusion

In this paper, we evaluate three popular NMT architectures, RNNS2S, ConvS2S, and Transformers, on subject-verb agreement and WSD by scoring contrastive translation pairs.

We test the theoretical claim that shorter path lengths help models capture long-range dependencies better. Our experimental results show that:

• There is no evidence that CNNs and Transformers, which have shorter paths through networks, are empirically superior to RNNs in modeling subject-verb agreement over long distances.

• The number of heads in multi-head attention affects the ability of a Transformer to model long-range dependencies in the subject-verb agreement task.

• Transformer models excel at another task, WSD, compared to the CNN and RNN architectures we tested.

Lastly, our findings suggest that assessing the performance of NMT architectures means finding their inherent trade-offs, rather than simply computing their overall BLEU score. A clear understanding of those strengths and weaknesses is important to guide further work. Specifically, given the idiosyncratic limitations of recurrent and self-attentional models, combining them is an exciting line of research. The apparent weakness of CNN architectures on long-distance phenomena is also a problem worth tackling, and we can find inspiration from related work in computer vision (Xu et al., 2014).

Acknowledgments

We thank all the anonymous reviewers and Joakim Nivre, who gave many valuable and insightful comments. We appreciate the grants provided by the Erasmus+ Programme and Anna Maria Lundin's scholarship committee. GT is funded by the Chinese Scholarship Council (grant number 201607110016). MM, AR and RS have received funding from the Swiss National Science Foundation (grant number 105212 169888).

References
Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
Jean-Philippe Bernardy and Shalom Lappin. 2017. Using Deep Neural Networks to Learn Syntactic Agreement. LiLT (Linguistic Issues in Language Technology), 15(2):1–15.
Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86. Association for Computational Linguistics.
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
Tobias Domhan. 2018. How much attention do you need? A granular analysis of neural machine translation architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1799–1808. Association for Computational Linguistics.
Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, pages 1243–1252, Sydney, Australia. PMLR.
Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast Neural Machine Translation in C++. arXiv preprint arXiv:1804.00344.
Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Continuous Translation Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA. Association for Computational Linguistics.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, California, USA.
Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.
Antonio Valerio Miceli Barone, Jindřich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch. 2017. Deep architectures for Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation, pages 99–107, Copenhagen, Denmark. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.
Annette Rios, Laura Mascarell, and Rico Sennrich. 2017. Improving Word Sense Disambiguation in Neural Machine Translation with Sense Embeddings. In Proceedings of the Second Conference on Machine Translation, pages 11–19, Copenhagen, Denmark. Association for Computational Linguistics.
Rico Sennrich. 2017. How Grammatical is Character-level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 376–382, Valencia, Spain. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., Montréal, Canada.
Gongbo Tang, Fabienne Cap, Eva Pettersson, and Joakim Nivre. 2018. An evaluation of neural machine translation models on historical spelling normalization. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1320–1331. Association for Computational Linguistics.
Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).
Ke Tran, Arianna Bisazza, and Christof Monz. 2018. The Importance of Being Recurrent for Modeling Hierarchical Structure. arXiv preprint arXiv:1803.03585.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30, pages 6000–6010. Curran Associates, Inc.
Yichong Xu, Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, and Zheng Zhang. 2014. Scale-invariant convolutional neural networks. arXiv preprint arXiv:1411.6369.
Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923.