former and CNN for efficient text classification. Similar to the Transformer, ACT also has a multi-head structure that jointly performs attention operations in different subspaces. However, instead of self-attention, a novel attentive convolution mechanism is performed in each attention head to better capture local n-gram features. Different from a conventional CNN, the proposed attentive convolution utilizes the semantic meaning of the convolutional filters attentively and transforms texts from the complex word space into a more informative convolutional filter space. This not only simplifies the optimization of capturing important n-grams for classification, but also allows our model to learn meaningful convolutional filters, since all the filters contribute to the final representation directly. Compared with self-attention, the proposed attentive convolution focuses more on learning important local n-gram features globally, which are invariant to the specific inputs. These n-gram features are exactly the keywords and phrases that are crucial for text classification. While the majority of existing works augment the Transformer with conventional CNNs to improve locality modeling at the cost of introducing more parameters (Yu et al. 2018; Mohamed, Okhonko, and Zettlemoyer 2019; Yang et al. 2019a; Gulati et al. 2020), our work is a more lightweight approach and is the first to utilize the semantic meaning of convolutional filters with an attention mechanism.

The proposed ACT is also sequence-to-sequence, with an additional global representation output obtained by keeping the max-pooling functionality of CNNs. Therefore, it is able to capture both local and global features while preserving sequential information. Furthermore, we propose a global attention mechanism that summarizes the outputs of ACT and obtains the final representation by taking local, global, and position information into consideration. Experiments are conducted on typical text classification tasks including sentiment analysis and topic categorization, as well as the more challenging relation extraction task. We present detailed analyses of ACT; the results show that ACT is a lightweight and efficient universal text classifier, outperforming existing CNN-based, RNN-based, and attentive models including the Transformer.
2 Attentive Convolutional Transformer

We present the proposed ACT in detail in this section. The attentive convolution mechanism of ACT is introduced in Section 2.1; the multi-head multi-layer structure of ACT is described in Section 2.2; the global attention mechanism for the final text representation is presented in Section 2.3.

2.1 Attentive Convolution Mechanism

The attentive convolution mechanism is the fundamental operation of ACT. It first performs n-gram convolution over the text, then transforms the text into the convolutional filter space by combining the filters attentively. With different utilizations of the feature map as attention weights, the attentive convolution mechanism is able to capture both local and global features of texts. The architecture of the proposed attentive convolution is shown in Figure 1 (left).
Local feature representation Given a text input t = [t_1, t_2, ..., t_l], we first represent each word token t_i as a word embedding q_i ∈ R^{d_w} and obtain the input embeddings Q = [q_1, q_2, ..., q_l] by looking up the word embedding matrix W_wrd ∈ R^{d_w × V}, where d_w is the dimension of the word embeddings and V is the vocabulary size. Then, n-gram convolution over the input embeddings Q is performed using convolutional filters F = [f_1, f_2, ..., f_m], where f_i ∈ R^{n d_w} and n is the convolution kernel size. A feature map matrix M ∈ R^{m × l} is generated as follows:

M = Q ~ F    (1)

where ~ indicates the convolution operation of f_i over Q. Specifically, each value in the feature map is calculated as shown in Equation 2:

m_ij = f(f_i^T · Cat(q_j, q_{j+1}, ..., q_{j+n-1}) + b)    (2)

where Cat means concatenation, f is a non-linear activation function, and b is a bias term.

The values in the resulting feature map indicate the semantic relevance between n-grams and convolutional filters. By treating the feature map values as attention weights and aggregating the semantic convolutional filters attentively, we transform each n-gram from the complex word space into a more informative convolutional filter space while preserving the sequential information of the text. Formally, the attentive convolution for local feature representation is shown in Equation 3:

O = F · M = F · (Q ~ F)    (3)

where O = [o_1, o_2, ..., o_l] ∈ R^{n d_w × l} is the output obtained from the attentive convolution.

Different from self-attention, whose output space is still a complex word space with varying components depending on the input, the output space in our proposed attentive convolution mechanism is formed by the n-gram convolutional filters, which are learned globally and are invariant to the inputs. In such a space, important n-grams will be close to the corresponding filters and irrelevant n-grams will have small values. Therefore, the important local features (n-grams) appearing in the texts can be captured effectively.
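For concreteness, Equations (1)–(3) can be sketched in a few lines of NumPy. This is a minimal illustration under assumed conventions (embeddings stored column-wise, a per-filter bias, and zero-padding on the right so that every position starts an n-gram); it is not the authors' implementation.

```python
import numpy as np

def attentive_convolution_local(Q, F, b, n, f=np.tanh):
    """Minimal sketch of Eqs. (1)-(3).
    Q: (d_w, l)    input embeddings, one column per token
    F: (n*d_w, m)  m n-gram convolutional filters
    b: (m,)        bias term (assumed per-filter)
    Returns the feature map M (m, l) and the local features O (n*d_w, l)."""
    d_w, l = Q.shape
    m = F.shape[1]
    # Assumption: zero-pad on the right so every position j starts an n-gram.
    Q_pad = np.concatenate([Q, np.zeros((d_w, n - 1))], axis=1)
    M = np.empty((m, l))
    for j in range(l):
        ngram = Q_pad[:, j:j + n].T.reshape(-1)   # Cat(q_j, ..., q_{j+n-1})
        M[:, j] = f(F.T @ ngram + b)              # Eq. (2): m_ij for all filters i
    O = F @ M                                     # Eq. (3): attentive combination of filters
    return M, O
```

Each column of O is thus an attention-weighted sum of the filters, i.e., the n-gram at that position expressed in the filter space.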
Global feature representation Besides local features, the attentive convolution mechanism can also capture global features of texts by applying the max-pooling technique that is normally used in conventional CNNs. Max-pooling over each row of the feature map M finds the overall relevance of the text to each convolutional filter. By aggregating the convolutional filters attentively using the max-pooling results, we can find the overall semantics of the text in the filter space. Formally, the attentive convolution for global feature representation is shown in Equation 4:

g = F · max(M)    (4)

where g ∈ R^{n d_w} and max means row-wise max-pooling.
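Continuing the sketch above, the global representation of Equation (4) reuses the same feature map M; again, this is an illustrative sketch rather than the authors' code.

```python
def attentive_convolution_global(M, F):
    """Minimal sketch of Eq. (4).
    M: (m, l) feature map, F: (n*d_w, m) filters -> g: (n*d_w,)."""
    return F @ M.max(axis=1)   # row-wise max-pooling, then attentive filter combination

# Toy usage with the local sketch above (shapes are illustrative only):
# d_w, l, n, m = 8, 12, 3, 5
# Q, F, b = np.random.randn(d_w, l), np.random.randn(n * d_w, m), np.zeros(m)
# M, O = attentive_convolution_local(Q, F, b, n)
# g = attentive_convolution_global(M, F)
```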
Figure 1: Left: attentive convolution mechanism. Outputs are obtained by combining convolutional filters attentively, utilizing the feature map as attention weights. Right: multi-head multi-layer structure of ACT. h and N indicate the number of attentive convolution heads and layers respectively.
Comparison with existing methods Compared with a conventional CNN, whose outputs come from the feature maps only, our proposed attentive convolution utilizes both the feature maps and the semantic meaning of the convolutional filters for text representation. This allows our model to learn meaningful convolutional filters effectively, since all the filters contribute to the final representation directly. Moreover, the pooling operation in a conventional CNN ignores the sequential information of texts, whereas the local feature representation in our method preserves the sequential information while capturing important n-gram features.

Compared with the conventional attention mechanism, whose attention weights are calculated from the vector product of queries (Q) and keys (K), our proposed method calculates attention weights through convolution of the queries (Q) using the keys (F), where the keys and values in our attention mechanism are convolutional filters learned during end-to-end training. The convolution operation involves a wider context (n-grams) than the vector product of single words, which allows our model to capture important n-gram features more effectively. These n-gram features are exactly the keywords and phrases that are crucial for text classification. Besides, as mentioned in Section 2.1, the output space is more simplified and informative since it is formed by convolutional filters that are invariant to the inputs.
2.2 Multi-head Multi-layer Attentive Convolution

Inspired by the Transformer (Vaswani et al. 2017), the proposed ACT also has multi-head and multi-layer structures, as shown in Figure 1 (right).

For h-head ACT, we first linearly transform the input embeddings Q h times and perform h attentive convolutions simultaneously. Then the outputs from the different attention heads are concatenated together and linearly transformed back to the original input dimension, as shown in Equation 5:

MultiHead(Q) = W^O · Cat(O_1, O_2, ..., O_h), where O_i = AttenConv(W_i^Q · Q)    (5)

Here, AttenConv indicates the proposed attentive convolution mechanism, and W_i^Q ∈ R^{(d_w/h) × d_w} and W^O ∈ R^{d_w × n d_w} are the weight matrices of the linear transformations. Furthermore, we adopt the residual connection and layer normalization as used in Vaswani et al. (2017). For multi-layer ACT, we simply pass the local representations of the lower layer to the input of the upper layer to obtain higher-level local representations. The global representation is obtained from the top ACT layer.

The multi-head structure of ACT allows our model to jointly capture important n-gram features in different sub-word spaces, where the n-grams in different spaces have different contributions to the final representation. The multi-layer structure allows our model to capture higher-level semantics effectively. Since the upper layer involves a wider context for convolution, it is able to induce more abstract and discriminative representations.
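A minimal sketch of Equation (5), reusing attentive_convolution_local from the earlier sketch. The per-head shapes are our reading of the text; the residual connection and layer normalization mentioned above are omitted for brevity.

```python
import numpy as np

def multi_head_attentive_convolution(Q, heads, W_O, n, f=np.tanh):
    """Sketch of Eq. (5). `heads` is a list of h per-head parameter dicts, each
    with an input projection W_Q of shape (d_w/h, d_w), filters F of shape
    (n*d_w/h, m), and a bias b of shape (m,). W_O has shape (d_w, n*d_w)."""
    outputs = []
    for head in heads:
        Q_i = head["W_Q"] @ Q                      # project Q into the head's subspace
        _, O_i = attentive_convolution_local(Q_i, head["F"], head["b"], n, f)
        outputs.append(O_i)                        # O_i = AttenConv(W_i^Q Q)
    O_cat = np.concatenate(outputs, axis=0)        # Cat(O_1, ..., O_h), shape (n*d_w, l)
    return W_O @ O_cat                             # back to the original dimension d_w
```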
2.3 Global Attention and Classification

To obtain the final representation of texts for classification, we propose a global attention mechanism that summarizes the sequential outputs of ACT. As shown in Figure 2, the attention weights are calculated by taking both the local and global representations, as well as the position information of each token, into consideration.

The local representation O ∈ R^{d_w × l} and the global representation g ∈ R^{d_w} are obtained from the top layer of ACT. The position embedding P ∈ R^{d_p × l} is obtained by mapping each token's absolute position to a d_p-dimensional embedding based on a trainable position embedding matrix W_p ∈ R^{d_p × P}, where P is the total number of positions.
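The exact scoring function of the global attention does not appear in this excerpt, so the sketch below is only one plausible additive-attention reading of the description above: each token's score is computed from its local representation, the global representation, and its position embedding, followed by a softmax and a weighted sum. The parameters W and v are hypothetical.

```python
import numpy as np

def global_attention(O, g, P, W, v):
    """Assumed sketch of the global attention in Section 2.3.
    O: (d_w, l) local representations, g: (d_w,) global representation,
    P: (d_p, l) position embeddings. W: (d_a, 2*d_w + d_p) and v: (d_a,)
    are hypothetical trainable parameters of an additive scoring function."""
    l = O.shape[1]
    scores = np.empty(l)
    for i in range(l):
        feat = np.concatenate([O[:, i], g, P[:, i]])   # local + global + position
        scores[i] = v @ np.tanh(W @ feat)              # assumed additive attention score
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                               # softmax over the l token positions
    return O @ alpha                                   # final text representation, shape (d_w,)
```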
[Figure 2: the global attention mechanism, combining the position embeddings p_1 ... p_l, the local representations o_1 ... o_l, and the global representation g into attention weights α_1 ... α_l and the final representation.]

AG's News contains news articles from four categories: world, entertainment, sports, and business; DBPedia is an ontology classification dataset containing 14 non-overlapping categories picked from DBpedia 2014.

For relation extraction, we use the TACRED and SemEval2010-task8 (SemEval) datasets, which contain hand-annotated subject and object entities as well as the relation type between the entities. TACRED is a large-scale and complex relation extraction dataset constructed by Zhang et al. (2017b), which has 41 relation types and a no relation class; SemEval2010-task8 (Hendrickx et al. 2009) is a relatively smaller relation extraction dataset which has 9 directed relations and 1 other relation.
Datasets                         Types      Classes  Average lengths  Train samples  Test samples
Yelp Review Polarity (Yelp P.)   Sentiment  2        156              560,000        38,000
Yelp Review Full (Yelp F.)       Sentiment  5        158              650,000        50,000
AG's News (AGNews)               Topic      4        44               120,000        7,600
DBPedia                          Topic      14       55               560,000        70,000
TACRED                           Relation   41       36               90,755         15,509
SemEval2010-task8 (SemEval)      Relation   19       19               8,000          2,717

Table 1: Statistics of the six text classification datasets used in our experiments.
Table 2: Left: classification accuracy (%) on sentiment analysis and topic categorization tasks. Right: F1 scores on the relation extraction task; official micro-averaged and macro-averaged F1 scores are used for the TACRED and SemEval2010-task8 datasets respectively. * means the results are obtained from our implementation. / means not reported. All other results are directly cited from the respective papers mentioned in Section 3.2.
during training. The weight and learning rate for the center loss are 0.001 and 0.1 respectively. The models are trained using SGD with an initial learning rate of 0.01 and momentum of 0.9. The learning rate is decayed with a rate of 0.9 after 10 epochs if the score on the development set does not improve. The batch size is set to 100 and the model is trained for 70 epochs. The dimensions of the global attention and position embedding are 200 and 60 respectively. We use GeLUs (Hendrycks and Gimpel 2016) for all the non-linear activation functions.

The hyper-parameters of ACT are selected by grid search (refer to Section 4.2 for details). Specifically, for sentiment analysis and topic categorization, we set aside 10% of the training data as the development set to tune model hyper-parameters. We report the average classification accuracy on the test set based on 5 independent runs. For ACT, we use a 3-layer encoder with 6 attentive convolution heads in each layer, and m = 100 convolutional filters with a kernel size of 3 in the attentive convolution mechanism. For relation extraction, we use the same settings as Zhang et al. (2017b) for a fair comparison with baseline models. In particular, instead of using absolute positions in the global attention, we use two relative positions for each token with respect to the two target entities. Each relative position embedding has a dimension of 30, and they are concatenated together as the final position embedding. For ACT, we use a one-layer encoder with 6 attentive convolution heads in each layer, and m = 40 convolutional filters with a kernel size of 3 in the attentive convolution mechanism.
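For readability, the training setup reported above can be summarized as a plain configuration dict; the key names below are ours, not identifiers from the authors' code.

```python
# Summary of the reported training configuration (sentiment/topic setup).
TRAINING_CONFIG = {
    "optimizer": "SGD",
    "initial_learning_rate": 0.01,
    "momentum": 0.9,
    "lr_decay_rate": 0.9,            # applied after 10 epochs without dev-set improvement
    "lr_decay_patience_epochs": 10,
    "batch_size": 100,
    "epochs": 70,
    "center_loss_weight": 0.001,
    "center_loss_learning_rate": 0.1,
    "global_attention_dim": 200,
    "position_embedding_dim": 60,
    "activation": "GELU",            # Hendrycks and Gimpel (2016)
    "encoder_layers": 3,             # 1 for relation extraction
    "attention_heads": 6,
    "num_filters": 100,              # m; 40 for relation extraction
    "kernel_size": 3,
}
```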
3.4 Results and Analysis

Experimental results on the six text classification datasets are shown in Table 2. The left table shows the classification accuracy on the sentiment analysis and topic categorization tasks; the right table shows the F1 scores on the relation extraction task. Our proposed ACT achieves the best performance among all the baseline models for the majority of datasets. For the SemEval dataset, ACT ranks 2nd best and has comparable performance with C-GCN, a sophisticated model for relation extraction.

Compared with CNN-based models, ACT performs better than shallow CNNs (word/character-level), the graph convolution network (GCN), and the deep CNN (VDCNN) by a significant margin. The reason is that ACT is able to capture both local n-gram features and global dependencies effectively while preserving sequential information. Besides, the learning of convolutional filters is more efficient using the proposed attentive convolution mechanism, where the semantic meanings of the filters are utilized for text representation.

Compared with RNN-based models, ACT consistently outperforms the standard LSTM and improved variants of LSTM (D-LSTM, Skim-LSTM, and PA-LSTM) for all the tasks. This is credited to the attentive convolution mechanism for better capturing n-gram features, as well as the multi-head multi-layer structure that does not suffer from the gradient vanishing problem when capturing long-distance dependencies.
Figure 3: Hyper-parameter study on ACT. The X-coordinate indicates the hyper-parameters studied; the Y-coordinate indicates classification accuracy for the Yelp F. dataset and micro-averaged F1 score for the TACRED dataset.
The contextualized GCN (C-GCN), which combines a bi-LSTM with a GCN, performs slightly better than ACT on the SemEval dataset, probably due to the benefits of dependency trees. Our model does not require any dependency parsing of the sentences.

It is observed that attentive models generally outperform RNN-based models. This is due to the better ability of attention mechanisms in capturing long-distance dependencies, especially the self-attention used in the Transformer. The proposed ACT outperforms all the attentive models including the Transformer encoder. The reason is that ACT has better local n-gram feature extraction capability by using the attentive convolution mechanism. However, important n-gram features may not be captured effectively by the Transformer because each token attends to the whole sequence instead of n-grams, so the output may be affected by irrelevant tokens. Besides, ACT also simplifies the optimization because it transforms the text representation from the complex word space to the more informative filter space, leading to more stable training and better keyword extraction capability.

The recently proposed knowledge-attention and self-attention integrated model (Li et al. 2019) performs as well as ACT on the relation extraction task, with the aid of external lexical resources to better capture the keywords of relations. Encouragingly, our proposed ACT is able to capture such keywords effectively without the need for external knowledge resources, yet achieves better performance.

Model                   Yelp F.  TACRED
ACT                     68.3     67.8
1. − Attentive Conv.    67.1     66.5
2. − Multi-head         67.0     65.9
3. − Global rep.        67.6     67.1
4. − Position embed.    67.4     63.5

Table 3: Ablation study on ACT. Accuracy (%) and micro-averaged F1 score are reported on the development sets of Yelp F. and TACRED respectively.

representation. (2) The proposed multi-head structure outperforms the single-head variant significantly, showing the effectiveness of jointly capturing n-gram features in different sub-word spaces in the multi-head structure. (3) Removing the global representation in the global attention degrades the performance by 1%. This demonstrates that incorporating the global representation into the attention mechanism yields better attention weights for the local representations. (4) After removing the position embeddings in the global attention, the performance drops by 1.3% for Yelp F. and 6.3% for TACRED. This shows that position information is important for text classification, especially for the relation extraction task.
Sample sentence (relation extraction): "OBJ-PERSON returned to Buffalo in 1955 and was a part of a group of black intellectuals who included philosopher and poet SUBJ-PERSON SUBJ-PERSON , whom she married in 1958 ." True class: spouse. Transformer prediction: no relation; ACT prediction: spouse.

Sample sentence (sentiment): "When I worked at the Renaissance tower , I 'd come here when I was too lazy to walk down the street for something better . Because , honestly , their pizza just is n't that great . Or good , really . But I 've had the breakfast muffin twice and both times it was beyond awesome ! Just the right amount of grease to let you know it 's good . And super cheap !" True class: 3 star. Transformer prediction: 5 star; ACT prediction: 3 star.

Table 4: Attention visualization for Transformer and ACT. For each sample, the visualization of the Transformer is presented first, followed by our proposed ACT. Words are highlighted based on the attention weights assigned to them. Best viewed in color.
sensitive to the number of convolutional filters. However, the larger dataset (Yelp F.) requires more filters than the smaller dataset (TACRED) to achieve the best performance.

Model        # param.  Inf. time
Transformer  3.38M     0.19s
ACT          1.49M     0.07s
References
Bilan, I.; and Roth, B. 2018. Position-aware Self-attention with Relative Positional Encodings for Slot Filling. arXiv preprint arXiv:1807.03052.
Conneau, A.; Schwenk, H.; Barrault, L.; and Lecun, Y. 2016. Very deep convolutional networks for natural language processing. arXiv preprint arXiv:1606.01781.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.
Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv preprint arXiv:2005.08100.
Guo, M.; Zhang, Y.; and Liu, T. 2019. Gaussian transformer: a lightweight approach for natural language inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 6489–6496.
Hendrickx, I.; Kim, S. N.; Kozareva, Z.; Nakov, P.; Ó Séaghdha, D.; Padó, S.; Pennacchiotti, M.; Romano, L.; and Szpakowicz, S. 2009. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, 94–99. Association for Computational Linguistics.
Hendrycks, D.; and Gimpel, K. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751.
Le, H. T.; Cerisara, C.; and Denis, A. 2018. Do convolutional networks need to be deep for text classification? In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.
Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; and Graf, H. P. 2017. Pruning Filters for Efficient ConvNets. In 5th International Conference on Learning Representations, ICLR 2017.
Li, P.; and Mao, K. 2019. Knowledge-oriented convolutional neural network for causal relation extraction from natural language texts. Expert Systems with Applications 115: 512–523.
Li, P.; Mao, K.; Yang, X.; and Li, Q. 2019. Improving Relation Extraction with Knowledge-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 229–239.
Li, Q.; Li, P.; Mao, K.; and Lo, E. Y.-M. 2020. Improving convolutional neural network for text classification by recursive data pruning. Neurocomputing 414: 143–152.
Mohamed, A.; Okhonko, D.; and Zettlemoyer, L. 2019. Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660.
Nguyen, T. H.; and Grishman, R. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 39–48.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8): 9.
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140): 1–67.
Seo, M.; Min, S.; Farhadi, A.; and Hajishirzi, H. 2018. Neural speed reading via Skim-RNN. In International Conference on Learning Representations.
Shen, T.; Zhou, T.; Long, G.; Jiang, J.; and Zhang, C. 2018. Bi-directional block self-attention for fast and memory-efficient sequence modeling. In International Conference on Learning Representations.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1): 1929–1958.
Tang, D.; Qin, B.; and Liu, T. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1422–1432.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
Wang, G.; Li, C.; Wang, W.; Zhang, Y.; Shen, D.; Zhang, X.; Henao, R.; and Carin, L. 2018. Joint Embedding of Words and Labels for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2321–2331.
Wang, S.; and Manning, C. D. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, 90–94. Association for Computational Linguistics.
Wen, Y.; Zhang, K.; Li, Z.; and Qiao, Y. 2016. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, 499–515. Springer.
Yang, B.; Tu, Z.; Wong, D. F.; Meng, F.; Chao, L. S.; and Zhang, T. 2018. Modeling Localness for Self-Attention Networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4449–4458.
Yang, B.; Wang, L.; Wong, D. F.; Chao, L. S.; and Tu, Z. 2019a. Convolutional Self-Attention Networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4040–4045.
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; and Le, Q. V. 2019b. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.
Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; and Hovy, E. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480–1489. San Diego, California: Association for Computational Linguistics. doi:10.18653/v1/N16-1174. URL https://www.aclweb.org/anthology/N16-1174.
Yogatama, D.; Dyer, C.; Ling, W.; and Blunsom, P. 2017. Generative and discriminative text classification with recurrent neural networks. In Thirty-fourth International Conference on Machine Learning (ICML 2017). International Machine Learning Society.
Yu, A. W.; Dohan, D.; Luong, M.-T.; Zhao, R.; Chen, K.; Norouzi, M.; and Le, Q. V. 2018. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In International Conference on Learning Representations.
Yu, F.; and Koltun, V. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
Zhang, H.; Xiao, L.; Wang, Y.; and Jin, Y. 2017a. A generalized recurrent neural architecture for text classification with multi-task learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 3385–3391. AAAI Press.
Zhang, J.; Luan, H.; Sun, M.; Zhai, F.; Xu, J.; Zhang, M.; and Liu, Y. 2018. Improving the Transformer Translation Model with Document-Level Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 533–542.
Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, 649–657.
Zhang, Y.; Qi, P.; and Manning, C. D. 2018. Graph Convolution over Pruned Dependency Trees Improves Relation Extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2205–2215.
Zhang, Y.; Zhong, V.; Chen, D.; Angeli, G.; and Manning, C. D. 2017b. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 35–45.
Zhong, P.; Wang, D.; and Miao, C. 2019. Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 165–176.
Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; and Xu, B. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, 207–212.