
The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

ACT: an Attentive Convolutional Transformer for Efficient Text Classification

Pengfei Li,1 Peixiang Zhong,1 Kezhi Mao,1* Dongzhe Wang,2 Xuefeng Yang,2 Yunfeng Liu,2 Jianxiong Yin,3 Simon See3
1 Nanyang Technological University, Singapore
2 ZhuiYi Technology, Shenzhen, China
3 NVIDIA AI Tech Center
{pli006,peixiang001,ekzmao}@ntu.edu.sg, {ethanwang,ryan,glenliu}@wezhuiyi.com, {jianxiongy,ssee}@nvidia.com
* Corresponding author.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Recently, Transformer has been demonstrating promising performance in many NLP tasks and showing a trend of replacing Recurrent Neural Network (RNN). Meanwhile, less attention is drawn to Convolutional Neural Network (CNN) due to its weak ability in capturing sequential and long-distance dependencies, although it has excellent local feature extraction capability. In this paper, we introduce an Attentive Convolutional Transformer (ACT) that takes the advantages of both Transformer and CNN for efficient text classification. Specifically, we propose a novel attentive convolution mechanism that utilizes the semantic meaning of convolutional filters attentively to transform text from the complex word space to a more informative convolutional filter space where important n-grams are captured. ACT is able to capture both local and global dependencies effectively while preserving sequential information. Experiments on various text classification tasks and detailed analyses show that ACT is a lightweight, fast, and effective universal text classifier, outperforming CNNs, RNNs, and attentive models including Transformer.

1 Introduction

Text classification is a fundamental problem behind many research topics in Natural Language Processing (NLP), such as topic categorization, sentiment analysis, and relation extraction. The key issue in text classification is text representation learning, which aims to capture both local and global dependencies of texts with respect to class labels. Compared with traditional bag-of-words/n-gram models (Wang and Manning 2012), deep neural networks have been shown to be more effective since word order information can be utilized and more semantic features can be captured. The commonly adopted neural architectures in deep neural networks include CNN, RNN, and Transformer.

CNN is a special feed-forward neural network with convolutional layers interleaved with pooling layers. For NLP, the convolutional kernels/filters in CNN can be treated as n-gram extractors that convert the n-gram at each position into a vector showing its relevance to the filters. With the help of pooling operations, the overall relevance of the text to each filter can be captured. Therefore, CNN has advantages in capturing semantic and syntactic information of n-grams for more abstract and discriminative representations (Kim 2014; Zhang, Zhao, and LeCun 2015; Li and Mao 2019). However, CNN is relatively weak in capturing sequential information and long-distance dependencies because convolutional filters normally have small kernel sizes focusing only on local n-grams, and the pooling operation results in a loss of position information. Although we could apply dilated CNN (Yu and Koltun 2015) or construct deep CNNs with one layer stacked on another to widen the convolution context to some extent, the performance gain is normally marginal and comes at the cost of needing more data (Le, Cerisara, and Denis 2018). Besides, the convolutional filters in CNN may misfit to task-irrelevant words, hence producing non-discriminative features in the feature map (Li et al. 2017, 2020).

RNN is well known for processing sequential data recurrently and is widely used for text classification (Tang, Qin, and Liu 2015; Yogatama et al. 2017; Zhang et al. 2017a). However, RNN suffers from two problems due to its recurrent nature: gradient vanishing and unfriendliness to parallelization. Many works attempt to alleviate the gradient vanishing problem by incorporating attention mechanisms into RNN (Zhou et al. 2016; Yang et al. 2016; Zhang et al. 2017b). A novel neural architecture called Transformer (Vaswani et al. 2017) addresses both problems by relying entirely on self-attention to handle long-distance dependencies without recurrent computations. The emergence of Transformer-based neural networks has led to a series of breakthroughs in a wide range of NLP tasks (Zhang et al. 2018; Li et al. 2019; Zhong, Wang, and Miao 2019). In particular, pre-trained language models based on Transformer have achieved state-of-the-art performance on many benchmark datasets (Devlin et al. 2019; Radford et al. 2019; Raffel et al. 2020). However, the heavy architecture of Transformer often requires more training data, CPU/GPU memory, and computational power, especially for long texts. Besides, since self-attention takes into account all the elements with a weighted averaging operation that disperses the attention distribution, Transformer may overlook the relations of neighboring elements (i.e., n-grams) that are important for text classification tasks (Yang et al. 2018, 2019a; Guo, Zhang, and Liu 2019).
To address the above-mentioned limitations of CNN and Transformer, we propose an Attentive Convolutional Transformer (ACT) which takes the advantages of both Transformer and CNN for efficient text classification. Similar to Transformer, ACT has a multi-head structure that jointly performs attention operations in different subspaces. However, instead of self-attention, a novel attentive convolution mechanism is performed in each attention head to better capture local n-gram features. Different from conventional CNN, the proposed attentive convolution utilizes the semantic meaning of convolutional filters attentively and transforms texts from the complex word space to a more informative convolutional filter space. This not only simplifies the optimization of capturing important n-grams for classification, but also allows our model to learn meaningful convolutional filters, since all the filters contribute to the final representation directly. Compared with self-attention, the proposed attentive convolution focuses more on learning important local n-gram features globally, which are invariant to the specific inputs. These n-gram features are exactly the keywords and phrases that are crucial for text classification. While the majority of existing works augment Transformer with conventional CNNs to improve locality modeling at the cost of introducing more parameters (Yu et al. 2018; Mohamed, Okhonko, and Zettlemoyer 2019; Yang et al. 2019a; Gulati et al. 2020), our work is a more lightweight approach and is the first to utilize the semantic meaning of convolutional filters with an attention mechanism.

The proposed ACT is also sequence-to-sequence, with an additional global representation output obtained by keeping the max-pooling functionality of CNN. Therefore, it is able to capture both local and global features while preserving sequential information. Furthermore, we propose a global attention mechanism to summarize the outputs of ACT and obtain the final representation by taking local, global, and position information into consideration. Experiments are conducted on typical text classification tasks including sentiment analysis and topic categorization, as well as the more challenging relation extraction task. We present detailed analyses of ACT; the results show that ACT is a lightweight and efficient universal text classifier, outperforming existing CNN-based, RNN-based, and attentive models including Transformer.

2 Attentive Convolutional Transformer

We present the proposed ACT in detail in this section. The attentive convolution mechanism of ACT is introduced in Section 2.1; the multi-head multi-layer structure of ACT is described in Section 2.2; the global attention mechanism for the final text representation is presented in Section 2.3.

2.1 Attentive Convolution Mechanism

The attentive convolution mechanism is the fundamental operation of ACT. It first performs n-gram convolution over the text, then transforms the text into the convolutional filter space by combining the filters attentively. With different utilizations of the feature map as attention weights, the attentive convolution mechanism is able to capture both local and global features of texts. The architecture of the proposed attentive convolution is shown in Figure 1 (left).

Local feature representation. Given a text input t = [t_1, t_2, ..., t_l], we first represent each word token t_i as a word embedding q_i ∈ R^{d_w} and obtain the input embeddings Q = [q_1, q_2, ..., q_l] by looking up the word embedding matrix W^{wrd} ∈ R^{d_w×V}, where d_w is the dimension of the word embeddings and V is the vocabulary size. Then, n-gram convolution over the input embeddings Q is performed using convolutional filters F = [f_1, f_2, ..., f_m], where f_i ∈ R^{n·d_w} and n is the convolution kernel size. A feature map matrix M ∈ R^{m×l} is generated as follows:

M = Q ~ F    (1)

where ~ indicates the convolution operation of f_i over Q. Specifically, the value in the feature map is calculated as shown in Equation 2:

m_ij = f(f_i^T · Cat(q_j, q_{j+1}, ..., q_{j+n-1}) + b)    (2)

where Cat means concatenation, f is a non-linear activation function, and b is a bias term.

The values in the resulting feature map indicate the semantic relevance between n-grams and convolutional filters. By treating the feature map values as attention weights and aggregating the semantic convolutional filters attentively, we transform each n-gram from the complex word space to a more informative convolutional filter space while preserving the sequential information of texts. Formally, the attentive convolution for local feature representation is shown in Equation 3:

O = F · M = F · (Q ~ F)    (3)

where O = [o_1, o_2, ..., o_l] ∈ R^{n·d_w×l} is the output obtained from the attentive convolution.

Different from self-attention, whose output space is still a complex word space with varying components depending on the input, the output space in our proposed attentive convolution mechanism is formed by n-gram convolutional filters which are learned globally and are invariant to the inputs. In such a space, important n-grams will be close to the corresponding filters and irrelevant n-grams will have small values. Therefore, the important local features (n-grams) appearing in the texts can be captured effectively.

Global feature representation. Besides local features, the attentive convolution mechanism can also capture global features of texts by applying the max-pooling technique that is normally used in conventional CNNs. Max-pooling over each row of the feature map M finds the overall relevance of the text to each convolutional filter. By aggregating the convolutional filters attentively using the max-pooling results, we can find the overall semantics of the text in the filter space. Formally, the attentive convolution for global feature representation is shown in Equation 4:

g = F · max(M)    (4)

where g ∈ R^{n·d_w} and max means row-wise max-pooling.
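To make Equations 1-4 concrete, the following is a minimal PyTorch sketch of a single attentive convolution under our own naming; the class name AttentiveConvolution, the variable F_mat, and the GeLU activation with "same" padding are illustrative assumptions, not the authors' released code. Note how the filters act as both keys and values: the feature map M supplies the attention weights over the filter matrix itself.

```python
# Minimal sketch of the attentive convolution mechanism (Eqs. 1-4).
# d_w = word embedding size, m = number of filters, n = kernel size, l = length.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveConvolution(nn.Module):
    def __init__(self, d_w: int, m: int = 100, n: int = 3):
        super().__init__()
        assert n % 2 == 1, "odd kernel size so 'same' padding keeps length l"
        # Each filter f_i corresponds to one row of conv.weight reshaped to n*d_w.
        self.conv = nn.Conv1d(d_w, m, kernel_size=n, padding=n // 2)

    def forward(self, Q: torch.Tensor):
        # Q: (batch, l, d_w) input word embeddings.
        # Eqs. 1-2: feature map M = Q ~ F with a non-linear activation (GeLU here).
        M = F.gelu(self.conv(Q.transpose(1, 2)))              # (batch, m, l)
        # Flatten the filters into the matrix F of the paper: rows are f_i in R^{n*d_w}.
        F_mat = self.conv.weight.permute(0, 2, 1).reshape(M.size(1), -1)
        # Eq. 3: local representation O = F . M, i.e. each position becomes an
        # attention-weighted combination of the filters themselves: (batch, n*d_w, l).
        O = torch.matmul(F_mat.t(), M)
        # Eq. 4: global representation g = F . max(M) via row-wise max-pooling.
        g = torch.matmul(M.max(dim=-1).values, F_mat)          # (batch, n*d_w)
        return O, M, g


# Usage: a batch of 2 sentences of length 20 with 300-d embeddings.
Q = torch.randn(2, 20, 300)
O, M, g = AttentiveConvolution(d_w=300)(Q)
print(O.shape, M.shape, g.shape)   # (2, 900, 20) (2, 100, 20) (2, 900)
```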


Figure 1: Left: attentive convolution mechanism. Outputs are obtained by combining convolutional filters attentively, utilizing the feature map as attention weights. Right: multi-head multi-layer structure of ACT. h and N indicate the number of attentive convolution heads and layers, respectively.

Comparison with existing methods. Compared with conventional CNN, whose outputs come from the feature maps only, our proposed attentive convolution utilizes both the feature maps and the semantic meaning of the convolutional filters for text representation. This allows our model to learn meaningful convolutional filters effectively, since all the filters contribute to the final representation directly. Moreover, the pooling operation in conventional CNN ignores the sequential information of texts, whereas the local feature representation in our method preserves the sequential information while capturing important n-gram features.

Compared with the conventional attention mechanism, whose attention weights are calculated from the vector product of queries (Q) and keys (K), our proposed method calculates attention weights through convolution of the queries (Q) using the keys (F), where the keys and values in our attention mechanism are convolutional filters learned during end-to-end training. The convolution operation involves a wider context (n-grams) than the vector product of single words, which allows our model to capture important n-gram features more effectively. These n-gram features are exactly the keywords and phrases that are crucial for text classification. Besides, as mentioned in Section 2.1, the output space is more simplified and informative since it is formed by convolutional filters that are invariant to the inputs.

2.2 Multi-head Multi-layer Attentive Convolution

Inspired by Transformer (Vaswani et al. 2017), the proposed ACT also has multi-head and multi-layer structures, as shown in Figure 1 (right).

For h-head ACT, we first linearly transform the input embeddings Q h times and perform h attentive convolutions simultaneously. Then the outputs from the different attention heads are concatenated and linearly transformed back to the original input dimension, as shown in Equation 5:

MultiHead(Q) = W^O Cat(O_1, O_2, ..., O_h), where O_i = AttenConv(W_i^Q Q)    (5)

Here, AttenConv indicates the proposed attentive convolution mechanism, and W_i^Q ∈ R^{(d_w/h)×d_w} and W^O ∈ R^{d_w×n·d_w} are the weight matrices of the linear transformations. Furthermore, we adopt the residual connection and layer normalization as used in Vaswani et al. (2017). For multi-layer ACT, we simply pass the local representations of the lower layer to the input of the upper layer to obtain higher-level local representations. The global representation is obtained from the top ACT layer.

The multi-head structure of ACT allows our model to jointly capture important n-gram features in different sub-word spaces, where the n-grams in different spaces have different contributions to the final representation. The multi-layer structure allows our model to capture higher-level semantics effectively. Since the upper layer involves a wider context for convolution, it is able to induce more abstract and discriminative representations.
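Building on the sketch above, the following illustrates one way to assemble the h-head layer of Equation 5 with the residual connection and layer normalization. The per-head filter count, the reuse of W^O to project the global representation down to d_w, and all names are our reading of the paper rather than a reference implementation; the AttentiveConvolution module from the earlier sketch is assumed to be in scope.

```python
# Hedged sketch of the multi-head attentive convolution layer (Eq. 5).
import torch
import torch.nn as nn


class MultiHeadAttentiveConvolution(nn.Module):
    def __init__(self, d_w: int, h: int = 6, m: int = 100, n: int = 3):
        super().__init__()
        assert d_w % h == 0
        d_head = d_w // h
        # W_i^Q projections into h sub-word spaces, one attentive convolution per head.
        self.W_Q = nn.ModuleList([nn.Linear(d_w, d_head, bias=False) for _ in range(h)])
        self.heads = nn.ModuleList([AttentiveConvolution(d_head, m, n) for _ in range(h)])
        # W^O maps the concatenated n*d_w-dimensional outputs back to d_w.
        self.W_O = nn.Linear(n * d_w, d_w, bias=False)
        self.norm = nn.LayerNorm(d_w)

    def forward(self, Q: torch.Tensor):
        # Q: (batch, l, d_w)
        locals_, globals_ = [], []
        for proj, head in zip(self.W_Q, self.heads):
            O_i, _, g_i = head(proj(Q))               # O_i: (batch, n*d_head, l)
            locals_.append(O_i)
            globals_.append(g_i)
        O = torch.cat(locals_, dim=1).transpose(1, 2)  # (batch, l, n*d_w)
        local = self.norm(Q + self.W_O(O))             # residual connection + layer norm
        # Projecting g to d_w with W^O is our assumption; the paper reports
        # g in R^{d_w} at the top layer.
        g = self.W_O(torch.cat(globals_, dim=-1))      # (batch, d_w)
        return local, g
```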
2.3 Global Attention and Classification

To obtain the final representation of texts for classification, we propose a global attention mechanism that summarizes the sequential outputs of ACT. As shown in Figure 2, the attention weights are calculated by taking both the local and global representations as well as the position information of each token into consideration.

The local representation O ∈ R^{d_w×l} and the global representation g ∈ R^{d_w} are obtained from the top layer of ACT. The position embedding P ∈ R^{d_p×l} is obtained by mapping each token's absolute position to a d_p-dimensional embedding based on a trainable position embedding matrix W_p ∈ R^{d_p×P}, where P is the total number of positions.

Figure 2: Global attention mechanism. Attention weights α_i are calculated based on the local representation o_i, the global representation g, and the position embedding p_i of each token.

The final text representation is obtained by Equation 6:

r = O · Softmax( f(W_o O + W_p P)^T c + (O^T g) / √d_w )    (6)

where f is a non-linear activation function, W_o ∈ R^{d_a×d_w} and W_p ∈ R^{d_a×d_p} are linear transformation weight matrices, d_a is the attention dimension, c ∈ R^{d_a} is a context vector learned by the neural network, and √d_w is a scaling factor that depends on the input dimension.

For classification, we pass the final representation r to a classifier consisting of a fully connected layer and a softmax layer to predict class probabilities. Our model is trained by minimizing the categorical cross-entropy loss and the center loss (Wen et al. 2016) using stochastic gradient descent (SGD) with momentum and learning rate decay.
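As an illustration of Equation 6 and the classifier head, the sketch below scores each position from its local representation, the global representation, and a trainable position embedding, then pools the local representations with the resulting weights. The values of max_len and n_classes, and the use of GeLU inside the scoring function and before the output layer, are placeholders and assumptions rather than the paper's exact configuration.

```python
# Hedged sketch of the global attention and classifier head (Eq. 6).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalAttentionClassifier(nn.Module):
    def __init__(self, d_w: int, d_p: int = 60, d_a: int = 200,
                 max_len: int = 512, n_classes: int = 5, d_fc: int = 100):
        super().__init__()
        self.pos = nn.Embedding(max_len, d_p)       # trainable position embeddings
        self.W_o = nn.Linear(d_w, d_a, bias=False)
        self.W_p = nn.Linear(d_p, d_a, bias=False)
        self.c = nn.Parameter(torch.randn(d_a))     # context vector
        self.fc = nn.Linear(d_w, d_fc)              # fully connected layer (dim 100)
        self.out = nn.Linear(d_fc, n_classes)

    def forward(self, O: torch.Tensor, g: torch.Tensor):
        # O: (batch, l, d_w) local representations from the top ACT layer;
        # g: (batch, d_w) global representation.
        batch, l, d_w = O.shape
        P = self.pos(torch.arange(l, device=O.device)).unsqueeze(0)    # (1, l, d_p)
        # Per-token score: f(W_o O + W_p P)^T c + O^T g / sqrt(d_w)    (Eq. 6)
        scores = F.gelu(self.W_o(O) + self.W_p(P)) @ self.c            # (batch, l)
        scores = scores + torch.einsum('bld,bd->bl', O, g) / math.sqrt(d_w)
        alpha = torch.softmax(scores, dim=-1)                          # attention weights
        r = torch.einsum('bld,bl->bd', O, alpha)                       # final representation
        return torch.log_softmax(self.out(F.gelu(self.fc(r))), dim=-1)
```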
tence classification (Kim 2014), CNN with position em-
3 Experiments beddings (CNN-PE) (Nguyen and Grishman 2015), and
We evaluate our proposed ACT on three different text clas- graph convolutional network (GCN) over pruned depen-
sification tasks, including sentiment analysis, topic catego- dency trees of sentences (Zhang, Qi, and Manning 2018).
rization, and relation extraction. Since relation extraction • RNN-based models including standard LSTM and
is slightly different from traditional text classification tasks LSTM with position-aware attention (PA-LSTM) (Zhang
where special considerations are needed for target entities, et al. 2017b).
we conduct experiments on it separately. • CNN-RNN hybrid model including contextualized GCN
(C-GCN) where the input vectors are obtained using bi-
3.1 Datasets LSTM (Zhang, Qi, and Manning 2018).
We use six widely-studied datasets to evaluate our model,
• Attentive models including Transformer encoder (Bi-
two for each text classification task. These datasets are di-
lan and Roth 2018), knowledge-attention encoder (Knwl-
verse in the aspects of type, size, number of classes, and
attn) (Li et al. 2019), and knowledge-attention self-
document length. Table 1 shows the statistics of the datasets.
attention integrated model (Knwl+Self).
For sentiment analysis, we use two datasets constructed
by Zhang et al. (2015) which are obtained from Yelp Dataset
Challenge 2015. Yelp Review Polarity (Yelp P.) is a binary 3.3 Experiment Settings
sentiment classification dataset whose class is either positive In our experiments, word embedding matrix Wwrd is ini-
or negative; Yelp Review Full (Yelp F.) contains more fine- tialized with 300-d Glove word embeddings (Pennington,
grained sentiment classes ranging from rating 1 to 5. Socher, and Manning 2014). The fully connected layer be-
For topic categorization, we use AG’s News (AGNews) fore softmax has a dimension of 100. Dropout regulariza-
and DBPedia datasets created by Zhang et al. (2015). AG- tion (Srivastava et al. 2014) with a rate of 0.4 is applied

Datasets                         Type       Classes  Avg. length  Train samples  Test samples
Yelp Review Polarity (Yelp P.)   Sentiment  2        156          560,000        38,000
Yelp Review Full (Yelp F.)       Sentiment  5        158          650,000        50,000
AG's News (AGNews)               Topic      4        44           120,000        7,600
DBPedia                          Topic      14       55           560,000        70,000
TACRED                           Relation   41       36           90,755         15,509
SemEval2010-task8 (SemEval)      Relation   19       19           8,000          2,717

Table 1: Statistics of the six text classification datasets used in our experiments.

Model            Yelp P.  Yelp F.  AGNews  DBPedia
Word-level CNN   95.40    59.84    91.45   98.58
Char-level CNN   94.75    61.60    90.15   98.34
VDCNN            95.72    64.72    91.33   98.71
LSTM             94.74    58.17    86.06   98.55
D-LSTM           92.60    59.60    92.10   98.70
Skim-LSTM        /        /        93.60   /
Bi-BloSAN        94.56    62.13    93.32   98.77
LEAM             95.31    64.09    92.45   99.02
Transformer      96.13*   65.34*   93.89*  98.98*
ACT              97.41    68.16    94.25   99.19

Model        TACRED  SemEval
CNN          59.3*   70.0*
CNN-PE       61.4*   82.3*
GCN          64.0    /
LSTM         61.5*   80.9*
PA-LSTM      65.1    82.7
C-GCN        66.4    84.8
Knwl-attn    66.4    82.3
Knwl+Self    67.8    84.3
Transformer  66.5    83.1
ACT          68.0    84.5

Table 2: Classification accuracy (%) on the sentiment analysis and topic categorization tasks (upper table) and F1 scores on the relation extraction task (lower table). Official micro-averaged and macro-averaged F1 scores are used for the TACRED and SemEval2010-task8 datasets, respectively. * means the results are obtained from our implementation; / means not reported. All other results are directly cited from the respective papers mentioned in Section 3.2.

3.3 Experiment Settings

In our experiments, the word embedding matrix W^{wrd} is initialized with 300-d GloVe word embeddings (Pennington, Socher, and Manning 2014). The fully connected layer before softmax has a dimension of 100. Dropout regularization (Srivastava et al. 2014) with a rate of 0.4 is applied during training. The weight and learning rate for the center loss are 0.001 and 0.1, respectively. The models are trained using SGD with an initial learning rate of 0.01 and momentum of 0.9. The learning rate is decayed with a rate of 0.9 after 10 epochs if the score on the development set does not improve. The batch size is set to 100 and the model is trained for 70 epochs. The dimensions of the global attention and the position embedding are 200 and 60, respectively. We use GeLUs (Hendrycks and Gimpel 2016) for all the non-linear activation functions.

The hyper-parameters of ACT are selected by grid search (refer to Section 4.2 for details). Specifically, for sentiment analysis and topic categorization, we set aside 10% of the training data as the development set to tune model hyper-parameters. We report the average classification accuracy on the test set based on 5 independent runs. For ACT, we use a 3-layer encoder with 6 attentive convolution heads in each layer, and m = 100 convolutional filters with a kernel size of 3 in the attentive convolution mechanism. For relation extraction, we use the same settings as Zhang et al. (2017b) for a fair comparison with the baseline models. In particular, instead of using absolute positions in the global attention, we use two relative positions for each token with respect to the two target entities. Each relative position embedding has a dimension of 30, and they are concatenated together as the final position embedding. For ACT, we use a one-layer encoder with 6 attentive convolution heads in each layer, and m = 40 convolutional filters with a kernel size of 3 in the attentive convolution mechanism.
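For reference, a hedged training-loop skeleton consistent with these settings might look as follows. The center_loss and evaluate helpers, the model's return signature, and the exact decay trigger (10 epochs without development-set improvement) are our assumptions and one possible reading of the schedule, not released code.

```python
# Illustrative training skeleton matching Section 3.3 (SGD, lr 0.01, momentum 0.9,
# lr decay 0.9 when the dev score stalls, cross-entropy plus 0.001 * center loss).
import torch

def train(model, train_loader, dev_loader, epochs=70, center_weight=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    best_dev, stale = 0.0, 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            log_probs, features = model(x)                  # assumes model also returns r
            loss = torch.nn.functional.nll_loss(log_probs, y)
            loss = loss + center_weight * center_loss(features, y)   # hypothetical helper
            opt.zero_grad()
            loss.backward()
            opt.step()
        dev_score = evaluate(model, dev_loader)             # hypothetical helper
        if dev_score > best_dev:
            best_dev, stale = dev_score, 0
        else:
            stale += 1
            if stale >= 10:                                 # decay lr by 0.9 after 10 epochs
                for group in opt.param_groups:              # without dev improvement
                    group['lr'] *= 0.9
                stale = 0
```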
3.4 Results and Analysis

Experiment results on the six text classification datasets are shown in Table 2. The upper table shows the classification accuracy on the sentiment analysis and topic categorization tasks; the lower table shows the F1 scores on the relation extraction task. Our proposed ACT achieves the best performance among all the compared models on the majority of datasets. On the SemEval dataset, ACT ranks second and has comparable performance with C-GCN, a sophisticated model for relation extraction.

Compared with CNN-based models, ACT performs better than shallow CNNs (word/character-level), the graph convolutional network (GCN), and the deep CNN (VDCNN) by a significant margin. The reason is that ACT is able to capture both local n-gram features and global dependencies effectively while preserving sequential information. Besides, the learning of convolutional filters is more efficient using the proposed attentive convolution mechanism, where the semantic meanings of the filters are utilized for text representation.

Compared with RNN-based models, ACT consistently outperforms standard LSTM and improved variants of LSTM (D-LSTM, Skim-LSTM, and PA-LSTM) for all the tasks. This is credited to the attentive convolution mechanism for better capturing n-gram features, as well as the multi-head multi-layer structure that does not suffer from the gradient vanishing problem when capturing long-distance dependencies. The contextualized GCN (C-GCN) using bi-LSTM and GCN performs slightly better than ACT on the SemEval dataset, probably due to the benefits of dependency trees. Our model does not require any dependency parsing of the sentences.

Figure 3: Hyper-parameter study on ACT. The x-axis indicates the hyper-parameter studied; the y-axis indicates classification accuracy for the Yelp F. dataset and micro-averaged F1 score for the TACRED dataset.

It is observed that attentive models generally outperform RNN-based models. This is due to the better ability of attention mechanisms in capturing long-distance dependencies, especially the self-attention used in Transformer. The proposed ACT outperforms all the attentive models including the Transformer encoder. The reason is that ACT has better local n-gram feature extraction capability through the attentive convolution mechanism. In contrast, important n-gram features may not be captured effectively by Transformer because each token attends to the whole sequence instead of n-grams, so the output may be affected by irrelevant tokens. Besides, ACT also simplifies the optimization because it transforms the text representation from the complex word space to the more informative filter space, leading to more stable training and better keyword extraction capability.

The recently proposed knowledge-attention and self-attention integrated model (Li et al. 2019) performs as well as ACT on the relation extraction task, with the aid of external lexical resources to better capture the keywords of relations. Encouragingly, our proposed ACT is able to capture such keywords effectively without the need for external knowledge resources, yet achieves better performance.

4 Discussions

We present more in-depth analyses and discussions on ACT in this section. Two relatively different datasets are used to conduct our experiments: one is Yelp F., a large dataset for sentiment analysis; the other is TACRED, a relation extraction dataset which is much smaller. We report accuracy and micro-averaged F1 score on the development sets of Yelp F. and TACRED, respectively.

4.1 Ablation Study

We perform an ablation study on ACT to investigate the contributions of its specific components. Results are shown in Table 3.

Model                   Yelp F.  TACRED
ACT                     68.3     67.8
1. − Attentive Conv.    67.1     66.5
2. − Multi-head         67.0     65.9
3. − Global rep.        67.6     67.1
4. − Position embed.    67.4     63.5

Table 3: Ablation study on ACT. Accuracy (%) and micro-averaged F1 score are reported on the development sets of Yelp F. and TACRED, respectively.

(1) We replace the proposed attentive convolution mechanism with conventional CNN, where the feature maps are used for text representation directly; the performance drops by 1.8-1.9%. This demonstrates the advantage of utilizing the semantic meaning of convolutional filters attentively for text representation. (2) The proposed multi-head structure outperforms single-head significantly, showing the effectiveness of jointly capturing n-gram features in different sub-word spaces in the multi-head structure. (3) Removing the global representation in the global attention degrades the performance by 1%. This demonstrates that incorporating the global representation into the attention mechanism yields better attention weights for the local representations. (4) After removing the position embeddings in the global attention, the performance drops by 1.3% for Yelp F. and 6.3% for TACRED. This shows that position information is important for text classification, especially for the relation extraction task.

4.2 Hyper-parameter Study

In this section, we study the influence of some important hyper-parameters on the performance of ACT, including the number of layers, the number of attentive convolution heads, the kernel size, and the number of filters in the attentive convolution. Experiment results are shown in Figure 3.

It is observed that the number of ACT layers affects the performance significantly. For small datasets like TACRED, single-layer ACT achieves the best performance. For large datasets like Yelp F., the optimal number of layers is 3. Further increasing the number of layers will increase model complexity and cause a performance drop due to overfitting. Besides, multiple attentive convolution heads are beneficial for ACT, and the optimal number of heads is 6. For the kernel size, results show that 3-gram convolution is most effective for ACT.¹ It is also observed that ACT is not very sensitive to the number of convolutional filters. However, the larger dataset (Yelp F.) requires more filters than the smaller dataset (TACRED) to achieve the best performance.

¹ We also tried using multiple kernel sizes simultaneously; results show no improvements over a single kernel size.

Sample sentence 1 (TACRED), true class: spouse
  Transformer prediction: no relation
  ACT prediction: spouse
  "OBJ-PERSON returned to Buffalo in 1955 and was a part of a group of black intellectuals who included philosopher and poet SUBJ-PERSON SUBJ-PERSON , whom she married in 1958 ."

Sample sentence 2 (Yelp F.), true class: 3 star
  Transformer prediction: 5 star
  ACT prediction: 3 star
  "When I worked at the Renaissance tower , I 'd come here when I was too lazy to walk down the street for something better . Because , honestly , their pizza just is n't that great . Or good , really . But I 've had the breakfast muffin twice and both times it was beyond awesome ! Just the right amount of grease to let you know it 's good . And super cheap !"

Table 4: Attention visualization for Transformer and ACT. For each sample, words are highlighted based on the attention weights assigned to them by Transformer and by ACT, respectively. Best viewed in color.

4.3 Attention Visualization

To investigate what ACT focuses on, as well as its difference from Transformer, we visualize the attention weights assigned to words. We sample sentences from the development sets of Yelp F. and TACRED. Two of the visualizations are shown in Table 4.

The visualization results show that the proposed ACT can capture the keywords and cue phrases more effectively than Transformer. It is observed that Transformer attends to a wide range of words in the sentence, including stop words and punctuation which may be irrelevant for the classification task. On the contrary, ACT pays more attention to the important n-grams such as “married” for the “spouse” relation and “is n’t” for fine-grained sentiment classification. These n-grams are the keywords and cue phrases of a certain class which are crucial for classification tasks.

4.4 Model Size and Inference Speed

In this section, we investigate two practical aspects of our model for real-world applications: model size and inference speed. For model size, we report the number of model parameters. For inference speed, we report the average time needed to compute a single batch (batch size of 100) of the Yelp F. dataset using an NVIDIA Tesla P40 GPU with an Intel Xeon E5-2667 CPU. We also compare our model with Transformer under the same hyper-parameter settings as described in Section 3.3; results are shown in Table 5.

Model        # param.  Inf. time
Transformer  3.38M     0.19s
ACT          1.49M     0.07s

Table 5: Comparison of model parameters and inference time per batch on the Yelp F. dataset.

The proposed ACT is much smaller and faster compared with Transformer: it has 56% fewer parameters and 2.7 times faster inference speed. Therefore, ACT is a lightweight and efficient model for text classification, and it is more practical for real-world applications. Although large pre-trained language models based on Transformer such as BERT (Devlin et al. 2019) and XLNet (Yang et al. 2019b) have achieved state-of-the-art performance in many NLP tasks, their memory and speed constraints can become obstacles for practical applications.

5 Conclusion and Future Work

We introduce an Attentive Convolutional Transformer (ACT) for efficient text classification. By taking the advantages of both Transformer and CNN, ACT is able to capture both local and global dependencies effectively while preserving the sequential information of texts. In particular, a novel attentive convolution mechanism is proposed to better capture n-gram features in the convolutional filter space. We also propose a global attention mechanism to obtain the final representation by taking local, global, and position information into consideration. Detailed analyses show that ACT is a lightweight and efficient universal text classifier that achieves consistently good results over different text classification tasks, outperforming CNN-based, RNN-based, and attentive models including Transformer.

Although our proposed ACT is dedicated to text classification tasks where local feature extraction capability is of particular importance, we will explore the potential applications of ACT on other NLP tasks such as machine translation, text summarization, and language modeling in future work. Furthermore, we will apply the idea of the proposed attentive convolution mechanism to fields beyond the NLP domain, such as speech recognition and computer vision.

References

Bilan, I.; and Roth, B. 2018. Position-aware Self-attention with Relative Positional Encodings for Slot Filling. arXiv preprint arXiv:1807.03052.

Conneau, A.; Schwenk, H.; Barrault, L.; and Lecun, Y. 2016. Very deep convolutional networks for natural language processing. arXiv preprint arXiv:1606.01781.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.

Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv preprint arXiv:2005.08100.

Guo, M.; Zhang, Y.; and Liu, T. 2019. Gaussian transformer: a lightweight approach for natural language inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 6489–6496.

Hendrickx, I.; Kim, S. N.; Kozareva, Z.; Nakov, P.; Ó Séaghdha, D.; Padó, S.; Pennacchiotti, M.; Romano, L.; and Szpakowicz, S. 2009. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, 94–99. Association for Computational Linguistics.

Hendrycks, D.; and Gimpel, K. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751.

Le, H. T.; Cerisara, C.; and Denis, A. 2018. Do convolutional networks need to be deep for text classification? In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.

Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; and Graf, H. P. 2017. Pruning Filters for Efficient ConvNets. In 5th International Conference on Learning Representations, ICLR 2017.

Li, P.; and Mao, K. 2019. Knowledge-oriented convolutional neural network for causal relation extraction from natural language texts. Expert Systems with Applications 115: 512–523.

Li, P.; Mao, K.; Yang, X.; and Li, Q. 2019. Improving Relation Extraction with Knowledge-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 229–239.

Li, Q.; Li, P.; Mao, K.; and Lo, E. Y.-M. 2020. Improving convolutional neural network for text classification by recursive data pruning. Neurocomputing 414: 143–152.

Mohamed, A.; Okhonko, D.; and Zettlemoyer, L. 2019. Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660.

Nguyen, T. H.; and Grishman, R. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 39–48.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8): 9.

Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140): 1–67.

Seo, M.; Min, S.; Farhadi, A.; and Hajishirzi, H. 2018. Neural speed reading via Skim-RNN. In International Conference on Learning Representations.

Shen, T.; Zhou, T.; Long, G.; Jiang, J.; and Zhang, C. 2018. Bi-directional block self-attention for fast and memory-efficient sequence modeling. In International Conference on Learning Representations.

Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1): 1929–1958.

Tang, D.; Qin, B.; and Liu, T. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1422–1432.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Wang, G.; Li, C.; Wang, W.; Zhang, Y.; Shen, D.; Zhang, X.; Henao, R.; and Carin, L. 2018. Joint Embedding of Words and Labels for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2321–2331.

Wang, S.; and Manning, C. D. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, 90–94. Association for Computational Linguistics.

Wen, Y.; Zhang, K.; Li, Z.; and Qiao, Y. 2016. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, 499–515. Springer.

Yang, B.; Tu, Z.; Wong, D. F.; Meng, F.; Chao, L. S.; and Zhang, T. 2018. Modeling Localness for Self-Attention Networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4449–4458.

Yang, B.; Wang, L.; Wong, D. F.; Chao, L. S.; and Tu, Z. 2019a. Convolutional Self-Attention Networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4040–4045.

Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; and Le, Q. V. 2019b. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.

Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; and Hovy, E. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480–1489. San Diego, California: Association for Computational Linguistics. doi:10.18653/v1/N16-1174. URL https://www.aclweb.org/anthology/N16-1174.

Yogatama, D.; Dyer, C.; Ling, W.; and Blunsom, P. 2017. Generative and discriminative text classification with recurrent neural networks. In Thirty-fourth International Conference on Machine Learning (ICML 2017). International Machine Learning Society.

Yu, A. W.; Dohan, D.; Luong, M.-T.; Zhao, R.; Chen, K.; Norouzi, M.; and Le, Q. V. 2018. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In International Conference on Learning Representations.

Yu, F.; and Koltun, V. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.

Zhang, H.; Xiao, L.; Wang, Y.; and Jin, Y. 2017a. A generalized recurrent neural architecture for text classification with multi-task learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 3385–3391. AAAI Press.

Zhang, J.; Luan, H.; Sun, M.; Zhai, F.; Xu, J.; Zhang, M.; and Liu, Y. 2018. Improving the Transformer Translation Model with Document-Level Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 533–542.

Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, 649–657.

Zhang, Y.; Qi, P.; and Manning, C. D. 2018. Graph Convolution over Pruned Dependency Trees Improves Relation Extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2205–2215.

Zhang, Y.; Zhong, V.; Chen, D.; Angeli, G.; and Manning, C. D. 2017b. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 35–45.

Zhong, P.; Wang, D.; and Miao, C. 2019. Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 165–176.

Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; and Xu, B. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, 207–212.
