Densely Connected Bidirectional LSTM With Applications To Sentence Classification
Zixiang Ding1, Rui Xia1∗, Jianfei Yu2, Xiang Li1, Jian Yang1
1 School of Computer Science and Engineering, Nanjing University of Science and Technology
2 School of Information Systems, Singapore Management University
{dingzixiang, rxia}@njust.edu.cn, [email protected], {xiang.li.implus, csjyang}@njust.edu.cn
Figure 1: The architecture of DC-Bi-LSTM. We obtain the first-layer reading memory from the original input sequence, the second-layer reading memory from the position-aligned concatenation of the original input sequence and the first-layer reading memory, and so on. Finally, the n-th-layer reading memory is taken as the final feature representation for classification.
…tions, the TREC dataset [Li and Roth, 2002] for question type classification, and the subjectivity classification dataset [Pang and Lee, 2004]. DC-Bi-LSTM with depth up to 20 can be successfully trained and significantly outperforms the traditional Bi-LSTM with the same or even fewer parameters. Moreover, our model achieves performance indistinguishable from the state-of-the-art approaches.

The main contributions of our work are summarized as follows:

• We propose a novel deep RNN architecture called DC-Bi-LSTM. Compared with conventional deep stacked RNNs, DC-Bi-LSTM alleviates the vanishing-gradient and overfitting problems and can be successfully trained when the networks are as deep as dozens of layers.

• We conducted experiments on five sentence classification datasets; our model obtains significant improvements over the traditional Bi-LSTM and achieves promising performance in comparison with the state-of-the-art approaches.

2 Related Work

2.1 Sentence Classification
The challenge in sentence classification is to perform composition over variable-length sentences and capture features that are useful for classification.

Traditional methods are commonly based on the bag-of-words (BoW) model, which treats a sentence as an unordered collection of words and therefore fails to capture syntactic structure and contextual information. Recently, deep learning has developed rapidly in natural language processing; the first breakthrough was learned word embeddings [Bengio et al., 2003; Mikolov et al., 2013]. With the help of word embeddings, several composition-based methods have been proposed. For example, Recursive Neural Networks [Socher et al., 2013; Irsoy and Cardie, 2014] build representations of phrases and sentences by combining neighboring constituents according to the parse tree. Convolutional Neural Networks [Kim, 2014] apply convolutional filters to extract local features over word embedding matrices. RNNs with Long Short-Term Memory units [Mikolov, 2012; Chung et al., 2014; Tai et al., 2015] are effective networks for processing sequential data: they analyze a text word by word and store the semantics of all the previous text in a fixed-size hidden state. In this way, LSTM can better capture the contextual information and semantics of long texts. Moreover, bidirectional RNNs [Schuster and Paliwal, 1997] process the sequence both forward and backward, so a better semantic representation can usually be obtained than with unidirectional RNNs.

2.2 Stacked RNNs and Extensions
Schmidhuber [1992] and El Hihi and Bengio [1996] introduced stacked RNNs, which stack RNN layers on top of each other: the hidden states of the RNN below are taken as the inputs to the RNN above. However, deep stacked RNNs are very hard to train because of the feed-forward structure of the stacked layers. Below are some extensions that alleviate this problem.

Skip connections (or shortcut connections) enable unimpeded information flow by adding direct connections across different layers, thereby alleviating the gradient problems. For example, Raiko et al. (2012), Graves (2013), Hermans and Schrauwen (2013), Wu et al. (2016), and Yu et al. (2017) introduced skip connections into stacked RNNs, making it easier to build deeper stacked RNNs. In addition to using skip connections, Yao et al. (2015) and Zhang et al. (2016) proposed highway LSTMs, which extend stacked RNNs by introducing gated direct connections between the memory cells in adjacent layers.

3 Model
In this section, we describe the architecture of our proposed Densely Connected Bidirectional LSTM (DC-Bi-LSTM) model for sentence classification.

3.1 Long Short-Term Memory
Given an arbitrary-length input sentence $S = \{w_1, w_2, \ldots, w_s\}$, Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] computes the hidden states $h = \{h_1, h_2, \ldots, h_s\}$ by iterating the following equation:

$$h_t = \mathrm{lstm}(h_{t-1}, e(w_t)). \tag{1}$$

The detailed computation is as follows:

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T_{m+d,4d}\bigl([e(w_t); h_{t-1}]\bigr), \tag{2}$$

$$c_t = f \odot c_{t-1} + i \odot g, \tag{3}$$

$$h_t = o \odot \tanh(c_t), \tag{4}$$

where $e(w_t) \in \mathbb{R}^m$ is the word embedding of $w_t$, $h_{t-1} \in \mathbb{R}^d$ is the hidden state of the LSTM at time step $t-1$, and $[e(w_t); h_{t-1}] \in \mathbb{R}^{m+d}$ is the concatenation of the two vectors. $T_{m+d,4d}: \mathbb{R}^{m+d} \rightarrow \mathbb{R}^{4d}$ is an affine transform ($Wx + b$ for some $W$ and $b$), and sigm and tanh are the sigmoid and hyperbolic tangent activation functions, respectively. $i, f, o, g \in \mathbb{R}^d$ are the input gate, forget gate, output gate and new candidate memory state, respectively. $c_t \in \mathbb{R}^d$ is an additional memory cell used for capturing long-distance dependencies, and $\odot$ denotes element-wise multiplication.
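To make Eqs. (1)–(4) concrete, here is a minimal NumPy sketch of a single LSTM step. It assumes the affine transform $T_{m+d,4d}$ is stored as one weight matrix `W` of shape (m+d, 4d) plus a bias `b`; the function and variable names are ours, not from the paper.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(e_wt, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (1)-(4).

    e_wt:   word embedding, shape (m,)
    h_prev: previous hidden state, shape (d,)
    c_prev: previous memory cell, shape (d,)
    W, b:   the affine map T_{m+d,4d}: W has shape (m+d, 4d), b has shape (4d,)
    """
    d = h_prev.shape[0]
    z = np.concatenate([e_wt, h_prev]) @ W + b            # T_{m+d,4d}([e(w_t); h_{t-1}])
    i, f, o = sigm(z[:d]), sigm(z[d:2*d]), sigm(z[2*d:3*d])
    g = np.tanh(z[3*d:])                                  # gates and candidate memory, Eq. (2)
    c = f * c_prev + i * g                                # Eq. (3)
    h = o * np.tanh(c)                                    # Eq. (4)
    return h, c

# toy usage: m = 300-dim embeddings, d = 100 hidden units, a 5-word sentence
m, d = 300, 100
rng = np.random.default_rng(0)
W, b = rng.normal(scale=0.01, size=(m + d, 4 * d)), np.zeros(4 * d)
h, c = np.zeros(d), np.zeros(d)
for e_wt in rng.normal(size=(5, m)):
    h, c = lstm_step(e_wt, h, c, W, b)
```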
Figure 2: Illustration of (a) Deep Stacked Bi-LSTM and (b) DC-Bi-LSTM. Each black node denotes an input layer. Purple, green, and yellow nodes denote hidden layers. Orange nodes denote average pooling of the forward or backward hidden layers. Each red node denotes a class. An ellipse represents the concatenation of its internal nodes, solid lines denote connections between two layers, and dotted lines indicate the copy operation.
3.2 Deep Stacked Bi-LSTM
Bidirectional LSTM (Bi-LSTM) [Graves et al., 2013] uses two LSTMs to process a sequence in two directions, forward and backward, so that the forward and backward contexts can be considered simultaneously. The Bi-LSTM computation can be formulated as follows:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{lstm}}(\overrightarrow{h}_{t-1}, e(w_t)), \tag{5}$$

$$\overleftarrow{h}_t = \overleftarrow{\mathrm{lstm}}(\overleftarrow{h}_{t+1}, e(w_t)). \tag{6}$$

The concatenation of the forward and backward hidden states is then taken as the representation of each word: for word $w_t$, the representation is $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.

As shown in Figure 2(a), deep stacked Bi-LSTM [Schmidhuber, 1992; El Hihi and Bengio, 1996] stacks multiple Bi-LSTMs with different parameters. The hidden state of the $l$-th Bi-LSTM layer at time step $t$ can be represented as $h^l_t$, which is the concatenation of the forward hidden state $\overrightarrow{h}^l_t$ and the backward hidden state $\overleftarrow{h}^l_t$. The calculation of $h^l_t$ is as follows:

$$h^l_t = [\overrightarrow{h}^l_t; \overleftarrow{h}^l_t], \quad \text{in particular } h^0_t = e(w_t), \tag{7}$$

$$\overrightarrow{h}^l_t = \overrightarrow{\mathrm{lstm}}(\overrightarrow{h}^l_{t-1}, h^{l-1}_t), \tag{8}$$

$$\overleftarrow{h}^l_t = \overleftarrow{\mathrm{lstm}}(\overleftarrow{h}^l_{t+1}, h^{l-1}_t). \tag{9}$$
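Reusing `lstm_step` from the sketch above, Eqs. (5)–(9) amount to running one LSTM left-to-right and another right-to-left, concatenating the two hidden states per time step, and, for the stacked variant of Figure 2(a), feeding layer $l-1$'s outputs to layer $l$. This is a rough sketch with our own helper names; each layer's weight matrix must be sized for its input width (m for the first layer, 2d afterwards).

```python
def bi_lstm(inputs, params_fw, params_bw, d):
    """Eqs. (5)-(9): h_t = [h_fw_t ; h_bw_t] over one input sequence."""
    s = len(inputs)
    h_fw, c = np.zeros(d), np.zeros(d)
    fw = []
    for t in range(s):                          # forward direction, Eqs. (5)/(8)
        h_fw, c = lstm_step(inputs[t], h_fw, c, *params_fw)
        fw.append(h_fw)
    h_bw, c = np.zeros(d), np.zeros(d)
    bw = [None] * s
    for t in reversed(range(s)):                # backward direction, Eqs. (6)/(9)
        h_bw, c = lstm_step(inputs[t], h_bw, c, *params_bw)
        bw[t] = h_bw
    return [np.concatenate([fw[t], bw[t]]) for t in range(s)]   # Eq. (7)

def stacked_bi_lstm(word_vecs, layer_params, d):
    """Conventional stacking (Figure 2(a)): layer l reads only h^{l-1}."""
    h = word_vecs
    for params_fw, params_bw in layer_params:   # each entry: ((W_fw, b_fw), (W_bw, b_bw))
        h = bi_lstm(h, params_fw, params_bw, d)
    return h
```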
3.3 Densely Connected Bi-LSTM
As shown in Figure 2(b), Densely Connected Bi-LSTM (DC-Bi-LSTM) consists of four modules: network inputs, dense Bi-LSTM, average pooling, and a soft-max layer.

(1) Network Inputs
The input of our model is a variable-length sentence, which can be represented as $S = \{w_1, w_2, \ldots, w_s\}$. As in other deep learning models, each word is represented as a dense vector extracted from a word embedding matrix. The resulting sequence of word vectors $\{e(w_1), e(w_2), \ldots, e(w_s)\}$ is fed to the dense Bi-LSTM module as input.

(2) Dense Bi-LSTM
This module consists of multiple Bi-LSTM layers. For the first Bi-LSTM layer, the input is the word vector sequence $\{e(w_1), e(w_2), \ldots, e(w_s)\}$ and the output is $h^1 = \{h^1_1, h^1_2, \ldots, h^1_s\}$, in which $h^1_t = [\overrightarrow{h}^1_t; \overleftarrow{h}^1_t]$ as described in Section 3.2.
For the second Bi-LSTM layer, the input is not the sequence $\{h^1_1, h^1_2, \ldots, h^1_s\}$ (as stacked RNNs use), but the concatenation of all previous outputs, formulated as $\{[e(w_1); h^1_1], [e(w_2); h^1_2], \ldots, [e(w_s); h^1_s]\}$, and the output is $h^2 = \{h^2_1, h^2_2, \ldots, h^2_s\}$. The third layer takes $\{[e(w_1); h^1_1; h^2_1], [e(w_2); h^1_2; h^2_2], \ldots, [e(w_s); h^1_s; h^2_s]\}$ as input, just as the second layer does. The remaining layers proceed in the same way and are omitted for brevity. The above process is formulated as follows:

$$h^l_t = [\overrightarrow{h}^l_t; \overleftarrow{h}^l_t], \quad \text{in particular } h^0_t = e(w_t), \tag{10}$$

$$\overrightarrow{h}^l_t = \overrightarrow{\mathrm{lstm}}(\overrightarrow{h}^l_{t-1}, M^{l-1}_t), \tag{11}$$

$$\overleftarrow{h}^l_t = \overleftarrow{\mathrm{lstm}}(\overleftarrow{h}^l_{t+1}, M^{l-1}_t), \tag{12}$$

$$M^{l-1}_t = [h^0_t; h^1_t; \ldots; h^{l-1}_t]. \tag{13}$$
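Relative to the stacked sketch above, the only change required for Eqs. (10)–(13) is the per-layer input: layer $l$ reads $M^{l-1}_t = [h^0_t; h^1_t; \ldots; h^{l-1}_t]$, i.e. the word vector concatenated with every preceding layer's output. A hedged sketch, again with our own helper names:

```python
def dense_bi_lstm(word_vecs, layer_params, d):
    """Dense Bi-LSTM module (Figure 2(b)): layer l reads [h^0; h^1; ...; h^{l-1}]."""
    memories = [word_vecs]                      # h^0_t = e(w_t), Eq. (10)
    for params_fw, params_bw in layer_params:
        # M^{l-1}_t = [h^0_t; h^1_t; ...; h^{l-1}_t], Eq. (13)
        inputs = [np.concatenate([mem[t] for mem in memories])
                  for t in range(len(word_vecs))]
        memories.append(bi_lstm(inputs, params_fw, params_bw, d))   # Eqs. (11)-(12)
    return memories[-1]                         # reading memory of the last layer, h^L
```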
(3) Average Pooling
For an $L$-layer dense Bi-LSTM, the output is $h^L = \{h^L_1, h^L_2, \ldots, h^L_s\}$. The average pooling module reads in $h^L$ and computes the average of these vectors: $h^* = \mathrm{average}(h^L_1, h^L_2, \ldots, h^L_s)$.

(4) Soft-max Layer
This module is a simple soft-max classifier that takes $h^*$ as its features and generates the predicted probability distribution over all sentence labels.
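The last two modules are essentially one line each; the soft-max weights `W_c` and `b_c` below are our own placeholder parameters, not part of the paper's notation.

```python
def classify(word_vecs, layer_params, d, W_c, b_c):
    """Average pooling over h^L followed by a soft-max classifier."""
    h_last = dense_bi_lstm(word_vecs, layer_params, d)   # h^L = {h^L_1, ..., h^L_s}
    h_star = np.mean(h_last, axis=0)                     # h* = average(h^L_1, ..., h^L_s)
    logits = h_star @ W_c + b_c
    probs = np.exp(logits - logits.max())                # numerically stable soft-max
    return probs / probs.sum()                           # distribution over sentence labels
```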
3.4 Comparison with Deep Stacked Bi-LSTM
As shown in Figure 2, the two models have the same network inputs, average pooling and soft-max layers, but differ in the dense Bi-LSTM module. For the $k$-th Bi-LSTM layer, the input of deep stacked Bi-LSTM is $\{h^{k-1}_1, h^{k-1}_2, \ldots, h^{k-1}_s\}$, while for densely connected Bi-LSTM the input is $\{[e(w_1); h^1_1; h^2_1; \ldots; h^{k-1}_1], [e(w_2); h^1_2; h^2_2; \ldots; h^{k-1}_2], \ldots, [e(w_s); h^1_s; h^2_s; \ldots; h^{k-1}_s]\}$. Thanks to this densely connected structure, DC-Bi-LSTM has several advantages:

• It is easy to train even when the network is very deep. The reason is that, for every RNN layer, the output is sent directly to the last RNN layer as input, which leads to an implicit deep supervision and alleviates the vanishing-gradient problem.

• It has better parameter efficiency; that is, DC-Bi-LSTM obtains better performance with an equal or smaller number of parameters than traditional RNNs or deep stacked RNNs. The reason is that every RNN layer can directly read the original input sequence, so it does not carry the burden of passing on all useful information and only adds information to the network. Therefore, DC-Bi-LSTM layers can be very narrow (for example, 10 hidden units per layer).

It is worth noting that deep residual Bi-LSTM [Yu et al., 2017] looks very similar to our model; however, the two are essentially different. For each layer, deep residual Bi-LSTM uses point-wise summation to merge the input into the output, which may impede the information flow in the network. In contrast, our model merges the input into the output by concatenation, which further improves the information flow, as sketched below.
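The contrast can be written in two lines (toy vectors, our own names): a residual layer must keep its input and output the same width and mixes them by addition, whereas dense connectivity keeps both signals verbatim at the cost of a growing width.

```python
import numpy as np

x = np.ones(4)                          # a layer's input
fx = np.arange(4.0)                     # what the layer computes from it

residual_merge = x + fx                 # residual Bi-LSTM: point-wise summation mixes the signals
dense_merge = np.concatenate([x, fx])   # DC-Bi-LSTM: concatenation preserves both signals intact
```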
3.5 Potential Application Scenario
From a semantic perspective, the dense Bi-LSTM module adds multi-read context information for each word to its original word vector by concatenation: $h^1$ is the first reading memory based on the input sentence $S$, $h^2$ is the second reading memory based on $S$ and $h^1$, and $h^k$ is the $k$-th reading memory based on $S$ and all previous reading memories. Since the word vector of each word is completely preserved, this module is harmless and can easily be added to other models that use RNNs. For example, in machine translation and dialog systems, the Bi-LSTM encoder can be replaced by the dense Bi-LSTM module and may bring improvements.

4 Experiments

4.1 Dataset
DC-Bi-LSTM is evaluated on several benchmark datasets, whose summary statistics are shown in Table 2.

Table 2: Summary statistics of the benchmark datasets. c: number of target classes. l: average sentence length. train/dev/test: size of the train/development/test set; CV in the test column means 10-fold cross-validation.

Data    c   l    train   dev    test
MR      2   20   10662   -      CV
SST-1   5   18   8544    1101   2210
SST-2   2   19   6920    872    1821
Subj    2   23   10000   -      CV
TREC    6   10   5452    -      500

• MR: Movie Review Data is a popular sentiment classification dataset proposed by [Pang and Lee, 2005]. Each review contains only one sentence and belongs to the positive or negative sentiment class.

• SST-1: The Stanford Sentiment Treebank is an extension of MR [Socher et al., 2013]. Each review has a fine-grained label (very positive, positive, neutral, negative, very negative), and phrase-level annotations on all inner nodes are provided.

• SST-2: The same dataset as SST-1, but used in binary mode with neutral sentences removed.

• Subj: The subjectivity dataset is from [Pang and Lee, 2004]; the task is to classify a sentence as subjective or objective.

• TREC: TREC is a dataset for the question type classification task [Li and Roth, 2002]. The sentences are questions from 6 classes (person, location, numeric information, etc.).
4.2 Implementation Details
In the experiments, we use the publicly available 300-dimensional GloVe vectors trained on 42 billion words; the words are case-insensitive. Words not present in the set of pre-trained words are simply discarded.

For the model details, the number of hidden units of the top Bi-LSTM (the last Bi-LSTM layer in the dense Bi-LSTM module) is 100; for the remaining layers of the dense Bi-LSTM module, the number of hidden units and the number of layers are 13 and 15, respectively.

For the training details, we use the stochastic gradient descent (SGD) algorithm with the Adam update rule and shuffled mini-batches. The batch size and learning rate are set to 200 and 0.005, respectively. For regularization, dropout is applied to the word embeddings and to the output of average pooling, and we impose L2 constraints on the soft-max parameters.
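For convenience, the settings listed above are collected in one place below. This is our own summary in dictionary form, not a configuration file released with the paper.

```python
config = {
    "word_vectors": "GloVe, 300-d, trained on 42B tokens; OOV words discarded",
    "top_bi_lstm_hidden_units": 100,    # last Bi-LSTM layer of the dense module
    "dense_layer_hidden_units": 13,     # every other layer of the dense module
    "dense_layers": 15,
    "optimizer": "SGD with the Adam update rule, shuffled mini-batches",
    "batch_size": 200,
    "learning_rate": 0.005,
    "dropout_on": ["word embeddings", "average-pooling output"],
    "l2_constraint_on": "soft-max parameters",
}
```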
4.3 Results
The results of DC-Bi-LSTM and other state-of-the-art models on the five benchmark datasets are listed in Table 1. Performance is measured in accuracy. We can see that DC-Bi-LSTM obtains consistently better results than the other methods; specifically, DC-Bi-LSTM achieves new state-of-the-art results on three datasets (MR, SST-2 and Subj) and slightly lower accuracy than BLSTM-2DCNN on TREC and SST-1.

Table 1: Classification accuracy of DC-Bi-LSTM against other state-of-the-art models. The best result on each dataset is highlighted in bold. There are five blocks: i) traditional machine learning methods; ii) Recursive Neural Network models; iii) Recurrent Neural Network models; iv) Convolutional Neural Network models; v) a collection of other models. SVM: Support Vector Machine with unigram features [Socher et al., 2013]. NB: Naive Bayes with unigram features [Socher et al., 2013]. Standard-RNN: Standard Recursive Neural Network [Socher et al., 2013]. RNTN: Recursive Neural Tensor Network [Socher et al., 2013]. DRNN: Deep Recursive Neural Network [Irsoy and Cardie, 2014]. LSTM: Standard Long Short-Term Memory Network [Tai et al., 2015]. Bi-LSTM: Bidirectional LSTM [Tai et al., 2015]. Tree-LSTM: Tree-Structured LSTM [Tai et al., 2015]. LR-Bi-LSTM: Bidirectional LSTM with linguistic regularization [Qian et al., 2016]. CNN-MC: Convolutional Neural Network with two channels [Kim, 2014]. DCNN: Dynamic Convolutional Neural Network with k-max pooling [Kalchbrenner et al., 2014]. MVCNN: Multi-channel Variable-Size Convolutional Neural Network [Yin and Schütze, 2016]. DSCNN: Dependency-Sensitive Convolutional Neural Network, which uses a CNN to obtain the sentence representation from the context representations produced by an LSTM [Zhang et al., 2016a]. BLSTM-2DCNN: Bidirectional LSTM with two-dimensional max pooling [Zhou et al., 2016].

In addition, we have the following observations:

• Although DC-Bi-LSTM is a simple sequence model, it beats the Recursive Neural Network models and Tree-LSTM, which rely on parsers to build tree-structured neural models.

• DC-Bi-LSTM obtains significant improvements over its counterpart (Bi-LSTM) and over the variant LR-Bi-LSTM, which uses linguistic resources.

• DC-Bi-LSTM beats all CNN models on all datasets.

The above observations demonstrate that DC-Bi-LSTM is quite effective compared with other models.

4.4 Discussions
Moreover, we conducted some further experiments to explore DC-Bi-LSTM. For simplicity, we denote the number of hidden units of the top Bi-LSTM (the last Bi-LSTM layer in the dense Bi-LSTM module) as th; for the remaining layers of the dense Bi-LSTM module, the number of hidden units and the number of layers are denoted as dh and dl, respectively. We tried several variants of DC-Bi-LSTM with different dh, dl and th; the results are reported below.

(1) Better parameter efficiency
Better parameter efficiency means obtaining better performance with an equal or smaller number of parameters. To verify that DC-Bi-LSTM has better parameter efficiency than Bi-LSTM, we limit the number of parameters of all models to 1.44 million (1.44M) and conduct experiments on SST-1 and SST-2. The results are shown in Table 3, and the parameter budget can be reproduced with the sketch below.
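The parameter budgets in Table 3 can be reproduced by counting only the recurrent weights and biases (word embeddings and the soft-max layer excluded), assuming 300-dimensional input word vectors and that layer $l$ of the dense module reads a vector of width $300 + 2\,dh\,(l-1)$. This is our own back-of-the-envelope check, not code from the paper.

```python
def bi_lstm_params(n_in, d):
    """One Bi-LSTM layer: two LSTMs, each with a (n_in + d) x 4d weight matrix and a 4d bias."""
    return 2 * 4 * ((n_in + d) * d + d)

def dc_bi_lstm_params(dl, dh, th, emb=300):
    total = sum(bi_lstm_params(emb + 2 * dh * l, dh) for l in range(dl))  # dense layers
    total += bi_lstm_params(emb + 2 * dh * dl, th)                        # top Bi-LSTM layer
    return total

print(dc_bi_lstm_params(dl=0, dh=10, th=300))   # 1442400 ~ 1.44M, the Bi-LSTM baseline
print(dc_bi_lstm_params(dl=15, dh=13, th=100))  # 1406560 ~ 1.40M, the best model in Table 3
print(dc_bi_lstm_params(dl=20, dh=10, th=100))  # 1442400 ~ 1.44M, the 20-layer model
```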
Table 3: Classification accuracy of DC-Bi-LSTM with different hyper-parameters. We limit the parameters of all models to 1.44M in order to verify that DC-Bi-LSTM models have better parameter efficiency than Bi-LSTM.

dl   dh   th    Params   SST-1   SST-2
0    10   300   1.44M    49.2    87.2
5    40   100   1.44M    49.6    88.4
10   20   100   1.44M    51.0    88.5
15   13   100   1.40M    51.9    89.7
20   10   100   1.44M    50.2    88.8

The first model in Table 3 is in fact a Bi-LSTM with 300 hidden units; it is used as the baseline model, and its results are consistent with the paper [Tai et al., 2015]. Based on the results in Table 3, we draw the following conclusions:

• DC-Bi-LSTM improves parameter efficiency. Comparing the second to the fifth models with the baseline model, the improvements on SST-1 (SST-2) are 0.4% (1.2%), 1.8% (1.3%), 2.7% (2.5%) and 1.0% (1.6%), respectively, without increasing the number of parameters, which demonstrates that DC-Bi-LSTM models have better parameter efficiency than the baseline model.

• DC-Bi-LSTM models are easy to train even when they are very deep. DC-Bi-LSTM with a depth of 20 (the fifth model in Table 3) can be successfully trained and obtains better results than the baseline model (50.2% vs. 49.2% on SST-1, 88.8% vs. 87.2% on SST-2). In contrast, when we trained a deep stacked LSTM on SST-1, the performance decreased drastically once the depth exceeded five (for example, 30% at a depth of 8, a drop of 19.2% compared with the baseline model).

• The fifth model performs worse than the fourth model, which indicates that too many layers bring side effects when the number of parameters is limited. One possible reason is that more layers lead to fewer hidden units (to keep the number of parameters constant), impairing the ability of each Bi-LSTM layer to capture contextual information.

(2) Effects of increasing depth (dl)
In order to verify that increasing dl does improve the performance of DC-Bi-LSTM models, we increase dl gradually and fix dh at 10. The results on SST-1 and SST-2 are shown in Table 4.

Table 4: Classification accuracy of DC-Bi-LSTM with different hyper-parameters. We increase dl gradually and fix dh at 10 in order to verify that increasing dl does improve the performance of DC-Bi-LSTM models.

dl   dh   th    Params   SST-1   SST-2
0    10   100   0.32M    48.5    87.5
5    10   100   0.54M    49.4    88.1
10   10   100   0.80M    49.5    88.4
15   10   100   1.10M    50.6    88.8
20   10   100   1.44M    50.2    88.8

The first model in Table 4 is in fact a Bi-LSTM with 100 hidden units, which is used as the baseline model. Based on the results in Table 4, we draw the following conclusions:

• The performance of DC-Bi-LSTM is clearly positively related to dl. Compared with the baseline model, DC-Bi-LSTM with dl equal to 5, 10, 15 and 20 obtains improvements on SST-1 (SST-2) of 0.9% (0.6%), 1.0% (0.9%), 2.1% (1.3%) and 1.7% (1.3%), respectively.

• Among all the models, the one with dl equal to 15 works best. As dl continues to increase, the accuracy does not improve further; nevertheless, there is no significant decrease either.

• Compared with the first model in Table 3, the fourth model here uses fewer parameters (1.10M vs. 1.44M) but performs much better (50.6% vs. 49.2% on SST-1, 88.8% vs. 87.2% on SST-2), which further proves that DC-Bi-LSTM models have better parameter efficiency.

(3) Effects of adding hidden units (dh)
In this part, we explore the effect of dh. The number of layers in the dense Bi-LSTM module (dl) is fixed at 10 while the number of hidden units (dh) is gradually increased. The results on SST-1 and SST-2 are shown in Table 5.

Table 5: Classification accuracy of DC-Bi-LSTM with different hyper-parameters. We increase dh gradually and fix dl at 10 in order to explore the effect of dh.

dl   dh   th    Params   SST-1   SST-2
10   0    100   0.32M    48.5    87.5
10   5    100   0.54M    49.2    88.3
10   10   100   0.80M    49.5    88.4
10   15   100   1.10M    50.2    88.4
10   20   100   1.44M    51.0    88.5

Similarly, we use a Bi-LSTM with 100 hidden units as the baseline model (the first model in Table 5). Based on the results in Table 5, we draw the following conclusions:

• Comparing the first two models, we find that the second model outperforms the baseline by 0.7% on SST-1 and 0.8% on SST-2, which shows that DC-Bi-LSTM is still effective even when dh is as small as 5.

• As dh increases, the performance of DC-Bi-LSTM steadily increases. One possible reason is that the ability of each layer to capture contextual information is enhanced, which eventually leads to the improvement in classification accuracy.

5 Conclusion & Future Work
In this work, we propose a novel multi-layer RNN model called Densely Connected Bidirectional LSTM (DC-Bi-LSTM) for sentence classification tasks. DC-Bi-LSTM alleviates the vanishing-gradient and overfitting problems and can be successfully trained when the networks are as deep as dozens of layers. We evaluate the proposed model on five benchmark sentence classification datasets; the experiments show that it obtains significant improvements over the traditional Bi-LSTM and achieves promising performance in comparison with the state-of-the-art approaches. As future work, we plan to apply DC-Bi-LSTM to machine translation and dialog systems to further improve their performance, for example by replacing the Bi-LSTM encoder with the dense Bi-LSTM module.
References

[Bengio et al., 2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

[Chung et al., 2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[El Hihi and Bengio, 1996] Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In NIPS, 1996.

[Graves et al., 2013] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[Huang et al., 2017] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In CVPR, 2017.

[Irsoy and Cardie, 2014] Ozan Irsoy and Claire Cardie. Deep recursive neural networks for compositionality in language. In NIPS, 2014.

[Kalchbrenner et al., 2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.

[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[Li and Roth, 2002] Xin Li and Dan Roth. Learning question classifiers. In COLING, 2002.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.

[Mikolov, 2012] Tomáš Mikolov. Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April, 2012.

[Pang and Lee, 2004] Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL, 2004.

[Pang and Lee, 2005] Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, 2005.

[Qian et al., 2016] Qiao Qian, Minlie Huang, Jinhao Lei, and Xiaoyan Zhu. Linguistically regularized LSTMs for sentiment classification. arXiv preprint arXiv:1611.03949, 2016.

[Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[Schmidhuber, 1992] Jürgen Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.

[Schuster and Paliwal, 1997] Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

[Socher et al., 2013] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.

[Srivastava et al., 2015] Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In NIPS, 2015.

[Szegedy et al., 2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.

[Szegedy et al., 2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.

[Tai et al., 2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

[Yin and Schütze, 2016] Wenpeng Yin and Hinrich Schütze. Multichannel variable-size convolution for sentence classification. arXiv preprint arXiv:1603.04513, 2016.

[Yu et al., 2017] Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. Improved neural relation detection for knowledge base question answering. arXiv preprint arXiv:1704.06194, 2017.

[Zhang et al., 2016a] Rui Zhang, Honglak Lee, and Dragomir Radev. Dependency sensitive convolutional neural networks for modeling sentences and documents. arXiv preprint arXiv:1611.02361, 2016.

[Zhang et al., 2016b] Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. Highway long short-term memory RNNs for distant speech recognition. In ICASSP, 2016.

[Zhou et al., 2016] Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639, 2016.