Sancheng Peng, Lihong Cao, Yongmei Zhou, Zhouhao Ouyang, Aimin Yang,
Xinguang Li, Weijia Jia, Shui Yu
Please cite this article as: S. Peng, L. Cao, Y. Zhou, Z. Ouyang, A. Yang, X. Li, W. Jia, S. Yu, A survey
on deep learning for textual emotion analysis in social networks, Digital Communications and Networks,
Digital Communications and Networks(DCN)
Sancheng Peng a, Lihong Cao b,∗, Yongmei Zhou c, Zhouhao Ouyang d, Aimin Yang e,
Xinguang Li a, Weijia Jia f, Shui Yu g
a Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou, 510006, China.
b School of English Education, Guangdong University of Foreign Studies, Guangzhou, 510006, China
c School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, 510006, China
d School of Computing, University of Leeds, Woodhouse Lane, Leeds, West Yorkshire, LS2 9JT, United Kingdom.
e School of Computer, Guangdong University of Technology, Guangzhou, 510006, China
f BNU-UIC Institute of Artificial Intelligence and Future Networks, Beijing Normal University (BNU Zhuhai), Zhuhai, 519087, China
g School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, 510006, China
Textual Emotion Analysis (TEA) aims to extract and analyze user emotional states in texts. There has been rapid development
of various Deep Learning (DL) methods that have proven successful in many domains such as audio, image, and natural language
processing. This trend has drawn increasing numbers of researchers away from traditional machine learning to DL for their scientific
research. In this paper, we provide an overview on TEA based on DL methods. After introducing a background for emotion analysis that
includes defining emotion, emotion classification methods, and application domains of emotion analysis, we summarize DL technology,
and the word/sentence representation learning method. We then categorize existing TEA methods based on text structures and linguistic
types: text-oriented monolingual methods, text conversations-oriented monolingual methods, text-oriented cross-linguistic methods,
and emoji-oriented cross-linguistic methods. We close by discussing emotion analysis challenges and future research trends. We hope
that our survey will assist interested readers in understanding the relationship between TEA and DL methods while also improving
TEA development.
1. Introduction
Textual Emotion Analysis (TEA) is the task of extracting and analyzing user emotional states in texts. TEA not only acts as a standalone tool for information extraction but also plays an important role for various Natural Language Processing (NLP) applications, including e-commerce [1], public opinion analysis [2], big search [3], information prediction [4], personalized recommendation [5], healthcare [6], and online teaching [7].
The idiom’s “seven emotions and six desires” are joy, love, anger, sadness, fear, evil, and desire. Among them, only a few are positive emotions and the rest are negative, indicating that people are naturally more sensitive to negative emotions. During real-world life, negative emotions are also easier to propagate than positive emotions. The 2011 Annual Report of China’s Internet Public Opinion Index [8, 9] stated “In 2011, there are more than 80% negative events for the total number of topics. The negative events on the Microblog and Tianya forum accounted for 75.6% and 95.8%, respectively, which are higher than
∗ Corresponding author.
(S. Peng), [email protected] (L. Cao), yong-[email protected] (Y. Zhou), [email protected] (Z. Ouyang), [email protected] (A. Yang), [email protected] (X. Li), ji-[email protected] (W. Jia), [email protected] (S. Yu)
With the rapid development of social networks [10, 11, 12, 13], people have changed from general users to network information producers. According to the 38th
S. Peng, et al.
statistical report on the development of China’s Internet published by China Internet Information Center [14], the number of Internet users in China has reached 710 million, and the Internet penetration rate reached 51.7% in June 2016. Among them, 656 million were mobile Internet users, 242 million used Microblog, and more than 100 million had daily blogs. Among this massive number of short text messaging, negative emotions are the most
Emotion analysis [15] aims to automatically extract user emotional states from their social network text activity (e.g., blogs, tweets). Early research focused on either a positive/negative bipartition or a positive/negative/neutral tripartition of emotion analysis [16, 17]. However, such partitioning ignores subtle user emotion changes and their psychological states, preventing a full expression of people’s complex inner emotional world. This gave rise to bipartition-oriented emotion analysis being named “sentiment analysis” [18, 19], while more encompassing emotion analysis was dubbed “fine-grained sentiment analy-
In 2012, Deep Learning (DL) methods [20, 21] were introduced to NLP after they achieved successful object recognition via ImageNet [22]. DL methods improved on statistical learning results in many fields. At present, a neural network-based NLP framework has achieved new levels of quality and become the dominating technology for NLP tasks, such as sentiment analysis, machine translation, and question answering systems.
Popular DL methods are used to model emotion analysis, including Deep Averaging Networks (DANs) [23], Denoising Autoencoders (DAEs) [24], Convolutional Neural Networks (CNNs) [25], Recurrent Neural Net-
2. We make the first attempt to provide a comprehensive review on the related TEA methods.
3. We provide a detailed overview of different definitions and classification models of emotion.
4. We provide a detailed overview on the related TEA applications.
5. We provide a detailed description and comparison analysis of related pre-training TEA methods.
6. We provide a detailed description and comparison analysis of existing DL-based TEA methods.
The remainder of this paper is organized as follows: In Section 2, we provide an overview of the term “emotion”; we then survey emotion analysis applications in Section 3. In Section 4, we provide an overview of DL methodology, before surveying pre-training methods in Section 5. In Section 6, we discuss emotion analysis methods based on DL, and then present emotion analysis challenges in Section 7. In Section 8, we discuss the future trends, before concluding this paper in Section 9.
2. Emotion Overview
Due to the variability and sensitivity of human emotions, people have different understandings of emotion, causing the term to be classified differently by different fields. At present, there is no unified standard for defining and classifying emotion academically; however, researchers have performed in-depth studies of emotion classification, presenting multiple definition and classification models.
2.1. Defining Emotion
Definition 6: Emotion [38] is often defined as an individual’s mental state associated with thoughts, feelings, and behavior.
and fear; trust and disgust; and surprise and anticipation.
Lin Chuanding [42], a Chinese modern psychologist, divided emotions into 18 Shuowen-based categories: joy, quiet, caress, worry, fright, pity, fear, grief, shame, sorrow, anger, vexation, reverence, hatred, arrogance, greed, jealousy, and shame.
These existing emotional category approaches focus on modeling emotions based on distinct emotion classes or labels. These models assume discrete emotion categories exist.
3. Emotion Analysis Application
Emotion analysis has been widely studied in psychology, neuroscience, and behavioral science, as emotions
ing e-commerce, public opinion analysis, big search, information prediction (e.g., financial prediction, presidential election prediction), personalized recommendation, healthcare (e.g., depression screening), and online teaching.
3.1. E-commerce
With mobile Internet development, online shopping has become more popular, with users often providing personal comments on products purchased via Taobao, Jingdong, Amazon, and other e-commerce platforms. Using this source of product reviews, the authors of [1] conducted real-time emotion analysis to obtain useful emotional and behavioral consumer characteristics, enabling the prediction of trend changes in consumer preferences. Such information would help a majority of consumers deeply understand the quality of goods, pre-sale and after-sales services, logistic services, and other related information, guiding them through their future purchases. Manufacturers would also benefit from first-hand consumer feedback, timely product shortage warnings, and improved product quality and design. Sellers would benefit from knowing consumer psychological states as they relate to available commodities and related services. Sellers who can capture consumer psychology can make timely sales, purchase, and marketing decisions, allowing them to reach market dominance.
means Internet violence or entertainment, driving people towards more extreme emotional reaction. Irrational netizen emotion causes national and societal security risks. Thus, relevant national management departments require knowledge of network public opinion trends to guide that opinion properly and in a timely fashion. However, when such information is obtained through various channels, its complexity prevents manual processing. This shortfall makes it important to develop accurate and effective emotion analysis systems, and makes the automatic processing of network public opinion information necessary to maintain national security and social stability.
3.3. Big Search
With network space expansion, network application
the Internet has become ubiquitous and given rise to big search technology [3]. Big search is becoming a strong tool and catalyst for network development. Big search, the next generation search engine for cyberspace, is becoming an urgent need. Compared with traditional search, big search can understand user search intentions on a semantic level while also perceiving user needs according to their spatio-temporal location, emotional state, and historical preferences. Big search can also remove false data and protect user privacy. In addition, big search solutions can provide intelligent answers to users, making retrieval technology based on user emotion analysis an important research task for big search.
3.4. Information Prediction
As the Internet has developed, larger numbers of people rely on it for information and communication sharing, particularly for social network interactions (e.g., Microblog, WeChat, stock, and futures forums). Emotion analysis technology can be used to analyze the impact of social networks upon user lives and predict developing trends by way of the commentary, news articles, and other content. The main applications of information prediction include the following three aspects.
Emotion analysis plays an increasingly important role in the prediction of democratic elections. Paul et al. [4] presented a framework, called Compass, that used the 2016 U.S. presidential election as an example to analyze election-related crowd emotion. They built a spatial-temporal sentiment map through Compass for the election, and used that map to match election results to an extent. Their study showed that any political event can be described by its popularity in negative and positive senses. In addition, Ceron et al. [43] used emotion analysis to calculate Twitter support rates for political leadership candidates in the 2011 Italian parliamentary election and the 2012 French presidential election.
3.4.3. Other prediction:
Emotion analysis can also be used to predict public
applied to natural disaster prediction and judgment, including epidemics [44] and earthquakes. With the application of information prediction, emotion analysis technology has received greater attention. By using emotion analysis technology to analyze Internet news, blogs, and other information sources, developing event trends can be predicted accurately.
3.5. Personalized Recommendation
The emergence of personalized recommendation systems [45, 46, 47] has provided users with a tool to address information overload issues. However, traditional recommendation technology only considers overall user scores while ignoring emotion information contained in user comments. Such commentary usually contains subjective user views, preferences, and emotions regarding certain attributes of things, reflecting user emotional tendencies for those attributes. Mining and exploiting user commentary to its fullest extent gives rise to more accurate personalized recommendation while helping resolve issues such as cold starts, data sparsity, and low recommendation accuracy.
Courses (MOOCs) [49, 50], a large number of online courses and reviews have been generated. Most reviews allow students to express their emotions and opinions. Tucker et al. [7] found a positive correlation between student emotional tendencies from their forum-based reviews and their learning performance on the MOOC platform. Thus, using emotion analysis technology to analyze comment information on the MOOC platform allowed the authors to obtain course-related emotion information. Such information can help teachers find problems in curriculum arrangement, knowledge systems and teaching methods, enabling timely teaching plan and method optimization to further improve teaching quality and student learning ef-
essence, it is a method of learning complex feature representation, based on original feature input, through multi-layer nonlinear processing. If combined with specific domain tasks, DL can construct new classifiers or generating tools through the feature representation of automatic learning, and realize domain-oriented classification or other tasks. The specific steps of the algorithm for a DL model are listed as follows [51, 52]:
Step 1: Construct a learning network with random initialization, set the total number of network training layers n, initialize unlabeled data as the input set of network training, and initialize training network layer i = 1.
Step 2: Based on the input set, an unsupervised learning algorithm is used to pre-train the learning network of the current layer.
Step 3: The training results of each layer are used as input for the next layer, constructing the input set once again.
Step 4: If i is less than n, then i = i + 1, and return to Step 2; otherwise, proceed to Step 5.
Step 5: The supervised learning method is used to adjust the network parameters of all layers until the error meets practical requirements.
A Survey on Deep Learning for Textual Emotion Analysis
Fig. 1: A DAN with two feed-forward layers
Step 6: Complete classifier construction (such as neural network classifiers) or complete deep generation model construction (such as a Deep Neural Network (DNN)).
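Steps 1 to 6 can be illustrated with a small NumPy sketch. It is not the procedure of [51, 52] itself: it assumes tied-weight sigmoid autoencoders for the unsupervised stage and a softmax classifier for the supervised stage, neither of which is specified in the text, and all function names, layer sizes, and learning rates are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_layer(x, n_hidden, rng, epochs=200, lr=0.5):
    """Step 2: train one tied-weight autoencoder on unlabeled input x (assumed design)."""
    w = rng.normal(0.0, 0.1, size=(x.shape[1], n_hidden))  # Step 1: random initialization
    b_h, b_o = np.zeros(n_hidden), np.zeros(x.shape[1])
    for _ in range(epochs):
        h = sigmoid(x @ w + b_h)              # encode
        z = sigmoid(h @ w.T + b_o)            # decode with the transposed weights
        dz = (z - x) * z * (1 - z)            # squared-error gradient at the decoder
        dh = (dz @ w) * h * (1 - h)           # gradient at the encoder
        w -= lr * (x.T @ dh + dz.T @ h) / len(x)
        b_h -= lr * dh.mean(axis=0)
        b_o -= lr * dz.mean(axis=0)
    return w, b_h

def greedy_pretrain(x, layer_sizes, rng):
    """Steps 2-4: pre-train layer i, feed its output to layer i + 1, repeat."""
    params, inp = [], x
    for n_hidden in layer_sizes:
        w, b = pretrain_layer(inp, n_hidden, rng)
        params.append([w, b])
        inp = sigmoid(inp @ w + b)            # Step 3: output becomes the next input set
    return params

def fine_tune(x, y, params, n_classes, rng, epochs=500, lr=0.5):
    """Step 5: supervised backpropagation through all layers plus a softmax head."""
    w_out = rng.normal(0.0, 0.1, size=(params[-1][0].shape[1], n_classes))
    b_out = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        acts = [x]
        for w, b in params:
            acts.append(sigmoid(acts[-1] @ w + b))
        logits = acts[-1] @ w_out + b_out
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        delta = (p - onehot) / len(x)         # softmax cross-entropy gradient
        grad_w_out, grad_b_out = acts[-1].T @ delta, delta.sum(axis=0)
        delta = (delta @ w_out.T) * acts[-1] * (1 - acts[-1])
        w_out -= lr * grad_w_out
        b_out -= lr * grad_b_out
        for i in reversed(range(len(params))):
            w, b = params[i]
            grad_w, grad_b = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:                         # propagate before overwriting w
                delta = (delta @ w.T) * acts[i] * (1 - acts[i])
            params[i] = [w - lr * grad_w, b - lr * grad_b]
    return params, (w_out, b_out)

def predict(x, params, head):
    """Step 6: the finished classifier."""
    h = x
    for w, b in params:
        h = sigmoid(h @ w + b)
    return (h @ head[0] + head[1]).argmax(axis=1)
```

A call such as `greedy_pretrain(x, [4, 3], rng)` followed by `fine_tune(x, y, params, 2, rng)` runs the whole pipeline on a small feature matrix `x` with values in [0, 1] and integer labels `y`.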
Fig. 3: The framework of CNN
4.2. DL-related Methods
In this subsection, we provide an overview of related DL methods. The basic ideas of these methods are summarized as follows.
4.2.1. DAN:
DANs are constructed by stacking nonlinear layers over traditional neural bag-of-words models. For each document, a DAN takes the arithmetic mean of the word vectors as input, and passes it through one or more feed-forward layers until there is a softmax for classification. The framework is shown in Fig. 1.
an input sequence of n tokens to one of k labels, and it also requires the following three steps to function:
Step 1: Take the vector average of the embedding asso-
4.2.2. DAE:
A DAE is an unsupervised learning algorithm that acts as an autoencoder modification. It can form a DL network with multiple stacked layers. A denoising autoencoder (shown in Fig. 2) consists of an encoder, a hidden layer, and a decoder.
Encoder f(x̄) is used to reduce the dimensionality of high-dimensional input. Noise is added to input x to obtain a corrupted version x̄, which is input into f(x̄). Implicit cod-
reconstructed vector z. The specific computing for y and z is described as follows:
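As a rough illustration of the corrupt, encode, and decode pass of a denoising autoencoder, the sketch below assumes masking noise, sigmoid encoder and decoder layers, and a cross-entropy reconstruction loss between x and z. These choices, together with the names `dae_forward`, `w_enc`, `w_dec`, and `drop_prob`, are assumptions made for the example and are not taken from the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_forward(x, w_enc, b_enc, w_dec, b_dec, rng, drop_prob=0.3):
    """One forward pass of a denoising autoencoder (assumed standard form).

    x is a batch with values in [0, 1]; the loss compares the reconstruction z
    with the clean input x, not with the corrupted input.
    """
    mask = rng.random(x.shape) >= drop_prob   # masking noise: drop a random subset
    x_noisy = x * mask                        # corrupted version of x
    y = sigmoid(x_noisy @ w_enc + b_enc)      # hidden code y = f(x_noisy)
    z = sigmoid(y @ w_dec + b_dec)            # reconstruction z = g(y)
    eps = 1e-12                               # avoids log(0)
    loss = -np.mean(np.sum(x * np.log(z + eps)
                           + (1 - x) * np.log(1 - z + eps), axis=1))
    return x_noisy, y, z, loss
```

Training would minimize `loss` with respect to the four parameter arrays; stacking several such layers gives the multi-layer network mentioned above.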
$$E \in \mathbb{R}^{n \times m} \quad (9)$$
where m denotes the dimensionality of a word vector.
Convolution layer: It uses different convolution kernels to perform the convolution operation on the input matrix, extract local features from the input, and obtain the feature maps of the convolution kernels. The specific representation is described as follows:
$$X_j^{l} = f\Big(\sum_{i} X_i^{l-1} * K_{ij}^{l} + b_j^{l}\Big)$$
notes all local received domains that the convolution kernel needs to traverse.
Pooling layer: It uses the corresponding sampling
features from being discarded. Common pooling algorithms include max pooling and average pooling.
Full connection layer: It classifies input, obtains the classification results, and is responsible for taking those results to the output layer.
4.2.4. RNN:
An RNN is a kind of neural network with a ring (recurrent) structure and a specific memory function. Its input includes current input samples as well as information obtained at the previous time step, so that information can be cycled in the network at any time. The framework of an RNN is shown in Fig. 4, in which x denotes the input layer, O denotes the output layer, H denotes the hidden layer, and u, V, and w denote the weights of the respective layers. The output of the hidden layer is described by:
where $y_t$ denotes the output at time t, $W_{hh}$, $W_{xh}$, and $W_{hy}$ denote the weight matrices, $b_h$ and $b_y$ denote the bias terms, $h_t$ denotes the hidden state at time t, and $x_t$ denotes the input
relationship of long-term dependency, characterize the information of a time sequence, and effectively solve the problem of gradient vanishing or gradient exploding faced during RNN training. It was first proposed by Hochreiter and Schmidhuber in 1997 [27]. Since then, many researchers have optimized and improved it, leading to its wide use in various NLP tasks.
Each unit of an LSTM network consists of four components: a memory cell, input gate, output gate, and forget gate. Memory cells are connected circularly with each other. Three nonlinear gate cells can be used to adjust the information of memory cell input and output flows. The framework of an LSTM network is shown in Fig. 5.
The forward computing process of an LSTM network is described as follows.
Input gate:
Memory cell:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \quad (16)$$
Output gate:
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad (17)$$
Result output:
$$h_t = o_t \odot \tanh(c_t) \quad (18)$$
where $x_t$ denotes the input vector (such as a word vector) at time t; f, i, and o denote the activation vectors of the forget gate, input gate, and output gate, respectively; c denotes the memory unit vector; h denotes the output vector of the LSTM unit; $W_i$, $U_i$, $W_f$, $U_f$, $W_c$, $U_c$, $W_o$, and $U_o$ denote the weight matrices; $b_i$, $b_f$, $b_c$, and $b_o$ denote the bias vectors; and σ and tanh denote the activation functions.
A Bi-LSTM is an improved LSTM model. A one-directional LSTM uses previous information to deduce subsequent information, so it cannot access future context or integrate context information, which affects system prediction performance. A Bi-LSTM uses two LSTM networks to train together and start their respective sequences from opposite ends while being connected back to the same output layer. Thus, it can integrate the past and future information of each point. A Bi-LSTM includes forward and backward calculation. The horizontal direction represents the bi-directional flow of the time sequence, and the vertical direction represents the one-directional flow from the input layer to the hidden layer and on to the output layer. The network structure of a Bi-LSTM network is shown in Fig. 6.
The forward calculation of hidden vector $\overrightarrow{h}$ is described as follows:
$$\overrightarrow{h}_t = \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1}) \quad (19)$$
The backward calculation of hidden vector $\overleftarrow{h}$ is described as follows:
$$\overleftarrow{h}_t = \mathrm{LSTM}(x_t, \overleftarrow{h}_{t-1}) \quad (20)$$
The output is described as follows:
$$y_t = g(W_{\overrightarrow{h}y}\overrightarrow{h}_t + W_{\overleftarrow{h}y}\overleftarrow{h}_t + b_y) \quad (21)$$
where $x_t$ denotes the input data, $y_t$ denotes the output at time t, $W_{\overrightarrow{h}y}$ and $W_{\overleftarrow{h}y}$ denote the weight matrices, and $b_y$ denotes the bias term.
4.2.7. GRU:
A GRU is an LSTM variant. It is well known that LSTM can overcome the problem of gradient vanishing or gradient exploding when dealing with the relationship of long-distance dependence, and it can keep the dependency
network’s three-gate structure, a GRU has only two gates: an update gate and a reset gate. The network structure of a GRU is shown in Fig. 7.
The forward calculation of a GRU is described as follows:
Update gate:
$$u_t = \sigma(W_u x_t + V_u h_{t-1} + b_u) \quad (22)$$
Reset gate:
$$r_t = \sigma(W_r x_t + V_r h_{t-1} + b_r) \quad (23)$$
Memory cell:
$$\tilde{h}_t = \tanh(W_h x_t + V_h (r_t \odot h_{t-1}) + b_h) \quad (24)$$
Output:
$$h_t = u_t \odot h_{t-1} + (1 - u_t) \odot \tilde{h}_t \quad (25)$$
where $W_u$, $W_r$, $W_h$, $V_u$, $V_r$, and $V_h$ denote the weight matrices; $b_u$, $b_r$, and $b_h$ denote the bias vectors; and σ and tanh denote the activation functions.
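The GRU forward calculation in Eqs. (22) to (25) can be sketched in NumPy as follows. The parameter dictionary, the helper names `init_params` and `run_gru`, and the small random initialization are hypothetical; the sketch also assumes each gate has its own bias, so the candidate state uses a separate vector `b_h`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step; p maps names such as "W_u" to NumPy arrays."""
    u_t = sigmoid(p["W_u"] @ x_t + p["V_u"] @ h_prev + p["b_u"])   # update gate, Eq. (22)
    r_t = sigmoid(p["W_r"] @ x_t + p["V_r"] @ h_prev + p["b_r"])   # reset gate, Eq. (23)
    h_tilde = np.tanh(p["W_h"] @ x_t
                      + p["V_h"] @ (r_t * h_prev) + p["b_h"])      # candidate state, Eq. (24)
    return u_t * h_prev + (1.0 - u_t) * h_tilde                    # new hidden state, Eq. (25)

def init_params(n_in, n_hidden, seed=0):
    """Small random weights and zero biases for the three blocks u, r, h (assumed)."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in "urh":
        p[f"W_{g}"] = rng.normal(0.0, 0.1, (n_hidden, n_in))
        p[f"V_{g}"] = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
        p[f"b_{g}"] = np.zeros(n_hidden)
    return p

def run_gru(xs, p):
    """Run the GRU over a sequence xs of shape (T, n_in), starting from h_0 = 0."""
    h = np.zeros(p["b_u"].shape[0])
    states = []
    for x_t in xs:
        h = gru_step(x_t, h, p)
        states.append(h)
    return np.stack(states)
```

With the update gate saturated at 1, Eq. (25) copies the previous state unchanged, which is how the unit can carry information across many time steps.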
Fig. 8: The framework of attention
4.2.8. Attention:
An attention mechanism imitates human visual processing (i.e., it aligns internal experience with external feelings to increase the observation precision of certain areas). For example, when browsing a picture, people first scan the global image quickly to obtain a target area (i.e., attention point) that requires focus. More attention is then devoted to that point to obtain more detailed information while other useless information is suppressed. The specific framework is shown in Fig. 8.
The calculation process is mainly divided into the following three steps.
Step 1: The similarity between the query and each key is calculated to obtain the weight. Common similarity functions include dot product, concatenation, and perceptron. They are described as follows:
$$f(Q, K) = \begin{cases} Q^{T} K, & \text{dot} \\ W_a [Q; K], & \text{concat} \\ v_a^{T} \tanh(W_a Q + U_a K), & \text{perceptron} \end{cases} \quad (26)$$
Step 2: Generally, a softmax function (see Equation 27) is used to normalize these weights,
$$a_i = \mathrm{softmax}(f(Q, K_i)) = \frac{\exp(f(Q, K_i))}{\sum_{j=1}^{n} \exp(f(Q, K_j))} \quad (27)$$
Step 3: The final attention is obtained as the weighted sum of the values, using these weights, which is described as follows:
$$\mathrm{attention}(Q, K, V) = \sum_{i} a_i V_i \quad (28)$$
At present, NLP research tends to use identical keys and values.
4.2.9. MHA:
MHA is an attention mechanism variant that uses multiple queries to extract multiple groups of different information in parallel from input information for concatenating. The multiple attention mechanism is shown in Fig. 9.
Fig. 9: The framework of MHA
First, a linear transformation is made for the query, key, and value. Then, they are input into the Scaled Dot Product Attention (SDA) mechanism, and the same operation is repeated h times. The input of each time is the linear transformation of the original input. The SDA framework is shown in Fig. 10.
An SDA is an attention mechanism of similarity calculation using a dot product, a series of queries, a series of keys with dimensions $d_k$, and a series of values with dimensions $d_v$. The calculation process is described as follows:
$$\mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \quad (29)$$
“Multi-head” denotes that a head is calculated each time, parameter W of the linear transformation for Q, K, and V is different each time, and the results of the h runs of the SDA mechanism are concatenated. The concatenation is then passed through one more linear transformation to obtain a value, which is the MHA result. The calculation is described as follows:
$$\mathrm{multihead}(Q, K, V) = \mathrm{concat}(\mathrm{head}_1, \mathrm{head}_2, \cdots, \mathrm{head}_h) W^{o} \quad (30)$$
$$\mathrm{head}_i = \mathrm{attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \quad (31)$$
where $W_i^{Q} \in \mathbb{R}^{d_k \times \bar{d}}$, $W_i^{K} \in \mathbb{R}^{d_k \times \bar{d}}$, $W_i^{V} \in \mathbb{R}^{d_V \times \bar{d}}$, and $W^{o} \in \mathbb{R}^{h d_V \times \bar{d}}$ denote a single attention function with d-dimensional
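The softmax weighting of Eqs. (27) and (28), the scaled dot-product form of Eq. (29), and the multi-head combination of Eqs. (30) and (31) can be sketched in NumPy as below. The function names are hypothetical, and the projection matrices are assumed to have shape (model dimension, head dimension) so that `q @ wq` is well defined; a different layout would only require transposing them.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def sda(q, k, v):
    """Scaled dot-product attention, Eq. (29).

    q: (n_q, d_k), k: (n, d_k), v: (n, d_v). Returns the output and the weights.
    """
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))  # Eq. (27) with a scaled dot-product similarity
    return weights @ v, weights                # Eq. (28): weighted sum of the values

def multihead(q, k, v, w_q, w_k, w_v, w_o):
    """Multi-head attention, Eqs. (30)-(31).

    w_q, w_k, w_v are lists with one projection matrix per head; w_o maps the
    concatenated heads back to the model dimension.
    """
    heads = [sda(q @ wq, k @ wk, v @ wv)[0]
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    return np.concatenate(heads, axis=-1) @ w_o
```

When every key is identical, the weights from `sda` are uniform and the output is the plain average of the values, which is a convenient sanity check on an implementation.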
word-oriented representation learning. The basic ideas of the related approaches are summarized as follows.
5.1.1. word2vec:
The word2vec technique [53] consists of two models: continuous bag-of-words (CBOW) and continuous skip-gram. The CBOW model uses the average/sum of context words as input to predict a current word. The skip-gram model uses a current word as input to predict each contextual word. Word2vec has fewer dimensions than previous embedding methods, making it faster, more versatile, and able to be used by various NLP tasks. Although it has
ods. However, it only focuses on binary labels, weakening its generalization ability on other affect tasks.
5.1.7. fastText:
The fastText library [59] can handle Out-Of-Vocabulary (OOV) words by predicting their word vectors based on learned character n-grams embedding. While it requires little training time, without sharing parameters, it has poor generalization for large output spaces.
5.1.8. context2vec:
5.1.14. Emo2Vec:
Emo2Vec [66] is a multi-task learning method that encodes emotional semantics into vectors by using a CNN. It is trained by six different emotion-related tasks and can encode emotional semantics into real-valued, fixed-sized word vectors.
on word2vec. It consists of a two-layer Bi-LSTM network with a deep self-attention mechanism. It can overcome the
5.1.17. SVD NS:
SVD NS [69] is a word embedding method in the context of NLP. It not only learns word-context co-occurrences, but also learns the abundance of unobserved or insignificant co-occurrences, improving word distributions in latent embedded space.
encoder by predicting preceding and following sentences using a current sentence. It follows the same idea as the skip-gram model of the word2vec embedding method. It can predict the probability of a sentence appearing in a given context through the current sentence, but its model training speed is slow.
rectional encoder representations through transformers. It
right contexts in all layers. It can obtain context sensitive bidirectional feature representation. However, there is inconsistency between its pre-training process and generation process, which leads to poor effect on the generation task. It also consumes more computing resources than other existing models.
3 | GloVe [55] | weighted least squares regression | unsupervised | multi-language | 2014 | Stanford University
4 | BiDRL [56] | logistic regression | semi-supervised | cross-language | 2016 | Peking University
5 | Emoji2Vec [57] | logistic regression | supervised | English | 2016 | Princeton University and University College London
6 | SSWE [58] | feed forward neural | supervised | English | 2016 | Harbin Institute of Technology, et al.
7 | fastText [59] | probability statistics | supervised | | 2016 | Facebook
8 | context2vec [60] | Bi-LSTM | unsupervised | multi-language | 2016 | Bar-Ilan University
9 | REF [61] | nearest neighbor | NA | English | 2017 | Yuan Ze University, et al.
11 | KUBWE [63] | kernel-based | | English | 2019 | Dalhousie University
adapts these parameters to a target task using the corresponding supervised objective. It is a unidirectional auto-regressive language model, and cannot obtain context sensitive feature representation.
5.2.9. FastSent:
FastSent [80] is a model for obtaining sentence embedding that can predict words in context sentences based on a current sentence. Its disadvantage is that it loses sentence sequencing information.
5.2.10. ERNIE:
ERNIE [81] uses a multi-layer transformer as basic encoder to capture contextual information. It is a method for learning language representation enhanced by knowledge
lingual methods, text-oriented cross-linguistic methods, and emoji-oriented cross-linguistic methods.
6.1. Text-oriented Monolingual Emotion Analysis Models
DL methods have been proven effective for many NLP tasks, including sentiment and emotion analysis. The following are emotion analysis models for single language based on DL methods.
Abdul-Mageed and Ungar [86] proposed a fine-grained emotion detection method using Gated Recurrent Neural Networks (GRNNs). Tafreshi and Diab [87] proposed a joint multi-task learning model using a GRNN, and trained it with a multigenre emotion corpus to predict emotions for four types of genres. Kulshreshtha et al. [88] proposed a neural architecture, Linguistic-featured Emoji-based Partial Combination of Deep Neural Networks (LE-PC-DNNs), for emotion intensity detection based on a
1 | Paragraph Vector [72] | log-bilinear | unsupervised | English | 2014 | Google
2 | Thoughts | RNN, GRU | | English | 2015 | University of Toronto, et al.
3 | | Bi-LSTM, attention | supervised | English | 2017 | Massachusetts Institute of Technology, et al.
4 | BERT [75] | Transformer | unsupervised | multi-language | 2018 | Google
5 | InferSent [76] | Bi-LSTM, max pool- | supervised | English | 2017 | Facebook, et al.
6 | CCTSenEmb [77] | Gaussian | unsupervised | English | 2019 | Beijing Institute of Technology
7 | CAMSE [78] | Self-attention, Bi- | supervised | English | 2019 | Tsinghua University
the second uses a CNN to extract emotion features. Zhang et al. [94] proposed a multi-task CNN for TEA, based on emotion distribution learning.
Khanpour and Caragea [95] proposed a method for emotion detection in online health communities, called ConvLexLSTM. It combined the output of a CNN with lexicon-based features, then fed everything into an LSTM network to produce the final output via the softmax mechanism. Yang et al. [96] proposed an interpretable neural network model for the relevant emotion ranking, using a multi-layer feed-forward neural network. Kratzwald et al. [97] proposed a text-based emotion recognition approach using an RNN, named sent2affect, that was a tailored form of transfer learning for affective computing. Yang et al. [98] proposed a framework called Interpretable Relevant Emotion Ranking with Event-driven Attention (IRER-EA), based on an RNN and the attention mechanism.
The specific comparison of existing text-oriented monolingual emotion analysis models is listed in Table 3.
important to emotion analysis. If we can effectively detect the emotion in a conversation, it has great commercial value (e.g., online customer service of an e-commerce platform).
Ghosal et al. [99] presented the Dialogue Graph Convolutional Network (DialogueGCN) for emotion recogni-
graph attention mechanism was designed to leverage commonsense knowledge, which augments the semantic information of the utterance; an incremental transformer was used to encode multi-turn contextual utterances. In addition, multi-task learning was used to improve the performance of emotion recognition.
Jiao et al. [105] proposed a Hierarchical Gated Recurrent Unit (HiGRU) framework with two Bi-GRUs: the lower-level Bi-GRU was used to learn the individual utterance embedding and the upper-level Bi-GRU was used to learn the contextual utterance embedding. Li et al. [106] proposed a fully data-driven Interactive Double States Emotion Cell Model (IDS-ECM) for textual dialogue emotion prediction. In the model, the Bi-LSTM and attention mechanism were used to extract the emotion features. Li et al. [107] proposed a transformer-based context- and speaker-sensitive model for emotion detection in conversations, namely HiTrans, which consists of two hierarchical transformers. One was used to generate local utterance representations using BERT, and another was used to obtain the global context of the conversation.
text encoder, and the iterative improvement mechanism. Li et al. [110] proposed a Hierarchical Transformer (HiTransformer) framework to address utterance-level emotion recognition in dialog systems. It used a lower-level transformer to model word-level input, an upper-level transformer to capture the contexts of utterance-level embeddings, and BERT to obtain better individual utterance
2 Tafreshi’s [87] GRU word2vec, 2018 BLG+HLN:
Glove 0.836, MOV: 0.91
3 LE-PC-DNN [88] CNN 2018 EmoInt-2017 0.791
Mohammadi’s Glove, SemEval 2019 (Emo-
4 SVM, atten- 2019 0.7303
[89] ELMo Context)
5 Li’s [90] LSTM word2vec 2017 WeChat 0.2512
Bi-GRU, cap-
6 Sentylic [91] word2vec 2018 WASSA 2018 0.692
sule networks
CNN, LSTM, GloVe and EmoInt-2017, SemEval-
7 Akhtar’s [92] 2020 0.748
GRU word2vec 2017
Dailydialogs, Crowd-
FastText Tweets, Grounded- 0.593, 0.988 and
No. | Name | DL method | Pre-training | Year | Dataset |
1 | DialogueGCN [99] | GRU, GCN | GloVe | 2019 | IEMOCAP | 0.6418
2 | KET [100] | MHA | GloVe | 2019 | EC, DailyDialog, MELD, EmoryNLP, IEMOCAP | 0.7413, 0.5337, 0.5818, 0.3439, 0.5956
3 | ConGCN [101] | graph convolutional | GloVe | 2019 | MELD | 0.574
4 | DialogueRNN [102] | RNN, CNN, attention | NA | 2019 | IEMOCAP | 0.6275
5 | RGAT [103] | graph attention networks | BERT | 2020 | DailyDialog, MELD, EmoryNLP, IEMOCAP | 0.5431, 0.6091, 0.3442, 0.6522
6 | KAITML [104] | graph attention mechanism, | GloVe | 2020 | EC, DailyDialog, MELD, EmoryNLP, | 0.7539, 0.5471, 0.5897, 0.3559,
7 | HiGRU [105] | Bi-GRU | word2vec | 2019 | Friends, EmotionPush, | 0.744, 0.771,
tection model, and the BERT model.
Winata et al. [116] used hierarchical attention for dialogue emotion classification based on logistic
regression and XGBoost. Bae et al. [117] proposed a method to detect emotion using a Bi-LSTM encoder for
higher-level representation. Liang et al. [118] proposed hierarchical ensemble classification of contextual
emotion using three sets of CNN-based neural network models trained for four-emotion classification,
Angry-Happy-Sad classification, and Others-or-not classification, respectively.
Xiao [119] designed an ensemble of transfer learning methods using pre-trained language models (ULMFiT,
OpenAI GPT, and BERT). The author also trained a DL model from scratch using pre-trained word embeddings
and a Bi-LSTM architecture with the attention mechanism. The experimental results revealed that ULMFiT
performed best due to its fine-tuning technique. Li et al. [120] proposed a multi-step ensemble neural
network for emotion analysis in text. They used four DL models (LSTM, GRU, CapsuleNet, and Self-Attention)
and obtained eight different models by combining them with two different word embedding models. They then
used Dropout to support improved model convergence. Finally, at each model output, the four predicted
probability categories were obtained. Ragheb et al. [121] presented a model to detect textual conversational
emotion. They used deep transfer learning, self-attention mechanisms, and turn-based conversational
modeling to classify emotion.
Lee et al. [122] proposed a multi-view turn-by-turn model. In this model, the vectors were generated from
each utterance using two encoders: a word-level Bi-GRU encoder and a character-level CNN encoder. The model
could predict emotion with the contextual information,
information from an utterance. Ge et al. [124] proposed an attentional LSTM-CNN model for dialog emotion
classification. They used a combination of CNNs and long short-term memory networks to capture both local
and long-distance contextual information in conversations. In addition, they applied the attention mechanism
to recognize and attend to important words within conversations. They also used ensemble strategies by
combining the variants of the proposed model with different pre-trained word embeddings via weighted voting.
The specific comparison of existing text conversation-oriented monolingual emotion analysis models from
SemEval-2019 Task 3 is listed in Table 5.
6.3. Text-oriented Cross-linguistic Emotion Analysis
In this subsection, we provide a survey of text-oriented cross-linguistic emotion analysis models. The
basic ideas of the related approaches are summarized as follows.
Wang et al. [125] proposed a Bilingual Attention Network (BAN) model based on LSTM and the attention
mechanism. BAN can aggregate monolingual and bilingual informative words to form vectors from document
representations; it can also integrate attention vectors to conduct emotion prediction. Zhou et al. [126]
proposed an attention-based cross-lingual sentiment classification model that learns the distributed
semantics of documents in both source and target languages. In each language, they used LSTM to model
documents and introduced a hierarchical attention mechanism for the model. Chen et al. [127] presented an
Adversarial DAN (ADAN) for cross-lingual sentiment classification. ADAN could transfer knowledge learned
from labeled English data to Chinese and Arabic, where little or no annotated data existed.
Zhou et al. [128] proposed a Bilingual Sentiment Word Embeddings (BSWE) method, based on DL technology, for
English-Chinese cross-language sentiment classification. BSWE could use DAE to learn bilingual embeddings
for Cross-language Sentiment Classification (CLSC). Feng and Wan [129] proposed a Cross-Language In-domain
Sentiment Analysis (CLIDSA) model based on LSTM. It was an end-to-end method that leveraged unlabeled data
in multiple languages and multiple domains. Barnes et al. [130] proposed a Bilingual Sentiment Embeddings
(BLSE) model that used a two-layer feed-forward averaging network to predict text sentiment.
Ahmad et al. [131] built a DL model for emotion detection in the Hindi language. They used a CNN, Bi-LSTM,
cross-lingual embeddings, and different transfer learning strategies for this purpose.
The specific comparison of existing text-oriented cross-linguistic emotion analysis models is listed in
Table 6.
6.4. Emoji-oriented Cross-linguistic Emotion Analysis
“A small digital image or icon used to express an idea or emotion”. As a way to enhance the visual effect
and meaning of short text, emojis are becoming one of the indispensable components in any instant messaging
platform or social media service. Because emojis are becoming increasingly important in emotion analysis on
social networks, SemEval-2018 Task 2: Emoji Prediction in English and Spanish [133] was introduced in 2018.
The aim was to attract greater NLP attention. The basic ideas of the related methods are summarized as
follows.
Coltekin and Rama [134] designed a supervised system consisting of an SVM classifier with bag-of-n-grams
features. Baziotis et al. [135] proposed an architecture to predict emojis using Bi-LSTM and a context-aware
attention mechanism. Beaulieu and Owusu [136] proposed a method to predict English and Spanish emojis using
a bag-of-words model and a linear SVM. Coster et al. [137] built a linear SVM model to predict emoji in
Spanish tweets using the SKLearn SGDClassifier.
Jin and Pedersen [138] built a classifier for Spanish emoji prediction using naive Bayes, logistic
regression, and random forests. Basile and Lino [139] presented an approach to predict Spanish emoji based
on the
S. Peng, et al.
Table 5: Comparison of existing text conversations-oriented monolingual emotion analysis models from SemEval-2019 Task 3
CNN, attention
7 Figure Eight [119] 9 Bi-LSTM, attention ASLP, DeepMoji, 0.7608 U.S.
8 YUN-HPCC [120] 10 ELMO, GloVe 0.7588 China
Capsule-Net, attention
9 11
ULMFit 0.7582 France
[121] LSTM, attention
10 MILAB [122] 12 CNN, Bi-GRU GloVe 0.7581 Korea
11 PKUSE [123] 14 Bi-LSTM, attention GloVe 0.7557 China
GloVe, word2vec,
12 THU NGN [124] 15 CNN, LSTM, attention 0.7542 China
SVM model. Liu [140] presented a model for English emoji prediction using a gradient boosting regression
tree method. Lu et al. [141] proposed a method to address Twitter emoji prediction based on Bi-LSTM and the
attention mechanism.
which has a serious negative impact on the research and performance of non-English language emotion
analysis methods.
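Several of the emoji prediction systems surveyed above combine bag-of-n-grams features with a linear SVM. As
an illustration only, the following is a minimal pure-Python sketch of that general pipeline, not the
implementation of any cited system. The toy tweets, the emoji label names, the helper names (ngrams,
train_linear_svm, predict), and the hyperparameters are all hypothetical; the SVM is approximated by
one-vs-rest stochastic gradient descent on the hinge loss with a simplified L2 penalty.

```python
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Return a bag (Counter) of word n-grams of length 1..n_max."""
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag


def train_linear_svm(samples, labels, epochs=20, lr=0.1, reg=0.01):
    """Train one linear classifier per label (one-vs-rest) with hinge-loss SGD."""
    classes = sorted(set(labels))
    weights = {c: defaultdict(float) for c in classes}
    bags = [ngrams(s) for s in samples]
    for _ in range(epochs):
        for bag, label in zip(bags, labels):
            for c in classes:
                y = 1.0 if label == c else -1.0
                w = weights[c]
                score = sum(w[f] * v for f, v in bag.items())
                # Simplified L2 shrinkage: only features active in this sample decay.
                for f in bag:
                    w[f] *= 1.0 - lr * reg
                # Hinge loss: update only when the margin is violated.
                if y * score < 1.0:
                    for f, v in bag.items():
                        w[f] += lr * y * v
    return weights


def predict(weights, text):
    """Return the label whose linear score is highest for the given text."""
    bag = ngrams(text)

    def score(c):
        return sum(weights[c].get(f, 0.0) * v for f, v in bag.items())

    return max(sorted(weights), key=score)


# Hypothetical toy data: tweets paired with emoji label names.
samples = [
    "i love this so much",
    "love you my friend",
    "this is so funny lol",
    "lol that joke is funny",
    "i am so sad today",
    "sad and crying tonight",
]
labels = ["heart", "heart", "joy", "joy", "cry", "cry"]

weights = train_linear_svm(samples, labels)
print(predict(weights, "love you so much"))
```

In practice, a library implementation with proper tokenization, TF-IDF weighting, and tuned regularization
would replace this toy trainer; the sketch only shows how n-gram counts feed a linear decision function.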
The specific comparison of existing emoji-oriented cross-linguistic emotion analysis models is listed in Table
7. Challenges of Emotion Analysis
Due to the increasing development of social networks and DL technology, unprecedented challenges to emotion
analysis have been posed. Though many researchers have proposed potential solutions for some of the
discussed issues, there are still many other open issues requiring further exploration and deep study [142].
In this section, we summarize the challenges of emotion analysis, and point out the future trends in this
field.
7.1. Emotion Description
At present, there is no unified definition for emotion and no unified standard to classify emotions
effectively and scientifically, which may affect the performance of emotional feature extraction from texts.
Because of the three unique components of human emotion (physiological arousal, subjective experience, and
external expression), different fields possess different understandings. For example, social psychology,
developmental psychology, and neuroscience deem it impossible for researchers to have the same understanding
of emotion [142]. Thus, there is difficulty in determining a unified standard to accurately characterize
human emotions.
7.4. Domain Relevance
Descriptive words and phrases, such as “a long time”, can express different emotions depending on their
domain. For instance, food and beverage reviews often express negative emotion in relation to long waiting
times, while smartphone reviews express positive emotion in relation to long battery standby times [145].
Thus, the domain relevance of words must be considered in emotion analysis. Cross-domain emotion analysis
still presents numerous pressing problems to be resolved.
7.5. Understanding Short Texts
Social networks limit the length of comments, making short text (with its sparseness, non-standard use of
words, and massive data) more common than traditional long text. At the same time, insufficient contextual
semantic information, single-word polysemy, and multi-word synonymy make topic information extraction
difficult to perform accurately, affecting final emotion analysis performance. Thus, understanding short
texts is a very challenging task in NLP.
7.6. Emotion Cause Extraction (ECE)
ECE aims to identify important potential causes or stimuli for observed emotions during in-depth emotion
POS embed-
7 TAJJEB [140] SVM 8 4 Malta, Spain
naive Bayes,
Duluth UROP logistic regres-
8 NA 18 5 U.S.
[141] sion, random
8. Future Research Trends
As Internet+, AI+, 5G, and other opportunities arise, many new applications (e.g., multi-language,
multi-modal, cross-domain, big data) have emerged and pro-
development of social network analysis and DL technology offer many new research directions for emotion
analysis. Some researchers are gradually changing the focus of research in emotion analysis, from single
language, single media, single domain, and small-scale data samples to multi-language, multi-modal,
cross-domain, and big data [148, 149]. According to existing technology development trends, future emotion
analysis research will include the following aspects.
8.1. Multi-language Emotional Analysis
Due to increasing cultural exchanges, network information in multiple languages increasingly interacts and
merges. Existing work has focused on a single language, and corpus resources collected for a single-language
emotion analysis model cannot be applied to multi-language emotion analysis. In addition, corpus resources
for the emotion analysis of different languages are also unbalanced, making their application to
multi-language environments difficult.
8.2. Multi-modal Emotional Analysis
While traditional emotion analysis focuses on single forms of media, multi-modal information (e.g., audio,
video, and image) [150, 151] can often express emotion with greater description and vividness than text. In
addition, as a main carrier of emotional information expression, voice can accurately reflect current user
emotions. Thus, allowing for the combined study of vari-
method is that emotions present in current comment information can be identified accurately and quickly,
provided said information contains words expressing various emotions from different domains. However,
traditional emotion analysis methods often ignore domain dependency characteristics of emotional words, and
may even deliberately choose domain-independent features (such as emoji). With increasing demand for
practical application and the emergence of emotional corpus resources in different domains, cross-domain
emotion analysis will draw greater research attention and focus.
8.4. Emotion Analysis based on Social Network Analysis
With the rapid development of social networks, a large amount of user interaction data has been generated.
These data not only reflect static user characteristics (e.g., number of friends, activity, frequency of
surfing the Internet), but also reflect dynamic user characteristics (e.g., thoughts, social relations,
social influence). Through the analysis of social networks, we can understand how different individuals and
social groups express their emotions and how group emotional tendencies relate to popular
A Survey on Deep Learning for Textual Emotion Analysis
with improved performance.
tion analysis based on big data.
8.6. In-depth Emotion Analysis
The purpose of emotion cause extraction is to recognize the potential cause or stimulus of an observed
emotion. Existing methods of emotion analysis focus on the shallow tasks of emotion recognition and
classification. However, emotion cause identification requires in-depth emotion analysis that focuses on
emotional keywords in text to identify causes automatically. Although current mainstream methods are based
on linguistic rules and statistics, the wide application of DL will continue to attract increasing attention
in ECE research.
Short Texts
At present, there are a large number of comments
10. Declaration of competing interest
The authors declared that they have no conflicts of interest to this work.
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China under Grant Nos.
61876205 and 61877013, the Ministry of Education of Humanities and Social Science project under Grant Nos.
Language Engineering and Computing under Grant No.
