NOVA Information Management School
Instituto Superior de Estatística e Gestão de Informação
Universidade Nova de Lisboa
MULTILINGUAL EMAIL ZONING
Segmenting Multilingual Email Into Zones
by
João Bruno Morais de Sousa Jardim
Dissertation presented as the partial requirement for obtaining a Master's degree in Information
Management, Specialization in Knowledge Management and Business Intelligence
Advisor: Professora Doutora Mariana Sá Correia Leite de Almeida
Co‐supervisor: Ricardo Costa Dias Rei
March 2021
ACKNOWLEDGEMENTS
I want to thank Nuno Carneiro, Ricardo Rei, Mariana Almeida, the entire Cleverly AI team and all
Cleverly annotators, my Master’s colleagues and teachers, my loving parents and family, Eva and my
dearest friends.
This project has received funding from the European Union’s Horizon 2020 research and innovation
program under grant agreement No 873904.
ABSTRACT
The segmentation of emails into functional zones (also dubbed email zoning) is a relevant
preprocessing step for most NLP tasks that deal with emails. In this research, we analyze in depth the
email zoning literature and develop a business case around CLEVERLY AI, a company from the
Customer Service sector. We design a new email zoning classification schema and collect a
multilingual corpus of emails from CLEVERLY AI clients. We develop five neural network-based email
zoning systems; among them, we introduce OKAPI, the first multilingual email zoning model
based on a language-agnostic sentence encoder. Besides outperforming our other systems when
tested on CLEVERLY’s emails, OKAPI shows competitive performance on current English public
benchmarks and reaches new state-of-the-art results for English domain adaptation tasks. Moreover,
we release a new multilingual benchmark, composed of 625 emails in Portuguese, Spanish and
French, and demonstrate that OKAPI effectively generalizes to unseen languages.
KEYWORDS
Natural Language Processing; Machine Learning; Email Zoning; Text Segmentation; Multilingual;
Customer Service
INDEX
1. Introduction .................................................................................................................. 1
1.1. Motivation ............................................................................................................. 1
1.1.1. Email Zoning ................................................................................................... 1
1.1.2. Cleverly Case Study ........................................................................................ 2
1.2. Objectives and Methodology ................................................................................ 3
2. Background ................................................................................................................... 6
2.1. Supervised Machine Learning ............................................................................... 6
2.1.1. Perceptron ...................................................................................................... 7
2.1.2. Multilayer Perceptron .................................................................................... 8
2.1.3. Backpropagation............................................................................................. 8
2.1.4. Overfitting and Regularization ....................................................................... 9
2.1.5. Convolutional Neural Networks ................................................................... 10
2.1.6. Conditional Random Fields ........................................................................... 11
2.1.7. Recurrent Neural Networks ......................................................................... 12
2.1.8. Attention Mechanism ................................................................................... 14
2.1.9. The Transformer ........................................................................................... 14
2.2. Text Representation Models ............................................................................... 15
2.2.1. Sparse Models .............................................................................................. 16
2.2.2. Dense Models ............................................................................................... 17
3. Related Work .............................................................................................................. 22
3.1. Email Zoning ........................................................................................................ 22
3.1.1. JANGADA ......................................................................................................... 22
3.1.2. ZEBRA ............................................................................................................. 24
3.1.3. QUAGGA.......................................................................................................... 26
3.1.4. CHIPMUNK ....................................................................................................... 28
3.1.5. Email Zoning Public Corpora ........................................................................ 30
3.2. Text Segmentation .............................................................................................. 32
4. Methodology .............................................................................................................. 35
4.1. Corpora ................................................................................................................ 35
4.1.1. CLEVERLY AI Corpus ......................................................................................... 35
4.1.2. Public Corpora .............................................................................................. 39
4.1.3. New Multilingual Email Zoning Corpus ........................................................ 41
4.2. Models ................................................................................................................. 43
4.2.1. Baseline, Word Embeddings + BiLSTM (W-BiLSTM) ..................................... 43
4.2.2. Word and Subword Embeddings + BiLSTM (WSw-BiLSTM) ......................... 46
4.2.3. XLM-RoBERTa Embeddings + BiLSTM (XLMR-BiLSTM) ................................. 46
4.2.4. XLM-RoBERTa Embeddings + BiLSTM + CRF (XLMR-BiLSTM-CRF or OKAPI) 48
4.3. Evaluation Metrics ............................................................................................... 48
4.3.1. Email Zoning ................................................................................................. 48
4.3.2. Inter-annotator Agreement ......................................................................... 50
5. Results and discussion ................................................................................................ 53
5.1. Experimental Setup ............................................................................................. 53
5.2. Cleverly Results.................................................................................................... 54
5.2.1. CLEVERLY Annotated Corpus........................................................................... 54
5.2.2. Best Model Analysis ..................................................................................... 56
5.2.3. Impact in CLEVERLY Pipeline ........................................................................... 58
5.3. Public Corpora Results ......................................................................................... 59
5.3.1. English Corpora ............................................................................................ 59
5.3.2. Multilingual Corpus ...................................................................................... 61
6. Conclusions ................................................................................................................. 64
7. Limitations and recommendations for future works ................................................. 66
8. Bibliography ................................................................................................................ 68
LIST OF FIGURES
Figure 1.1 – Example email with two identified functional segments: authored content and
advertisement. ................................................................................................................... 1
Figure 1.2 – Overview of CLEVERLY AI ticket classification pipeline with email zoning as a
preprocessing step. ............................................................................................................ 2
Figure 2.1 – Unfolded Recurrent Neural Network. Taken from https://tinyurl.com/yy79zmxo.
.......................................................................................................................................... 12
Figure 2.2 – Internal Representation of an LSTM cell. Taken from
https://tinyurl.com/yy79zmxo ......................................................................................... 13
Figure 2.3 – The architecture of the Transformer Model. Taken from
https://tinyurl.com/y2jvw3m3. ....................................................................................... 15
Figure 2.4 – BERT-Base and BERT-Large Encoder Stack. Taken from
https://tinyurl.com/y3werbz7 ......................................................................................... 19
Figure 3.1 – Example of a JANGADA labeled email message adapted from Carvalho & Cohen
(2004). .............................................................................................................................. 23
Figure 3.2 – Example of a labeled email message with both three- and nine-zone
classification adapted from Lampert et al. (2009). .......................................................... 25
Figure 3.3 – Example of a QUAGGA labeled email message with both two- and five-zone
annotations, adapted from Repke & Krestel (2018). ....................................................... 27
Figure 3.4 – QUAGGA model overview. Taken from Repke & Krestel (2018). ........................... 28
Figure 3.5 – Example of a CHIPMUNK labeled email message, adapted from Bevendorff et al.
(2020). .............................................................................................................................. 29
Figure 3.6 – CHIPMUNK model architecture. Taken from Bevendorff et al. (2020). .................. 29
Figure 3.7 – Proposed Transformer-based segmentation models (Lukasik et al., 2020). ....... 33
Figure 4.1 – Excerpt from a CLEVERLY labeled email message. Personal and other sensitive
instances were replaced by a default token that indicates their content. ...................... 36
Figure 4.2 – Percentage of total lines for email zone, for each language in CLEVERLY corpus. . 37
Figure 4.3 – Percentage of the total lines per zone. Comparison between the Enron corpus
(green/left) and the ASF corpus (yellow/right). ............................................................... 40
Figure 4.4 – W-BiLSTM model overview. The model is divided in two parts: 1) a sentence-
level word2vec word encoder; and 2) a segmentation module that uses a BiLSTM and a
softmax output layer to classify each sentence into an email zone. Although the BiLSTM
receives the sequence of sentences in an email, for simplicity, we illustrate the process
for a single sentence. ....................................................................................................... 44
Figure 4.5 – Sw-BiLSTM model overview. The model uses the same architecture as the W-
BiLSTM but encodes sentences at the subword-level. .................................................... 45
Figure 4.6 – WSw-BiLSTM model overview. The model produces parallel sentence
representations at word and subword level. The representations are concatenated for
each sentence and fed into the output layer................................................................... 46
Figure 4.7 – Overview of XLM-RoBERTa embedding extraction steps. ................................... 47
Figure 4.8 – XLMR-BILSTM is composed of two building blocks: 1) a multilingual sentence
encoder (XLM-RoBERTa) to derive sentence embeddings; and 2) a segmentation
module that uses a BiLSTM and a softmax output layer to classify each sentence into an
email zone. ....................................................................................................................... 47
Figure 4.9 – OKAPI model overview. OKAPI follows the same architecture as XLMR-BiLSTM
model except for the output layer, in which it uses CRF to classify each sentence into an
email zone. ....................................................................................................................... 48
Figure 5.1 – Confusion matrix for OKAPI’s email zoning results on CLEVERLY’s multilingual
corpus. On the left the true labels and on the bottom the predicted labels. The darker
the square, the more lines are predicted with the column’s label. ................................. 57
LIST OF TABLES
Table 3.1 – Summary of existing email zoning corpora. *Note that, although Bevendorff et al.
(2020) Gmane corpus is technically multilingual, it only has 38 non-English test emails
that are spread over 13 different languages. .................................................................. 30
Table 4.1 – Statistics for CLEVERLY corpus discriminated by language. ..................................... 37
Table 4.2 – Statistics for each email zone in CLEVERLY corpus. ................................................. 38
Table 4.3 – Train, validation and test splits for each language in CLEVERLY corpus. ................. 38
Table 4.4 – Repke & Krestel (2018) available Enron and ASF corpus in numbers considering 5
zones................................................................................................................................. 39
Table 4.5 – Bevendorff et al. (2020) available Gmane and Enron corpus in numbers. ........... 40
Table 4.6 – Statistics for our multilingual email zoning corpora. The distribution was obtained
by averaging the values for both annotators. .................................................................. 41
Table 4.7 – Distribution, for each language, of the number of lines per zone in our Multilingual
corpus. The distribution was obtained by averaging the values for both annotators. ... 43
Table 4.8 – Evaluation metrics used to measure model performance for each corpus. *For
CLEVERLY, we only use overall precision and overall recall when analyzing the results of
the best model for each zone. ......................................................................................... 49
Table 5.1 – f1-score comparison of embedding method impact for CLEVERLY corpus. Models
were trained and tested for each language. Best result is highlighted for each train
language. .......................................................................................................................... 55
Table 5.2 – f1-score comparison of output layer impact for the model with higher performing
embedding method. The models were trained and tested for each language. Best result
is highlighted for each train-test language combination. ................................................ 56
Table 5.3 – OKAPI zone-level performance on CLEVERLY’s multilingual corpora. ....................... 56
Table 5.4 – OKAPI two-level classification email zoning performance on CLEVERLY corpus....... 58
Table 5.5 – Predicted corpus statistics and statistics regarding time taken by OKAPI on main
email zoning steps. ........................................................................................................... 58
Table 5.6 – CLEVERLY’s classifier accuracy using the current ticket preprocessing method
versus OKAPI. ..................................................................................................................... 59
Table 5.7 – OKAPI email zoning performance (precision/recall/accuracy) compared to various
models from the literature, for Repke & Krestel (2018) corpora in a 2-zone and 5-zone
schema. ............................................................................................................................ 60
Table 5.8 – OKAPI email zoning overall accuracy and 6 most common zones recall, compared
to various models, under the 15-level zoning schema and corpora of Bevendorff et al.
(2020). .............................................................................................................................. 60
Table 5.9 – Comparison of OKAPI and QUAGGA capacity to generalize learnings, for Repke &
Krestel (2018) Enron and ASF corpora. ............................................................................ 61
Table 5.10 – Inter-annotator agreement for each language in our multilingual zoning corpus,
using Cohen’s kappa (k), accuracy and f1-score between annotators A1 and A2. ............ 61
Table 5.11 – Multilingual zero-shot evaluation of OKAPI trained with Bevendorff et al. (2020)
English Gmane Corpus. Global accuracy (all zones) and each zone recall computed by
averaging the scores for both sets of annotations. ......................................................... 62
LIST OF ABBREVIATIONS AND ACRONYMS
ML Machine Learning
GD Gradient Descent
FC Fully Connected
BoW Bag-of-Words
OOV Out-of-Vocabulary
1. INTRODUCTION
1.1. MOTIVATION
Worldwide, email is a predominant means of social and business communication, with an estimated
306 billion business and consumer emails sent daily by more than 4 billion users (The Radicati Group, Inc.
Email Statistics Report, 2020-2024 Executive Summary, available at https://www.radicati.com/?p=16510). Its importance has
attracted studies in areas of Machine Learning (ML) and Natural Language Processing (NLP),
impacting a wide range of applications, from spam filtering (Qaroush, Khater, & Washaha, 2012) to
network analysis (Christidis & Losada, 2019).
In particular, Customer Service has benefited from the tasks that take advantage of the
increasing amount of available customer email text. Linguistic features from client emails can be
used to automatically classify emails (Coussement & den Poel, 2008), to extract problem descriptions
from email-based conversations in multiple languages (Koehler et al., 2018), and can lead to a more
efficient answer template selection (Sneiders, 2016). These tasks allow companies to make the
answering process less complex, improving the way customers are handled and increasing the overall
business success.
The body of an email is commonly perceived as unstructured textual data with multiple formats.
However, it is possible to discern a level of formal organization in the way most emails are written.
Different functional segments can be identified such as greetings, signatures, quoted content, legal
disclaimers, advertisements, etc. An example of email segment identification is shown in Figure 1.1.
[authored content]
The February NCL Regular Meeting will be this coming Monday, February 12, beginning at 7:30 p.m.
at Cypress Creek Community Center. We will be collecting items to donate to Northwest Assistance
Ministries, which has suffered a serious cutback in funding. They are particularly in need of personal
care items such as toothbrushes, toothpaste, deodorant, hair care items, etc.

[advertisement]
Do You Yahoo!?
Yahoo! Auctions - Buy the things you want at great prices.
http://auctions.yahoo.com

Figure 1.1 – Example email with two identified functional segments: authored content and
advertisement.
The segmentation of email text into parts, also known as email zoning (Lampert, Dale, & Paris,
2009), has become a prevalent preprocessing task for a diversity of downstream applications, such as
author profiling (Estival, Gaustad, Pham, Radford, & Hutchinson, 2007), request detection (Lampert,
Dale, & Paris, 2010), uncovering of technical artifacts (Bettenburg, Adams, Hassan, & Smidt, 2011),
automated template induction (Proskurnia et al., 2017), email classification (Kocayusufoglu et al.,
2019), or automated email response suggestion (Chen et al., 2019).
Because email zoning has only recently been established as a formal task, its literature is not vast and
there is no absolute taxonomy defined for the email zones. Even though some of the zones tend to
be present across multiple contexts, others are domain-specific and require an extensive look at the
email corpora and a clear understanding of the research needs. Furthermore, although email
communication is a worldwide phenomenon, the email zoning literature is English-centric.
The email processing tasks used in Customer Service could benefit from the identification of email
functional zones, since by segmenting the email into its formal divisions, companies can extract only
the zones that are pertinent for downstream applications. Moreover, most companies deal with
clients from all over the world, making Customer Service applications highly multilingual.
To understand the real impact of an email zoning solution in a Customer Service company,
we build a case study around the company CLEVERLY AI. CLEVERLY AI, or just CLEVERLY, is a Portuguese
startup that offers a knowledge management platform for customer support teams. CLEVERLY's
software works by adding an Artificial Intelligence layer on top of other companies' help desk
systems, granting them a database of internal procedures such as triage of clients’ requests,
assistance to help desk agents and automatic replies.
The assistance offered to help desk agents can be further divided into suggesting reply
options and enabling resolution procedures. These capabilities are achieved by,
among other processes, automatically classifying incoming client emails (tickets), based on the
predictions outputted by an advanced machine learning classification model that underlies CLEVERLY's
platform.
Figure 1.2 – Overview of CLEVERLY AI ticket classification pipeline with email zoning as a preprocessing
step.
The effectiveness of a classification model depends on the quality of the input data. In
CLEVERLY's case, the quality of the classification of tickets into categories depends on the amount of
viable information contained in the ticket corpus used to train the classifier. One of the ways to
improve the quality of the corpus is, thus, to filter the zones of the tickets that do not contribute to
the model's understanding of the problems.
To improve its classification model, CLEVERLY wants to introduce an email zoning system as
part of its classification pipeline. Figure 1.2 depicts an overview of this new pipeline. By adding an
email zoning system as an upstream task, non-optimal ticket content is removed. Currently, the
company is preprocessing tickets by mixing a set of hand-coded rules with the machine learning
package Talon (https://github.com/mailgun/talon), in order to remove quoted text and signature lines. Nevertheless, this type of
approach may not be sufficiently dynamic for a company that deals with an ever-growing client
base whose tickets differ in formal email layout and language.
This way, the CLEVERLY team believes that they will benefit from the implementation of an
email zoning solution based on an advanced machine learning model, one that is more flexible to
changes in ticket features and able to deal with multiple languages.
1.2. OBJECTIVES AND METHODOLOGY

The main contributions of this research are the following:

1. We discuss the existing email zoning corpora, the corresponding email zoning classification
schemas, the existing email zoning systems, and their limitations.
2. We design a new email zoning classification schema for CLEVERLY and annotate zones for
15,547 tickets in 5 languages – English, Portuguese, Spanish, French and Italian.
3. We create the first publicly available multilingual annotated corpus for email zoning. This
corpus consists of 625 emails in 3 languages - Portuguese, Spanish and French - and
encompasses 15 email zones.
4. We introduce five email zoning systems, among them OKAPI, a multilingual email
segmentation system built on top of XLM-RoBERTa (Conneau et al., 2020) that can be easily
extended to 100 languages. To the best of our knowledge, OKAPI is the first end-to-end
multilingual system exploring pre-trained transformer models (Vaswani et al., 2017) to
perform email zoning.
5. We use our five systems to segment CLEVERLY’s annotated tickets and evaluate and compare
their effectiveness. We also describe the time taken by our best
system (OKAPI) to process a total of 84,929 tickets and compare its performance against
CLEVERLY’s current solution by measuring the performance of the downstream ticket
classification model with each approach.
6. We test OKAPI against other systems from the literature in different email zoning corpora
including our new multilingual corpus, showing that besides effectively generalizing for
unseen languages, OKAPI reaches competitive results with current English benchmarks and
state-of-the-art performances for domain adaptation tasks.
A detailed account of each of the subjects listed above is given in the remainder of this
thesis: Background presents a review of the concepts that are the basis for the development of this
research; Related Work discusses previously implemented email zoning methods and provides a
comprehensive review of existing email zoning corpora, as well as relevant work in the area of text
segmentation; Methodology defines the zoning schema adopted for CLEVERLY’s context and presents
a thorough analysis of CLEVERLY’s corpus and other public corpora, namely our new multilingual
corpus. This section also presents the email zoning systems (models) developed and the evaluation
metrics used to measure their performance. Results and discussion reports and discusses the
experiments done and results achieved for CLEVERLY corpora and public corpora; Conclusions
summarizes and concludes the work developed throughout this research; and Limitations and
recommendations for future works addresses limitations of our work and interesting ideas to be
developed in the future.
Finally, during the development of this research, we had a paper accepted at the EACL Student
Research Workshop (https://sites.google.com/view/eaclsrw2021/home), held in conjunction with EACL 2021. The
paper is available at https://arxiv.org/abs/2102.00461 and
https://github.com/cleverly-ai/multilingual-email-zoning (Jardim, Rei, & Almeida, 2021).
2. BACKGROUND
This section presents the fundamental machine learning concepts that are the basis for this research.
We review supervised machine learning (section 2.1) and discuss how to convert natural language
texts into representations that machine learning algorithms can interpret (section 2.2).
2.1. SUPERVISED MACHINE LEARNING

In supervised machine learning, a model learns to map input features 𝑥 to an output 𝑦 from a set of
labeled training examples. Supervised learning can be divided into classification and regression problems. This division is
related to the nature of 𝑦 values. In case 𝑦 is a set of finite categorical values and the model seeks to
predict the category 𝑥 belongs to, we have a classification problem. Some examples of classification
problems include spam detection, fraud detection, or sentiment analysis. On the other hand, if 𝑦 is a
continuous variable of real values, we have a regression problem. House price prediction and
temperature prediction are examples of regression problems.
One of the most simple and popular classification algorithms is the K-Nearest Neighbors
(KNN) (Fix & Hodges, 1989). This algorithm relies on the assumption that similar instances belong to
the same categories. Based on the input features, the KNN algorithm calculates the distance
between the new input and the training instances using a similarity function and, finally, assigns the
input to the most common class of its 𝑘 closest training instances, where 𝑘 is a positive integer,
generally of small value. This algorithm is considered non-linear since its decision boundaries on the
feature space are not linear.
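As an illustration (not part of the original text), the following Python sketch implements this neighbor search with NumPy, assuming Euclidean distance as the similarity function and a simple majority vote; the toy data points are made up.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=3):
        # Distance between the new input and every training instance.
        distances = np.linalg.norm(X_train - x_new, axis=1)
        # Indices of the k closest training instances.
        nearest = np.argsort(distances)[:k]
        # Assign the input to the most common class among those neighbors.
        return Counter(y_train[nearest]).most_common(1)[0][0]

    X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([0.85, 0.75]), k=3))  # -> 1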
On the contrary, linear methods deal with linear decision boundaries. The objective of linear
methods is to model the relationship between a dependent variable 𝑦 and the explanatory variables
𝑥 = {𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } as a function 𝑓(𝑥). A Linear Regression is an example of a linear method. This
method tries to model the relationship between two variables by fitting a linear equation to the
observed data:
𝑦 = 𝛽0 + 𝛽1 𝑥 (2.1)
Equation 2.1 describes a line where 𝑦 is the dependent variable, 𝑥 the explanatory variable,
𝛽0 is the value of 𝑦 when 𝑥 = 0, the intercept, and 𝛽1 is the slope of the line.
Supervised learning methods can also be divided into generative methods and discriminative
methods, based on the way classifiers are computed. We resort to a generative model if, to compute
a label 𝑦 based on observation 𝑥, we estimate the joint distribution 𝑃(𝑥, 𝑦) and from that compute
the probability 𝑃(𝑦|𝑥), as a basis for the classifier. This means the data is modeled based on how it
was generated. A simple example of a generative method is the naïve Bayes model, defined below in
equation 2.2:
P(y|x) = \frac{P(x|y)\,P(y)}{P(x)} \qquad (2.2)
This model assumes that the predictors 𝑥 are independent and have the same importance,
thus being called naïve. It then calculates the posterior conditional probability of 𝑦 having 𝑥 and
assigns 𝑦 to the class with the highest probability.
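For illustration only (not from the thesis), the sketch below applies equation 2.2 to a made-up spam/ham example with two binary word-presence features; the priors and likelihoods are arbitrary numbers, not estimates from real data.

    # Toy naive Bayes classification under the conditional-independence assumption.
    priors = {"spam": 0.4, "ham": 0.6}                      # P(y)
    likelihoods = {                                          # P(feature = 1 | y)
        "spam": {"free": 0.8, "meeting": 0.1},
        "ham":  {"free": 0.2, "meeting": 0.7},
    }
    x = {"free": 1, "meeting": 0}                            # observed features

    def joint(y):
        # P(x | y) P(y), multiplying independent feature likelihoods.
        p = priors[y]
        for word, present in x.items():
            p_word = likelihoods[y][word]
            p *= p_word if present else (1.0 - p_word)
        return p

    evidence = sum(joint(y) for y in priors)                 # P(x)
    posterior = {y: joint(y) / evidence for y in priors}     # P(y | x), eq. 2.2
    print(max(posterior, key=posterior.get), posterior)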
If the problem to be modeled requires the assumption that the inputs are interdependent
between each other, i.e. the input data is sequential, we cannot follow a naïve approach like the one
described using the naïve Bayes model. We need an approach that enables the modeling of a
sequence of input and output instances. Hidden Markov Models (HMM) (Baum & Petrie, 1966) are
an example of generative models capable of modeling the joint distribution 𝑃(𝑦, 𝑥) when dealing
with sequential data.
HMM assume that a random observed variable is not the state of a system but simply data
generated by hidden states of that system. Thus, the occurrence of an observation is conditioned by
the underlying state. As the name reveals, HMM rely on Markov process assumptions (Markov,
1953), which state that if the present state in a sequence is known, no more information is needed to
predict the immediate future state. This way, given a set of hidden states and a set of observed
variables, the model calculates the probability of a sequence of hidden states (transition
probabilities), then, using the present state, it can calculate what is the observation with the highest
probability of occurring and/or the most likely next state.
Diverse types of neural networks have been developed in response to the emergence of
new tasks and the capabilities afforded by ever-increasing computational power. Some of
them like the Multilayer Perceptron (Rumelhart, Hinton, & Williams, 1986), Convolutional Networks
(LeCun et al., 1989), and Recurrent Neural Networks (Elman, 1990) will be addressed later in this
section and have been showing state-of-the-art results in a broad range of NLP tasks, such as
machine translation and question-answering (Zhou, Duan, Liu, & Shum, 2020).
2.1.1. Perceptron
A Perceptron (Rosenblatt, 1958) is the most basic form of an artificial neural network, being the
building block of other neural network variations. The perceptron is composed of an input layer, with
a weight matrix 𝑊, a bias term 𝑏, and an activation function 𝑔(𝑧). The perceptron works by receiving
an input 𝑥 of 𝑛 numeric features, taking the dot product between the input and 𝑊, summing the
values with the bias term 𝑏, leading to a score 𝑧 = 𝑥𝑊 + 𝑏. This score is then passed through an
activation function g(z). The perceptron is defined as follows:
𝑃𝑒𝑟𝑐𝑒𝑝𝑡𝑟𝑜𝑛(𝑥) = 𝑔(𝑥𝑊 + 𝑏) (2.3)
Some of the most common activation functions are the Logistic (sigmoid), the Hyperbolic
Tangent (tanh), the Rectified Linear Unit (ReLU), and the Normalized Exponential function (softmax).
These functions allow the Perceptron to output a nonlinear function. Taking ReLU as an example, the
function can formally be defined by the following equation:
R(z) = \max(0, z) \qquad (2.4)

This function gives an output 𝑅(𝑧), whose value is 0 if 𝑧 is less than 0 and 𝑧 if 𝑧 is equal to
or above 0. Another popular activation function is the softmax. This function takes its input vector
and normalizes it into numbers that can be seen as a probability distribution. The function works by
taking the exponents of each element 𝑧𝑖 from the previous layer output 𝑧 = {𝑧1 , . . . , 𝑧𝐾 } and then
normalizing each number by the sum of the exponents, as shown in the equation below:
\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \dots, K \qquad (2.5)
The softmax is typically used as the output layer of a neural network because it enables the
network to output a score of the input value for a specific class. As for the ReLU, although it can be
used as the output activation function, it is mostly used as an activation function in intermediary
layers of more complex neural networks, such as the Multilayer Perceptron (Rumelhart et al., 1986).
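The sketch below (not from the thesis) traces equation 2.3 in NumPy with the ReLU and softmax functions just described; the weights are random placeholders rather than learned values.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def softmax(z):
        e = np.exp(z - np.max(z))        # subtract max for numerical stability
        return e / e.sum()

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)               # input with n = 4 numeric features
    W = rng.normal(size=(4, 3))          # weight matrix for 3 output classes
    b = np.zeros(3)                      # bias term

    z = x @ W + b                        # score z = xW + b
    print(relu(z))                       # nonlinear activation g(z)
    print(softmax(z), softmax(z).sum())  # class scores summing to 1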
2.1.2. Multilayer Perceptron

While in a Perceptron (Rosenblatt, 1958) there is only one layer where activation of the inputs
occurs, in a Multilayer Perceptron (MLP) (Rumelhart et al., 1986) there are two or more layers of
transformations. MLPs consist of an input layer, an output layer, and, in between those two, one or
more hidden layers. One can look at MLPs as being composed of more than one Perceptron. For each
input, the MLP applies a linear transformation by taking the dot product of the input and the set of
weights 𝑊 𝑖 between the input layer and the hidden layer. These values are summed with the bias
term 𝑏 𝑖 and passed through an activation function for every node in the hidden layer. Once the
output of every node in the hidden layer is calculated, these values are pushed to the next hidden
layer or to the output layer, which takes the output from the last hidden layer, performs similar
computations to the ones done in the hidden layers, and returns the output values of the network.
2.1.3. Backpropagation
The set of Weights 𝑊 𝑖 and bias terms 𝑏 𝑖 used in-between layers are the MLP parameters
𝜃𝑚𝑜𝑑𝑒𝑙 . To train this model and improve the network performance for a certain task, the network
should be able to learn to adjust 𝜃𝑚𝑜𝑑𝑒𝑙 . Backpropagation (Linnainmaa, 1976; Rumelhart et al., 1986;
Werbos, 1974) is a widely used algorithm to train neural network models. This method computes the
gradient (derivative) of the loss function 𝐿(𝑦̂, 𝑦; 𝜃𝑚𝑜𝑑𝑒𝑙 ) with respect to 𝜃𝑚𝑜𝑑𝑒𝑙 by comparing the output
values 𝑦̂ with the correct answers 𝑦. The weights are adjusted to reduce the loss function. This
iterative process of optimization to find the minimum of the Loss function is called the Gradient
Descent (GD) algorithm (Cauchy, 1847; Curry, 1944) and it is described in Algorithm 1:
𝜃𝑚𝑜𝑑𝑒𝑙 ← any point in the parameter space
while 𝐿(𝑦̂, 𝑦; 𝜃𝑚𝑜𝑑𝑒𝑙) > 𝜖 do
    for 𝑤𝑖 ∈ 𝜃𝑚𝑜𝑑𝑒𝑙 do
        𝑤𝑖 ← 𝑤𝑖 − 𝛼 (𝑑/𝑑𝑤𝑖) 𝐿(𝑦̂, 𝑦; 𝜃𝑚𝑜𝑑𝑒𝑙)
    end for
end while
Algorithm 1 has a hyper-parameter (a parameter whose value is set before the learning process
begins), the learning rate 𝛼, used to define how much 𝜃𝑚𝑜𝑑𝑒𝑙 should be changed at each iteration.
Besides allowing non-linearity for the predictions, non-linear functions
also allow for the gradients of the functions to be dependent on the input value, while linear
functions have a constant gradient. The number of 𝑦 and 𝑦̂ used to compute the loss function can
differ, leading to different GD versions. When the algorithm uses only one sample at a time to
compute the loss function, we call this process Stochastic Gradient Descent (SGD) (Kiefer &
Wolfowitz, 1952; Robbins & Monro, 1951). On the other hand, if the algorithm evaluates every
training example at each step and only then updates the network parameters, the process is called
Batch Gradient Descent. SGD tends to be computationally faster than Batch GD, but its higher
number of updates can result in noisy gradients.
A third approach that has been increasingly used is the Mini-Batch GD, which separates the
training set into small batches and updates 𝜃𝑚𝑜𝑑𝑒𝑙 for each of those batches, creating a balance
between both SGD and Batch GD approaches. Variants of the SGD that can work with full batches or
mini batches are the Root Mean Square Propagation (RMSProp) and the Adaptive Moment
Estimation (Adam) (Kingma & Ba, 2015).
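As a concrete illustration (not part of the original text), the following NumPy sketch runs the loop of Algorithm 1 in its full-batch form, fitting the linear model of equation 2.1 under a squared-error loss; the data, learning rate and stopping threshold are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=100)
    y = 2.0 + 3.0 * x + rng.normal(scale=0.01, size=100)   # noisy line

    beta0, beta1 = 0.0, 0.0        # theta_model, starting at an arbitrary point
    alpha, eps = 0.1, 1e-3         # learning rate and stopping threshold

    def loss(b0, b1):
        return np.mean((y - (b0 + b1 * x)) ** 2)

    for _ in range(10_000):        # guard against an endless loop
        if loss(beta0, beta1) <= eps:
            break
        y_hat = beta0 + beta1 * x
        # Gradients of the loss with respect to each parameter.
        g0 = -2 * np.mean(y - y_hat)
        g1 = -2 * np.mean((y - y_hat) * x)
        beta0 -= alpha * g0        # w_i <- w_i - alpha * dL/dw_i
        beta1 -= alpha * g1

    print(round(beta0, 2), round(beta1, 2))   # close to the true 2.0 and 3.0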
2.1.4. Overfitting and Regularization

One of the most prominent problems present in neural networks with a considerable number of
hidden layers (deep neural networks) is the lack of control over the learning process. Even though
the high number of parameters and freedom to learn are the main characteristics that enable neural
networks to fit complex problems, they also come with the overfitting drawback. Overfitting occurs
when the model closely fits the training data but has difficulty generalizing its learning to unseen
data examples. For networks like the MLP that have complex hidden structures, the ability to extract
features from the training data can lead the model to learn irrelevant features that represent
randomness present in the training dataset. The model will then make predictions based on that
noise, which will not hold for new data.
A method to check whether a model is overfitting is to divide the available data into three
parts – train set, validation set, and test set. The percentage of instances that should be held
by each set depends on the size of the whole dataset and complexity of the model, although a
common division is a split of 60%, 20%, and 20%, respectively. The model learns from the train set
and uses the validation set to track progress for each learning epoch to optimize its performance. As
for the test set, it is used after the model is trained to measure its performance and check for
overfitting in case the error rate on the validation set is much lower than the one on the test set.
One common technique to face an overfitting model is to increase the amount of training
data. This will give the model the ability to learn from a training set that is a better generalization of
unseen data. Nevertheless, it is not always the case that more data is available to train the model.
Another option is to use another data set splitting technique, such as k-fold Cross-validation
(Mosteller & Tukey, 1968), which uses a training and validation set split, and for each k learning
iteration changes the instances that are in each split, averaging the validation results over the
iterations.
Regularization is yet another common method used to avoid overfitting, and it consists of
constraining the complexity of a model by adding a regularization term to the loss function. An
immensely popular regularization method used to prevent neural networks from overfitting is the
dropout (Srivastava, Hinton, Krizhevsky, & Salakhutdinov, 2014). Although dropout is considered a
regularization method, it does not directly add a term to the loss function; instead, every unit of a
neural network (except for the output units) has a probability 𝑝 of being ignored during the learning
process. This will result in a smaller network compared to the original one, called a thinned network.
For every learning sample, different nodes are dropped, and new thinned networks are used
for training. This means that, for each iteration, a node will receive different combinations of inputs.
At test time, dropout is not used – a single, unthinned network is used for prediction. The outgoing weights of each
unit are multiplied by the retention probability (1 − 𝑝) so that the expected output of a hidden unit at test time
corresponds to the one at training time.
The advantage of this method is that it allows combining several neural network models
that share the same set of hyperparameters and can be trained without additional computational cost,
increasing generalization and avoiding overfitting.
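The following NumPy sketch (not from the thesis) illustrates this scheme on a single hidden layer, assuming 𝑝 is the probability of dropping a unit during training and the corresponding (1 − 𝑝) scaling is applied at test time.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.5                                   # probability of dropping a unit
    h = rng.normal(size=8)                    # activations of a hidden layer

    # Training: each unit is ignored (zeroed) with probability p, producing a
    # "thinned" network for this sample.
    mask = rng.random(8) >= p
    h_train = h * mask

    # Test: no units are dropped; activations are scaled so their expected
    # value matches what the next layer saw during training.
    h_test = h * (1.0 - p)

    print(h_train)
    print(h_test)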
2.1.5. Convolutional Neural Networks

Another type of neural network is the Convolutional Neural Network (CNN). The simplest form of a
CNN architecture was proposed by Fukushima (1980), inspired by previous discoveries about the
visual cortex of mammals. After that, other refinements modified CNNs, such as backpropagation
training (LeCun et al., 1989). The popularity and application of CNNs comes mostly from tasks in the
areas of image and video recognition, although their use has been increasing across NLP tasks.
The network is most typically composed of a Convolutional layer, a Pooling layer, and a Fully
Connected layer (another name for MLP). The Convolutional layer, which is the building block of a
CNN, has as parameters 𝜃𝑚𝑜𝑑𝑒𝑙 a set of learnable filters (kernels) that focus on a patch of the input at
a time and, during the forward pass of the algorithm, take the dot product between their values
and the input patch values, in a process called convolution, producing a matrix called activation map
and reducing the dimensionality of the input. Applying this idea to an NLP task with a one-dimensional
convolution, we define an input sentence composed of words as 𝑥 = {𝑥𝑖 , 𝑥𝑖+1 , . . . , 𝑥𝑖+ℎ }.
Let 𝑥𝑖:𝑖+ℎ−1 be a window of ℎ words; the convolutional layer filter 𝑤 is applied to this window,
generating a new feature 𝑐𝑖 , as follows:

c_i = f(w \cdot x_{i:i+h-1} + b) \qquad (2.6)

Where 𝑏 is a bias term and 𝑓 a non-linear function, such as ReLU, that introduces non-
linearity to the network. This filter is applied to each combination of words in the input sentence,
with the same window size, producing a rectified feature map 𝑐 = [𝑐1 , 𝑐2 , … , 𝑐𝑛 ]. For 𝑛
filters, varying in window size, 𝑛 feature maps are obtained. After computing the feature maps, these
are passed to a pooling layer, typically max pooling, that reduces the size of the representation by
taking the maximum value 𝑐̂ = 𝑚𝑎𝑥{𝑐} of each map. This way, the model can capture the most
important feature for each filter. By reducing the feature size, it reduces the number of parameters,
the amount of computation, and controls overfitting.
Finally, the features 𝑐̂ are passed to the Fully Connected (FC) Layer. Inside the FC layer, the
computations are identical to the ones done inside a Perceptron or an MLP, with a softmax activation
function in the output layer that produces the distribution of the input over classification classes.
Let 𝑝 = {𝑐̂𝑖 , … , 𝑐̂𝑖+𝑛 } be the set of features passed to the FC layer; the output 𝑂 is obtained by
applying the softmax 𝑓 over the sum of the dot product between the FC layer weights 𝑊 and the
input values 𝑝:
𝑧 = 𝑠𝑢𝑚(𝑊𝑝)
𝑂 = 𝑓(𝑧) (2.7)
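For illustration, the PyTorch sketch below (not part of the original text) assembles the pieces just described for a single sentence – an embedding lookup, a one-dimensional convolution with window size 3, max pooling over the feature map, and a fully connected softmax output; all dimensions are arbitrary choices.

    import torch
    import torch.nn as nn

    vocab_size, embed_dim, num_classes = 1000, 50, 3
    sentence = torch.randint(0, vocab_size, (1, 12))   # batch of 1, 12 word ids

    embed = nn.Embedding(vocab_size, embed_dim)
    conv = nn.Conv1d(in_channels=embed_dim, out_channels=16, kernel_size=3)  # window h = 3
    fc = nn.Linear(16, num_classes)

    x = embed(sentence).transpose(1, 2)     # (1, embed_dim, 12) for Conv1d
    feature_map = torch.relu(conv(x))       # rectified feature map c
    pooled, _ = feature_map.max(dim=2)      # max pooling: strongest feature per filter
    logits = fc(pooled)
    print(torch.softmax(logits, dim=1))     # distribution over classes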
2.1.6. Conditional Random Fields

MLPs and CNNs are types of networks called feedforward neural networks, where each neuron in
one layer has only direct connections to the neurons in the next layer. This means that the
information moves only forward, from the input, through the hidden nodes, and to the output
nodes. This characteristic can result in limitations regarding their ability to process sequential data
such as sequences of words. In this respect, Conditional Random Fields (CRF) (Lafferty, McCallum, &
Pereira, 2001) are an example of discriminative models that are suitable to process sequences of
data.
CRFs are used for structured prediction – a supervised machine learning technique used to predict
structured objects (sequences) instead of scalar discrete or real values – and are graphically modeled,
i.e., modeled in a way that enables the conditional dependence structure between variables to be
expressed. A popular type of
CRF is the linear chain CRF, which considers sequential dependencies between the model’s
predictions. The aim of a linear chain CRF is to calculate the conditional probability 𝑃(𝑦|𝑥) of an
output sequence 𝑦 given the input sequence 𝑥. Thus, to predict the output sequence we extract the
sequence with the highest probability. The CRF formula to calculate 𝑝(𝑦|𝑥) can be broken down into
two components – a weights and features component and a normalization component:

p(y|x, \lambda) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(x, i, y_{i-1}, y_i) \right) \qquad (2.8)

Where 𝑍(𝑥) is the normalization component that sums over all possible state sequences so that
the probabilities total 1. This transformation turns the output into a probability:

Z(x) = \sum_{y'} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(x, i, y'_{i-1}, y'_i) \right) \qquad (2.9)
In equations 2.8 and 2.9, 𝑓𝑗 is a feature function in which 𝑥 represents the set of input
vectors, 𝑖 the position of the data point we want to predict, 𝑦𝑖−1 the label of data point 𝑖 − 1 and 𝑦𝑖
the label of data point 𝑖 in 𝑥. The 𝜆𝑗 parameter represents the weight of the feature function and it
is estimated using a maximum likelihood method. To train the model, i.e. optimize the parameters, an
iterative method such as gradient descent is used.
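As a toy illustration (not from the thesis), the sketch below evaluates equations 2.8 and 2.9 by brute force for a two-label, three-step sequence; the emission and transition scores stand in for the weighted feature sums and are random numbers.

    import numpy as np
    from itertools import product

    labels = [0, 1]
    n = 3
    rng = np.random.default_rng(0)
    emission = rng.normal(size=(n, 2))        # score of label y_i at position i
    transition = rng.normal(size=(2, 2))      # score of moving from y_{i-1} to y_i

    def score(seq):
        # Plays the role of sum_i sum_j lambda_j f_j(x, i, y_{i-1}, y_i).
        s = emission[0, seq[0]]
        for i in range(1, n):
            s += transition[seq[i - 1], seq[i]] + emission[i, seq[i]]
        return s

    # Z(x): sum over every possible label sequence (equation 2.9).
    Z = sum(np.exp(score(seq)) for seq in product(labels, repeat=n))

    # p(y|x): probability of one particular sequence (equation 2.8).
    y = (0, 1, 1)
    print(np.exp(score(y)) / Z)

    # Prediction: the label sequence with the highest probability.
    print(max(product(labels, repeat=n), key=score))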
2.1.7. Recurrent Neural Networks

A type of neural network that is commonly used to process sequential data is the
Recurrent Neural Network (RNN) (Elman, 1990). This network has recurrent connections between
layers, increasing its ability to process sequences of arbitrary length as input. This ability is enabled
by a feedback connection that stores information over time steps, known as the internal state. This
way, the RNN consists of a feedforward layer and a recurrent layer. Its simplest form, the
Vanilla RNN, can be expressed with the following equation:
ℎ𝑡 = 𝑔(𝑥𝑡 𝑊 𝑥 + 𝑠𝑡−1 𝑊 𝑠 + 𝑏)
𝑠𝑡 = ℎ𝑡 (2.10)
From equation 2.10, the RNN works by receiving an input 𝑥𝑡 together with a state vector
𝑠𝑡−1 , at each time step, that is linearly transformed by the weight matrices 𝑊 𝑥 and 𝑊 𝑠 . The linear
transformations along with the bias term 𝑏, similarly to the MLP, are passed through an activation
function (e.g. Tanh) that produces a hidden state ℎ𝑡 . Then, the computed hidden state for the last
time step is used as a state vector 𝑠𝑡 for the next input 𝑥𝑡+1 . Figure 2.1 depicts an unfolded RNN.
In theory, RNNs should be able to process and represent information of an entire sentence
with long-term dependencies between its constituents, but in practice this often fails because
backpropagation through time becomes ineffective over long sequences. In other words, when an RNN is learning to store
information through time, the values of the gradients become so small that the model stops learning
or takes too much time to learn, a problem known as vanishing gradients.
A refined form of the RNN, the Long Short-Term Memory Networks (LSTM) (Hochreiter &
Schmidhuber, 1997), was proposed to address the problem of the vanishing gradients. The network
architecture is designed to learn long-term dependencies. The LSTM holds a cell state that carries
information across the different time steps of the sequence, receiving minimal updates based on
three different gates, forget, input, and output, that control the information held by the cell state.
Equation 2.11 shows the functions for the LSTM.
𝑓𝑡 = 𝜎(𝑊𝑓 [ℎ𝑡−1 ; 𝑥𝑡 ] + 𝑏𝑓 )
𝑖𝑡 = 𝜎(𝑊𝑖 [ℎ𝑡−1 ; 𝑥𝑡 ] + 𝑏𝑖 )
𝑐̂𝑡 = tanh(𝑊𝑐 [ℎ𝑡−1 ; 𝑥𝑡 ] + 𝑏𝑐 )
𝑐𝑡 = 𝑓𝑡 ⊙ 𝑐𝑡−1 + 𝑖𝑡 ⊙ 𝑐̂𝑡
𝑜𝑡 = 𝜎(𝑊𝑜 [ℎ𝑡−1 ; 𝑥𝑡 ] + 𝑏𝑜 )
ℎ𝑡 = 𝑜𝑡 ⊙ tanh(𝑐𝑡 ) (2.11)
The network receives a new input 𝑥𝑡 , the previous cell state 𝑐𝑡−1 and the previous hidden
state ℎ𝑡−1 , for each time step 𝑡. The forget gate 𝑓𝑡 , a sigmoid layer, looks at ℎ𝑡−1 and 𝑥𝑡 and
computes an output between 0 and 1, indicating what information from step 𝑡 − 1 should be erased.
Similarly, the input gate 𝑖𝑡 , another sigmoid layer, decides what information from the input should
be kept. Next, a tanh layer computes a vector 𝑐̂𝑡 of new candidate values and the forget gate 𝑓𝑡 is
multiplied with the cell state from the previous step 𝑐𝑡−1 . This value is then summed with the
multiplication of the values coming from the input gate 𝑖𝑡 and 𝑐̂𝑡 . This process enables the network
to store new information. The output of the current step is given by the output gate 𝑜𝑡 , and the
hidden state used at 𝑡 + 1 is given by ℎ𝑡 . This process is illustrated in Figure 2.2.
Like a simple RNN, LSTMs can also be trained via gradient descent, but in this case, the model
parameters are the Weight matrices and bias terms for each gate and candidate cell calculation
function: 𝜃𝑚𝑜𝑑𝑒𝑙 = {𝑊𝑓 , 𝑊𝑖 , 𝑊𝑐 , 𝑊𝑜 , 𝑏𝑓 , 𝑏𝑖 , 𝑏𝑐 , 𝑏𝑜 }.
Cho et al. (2014) introduced a variation of the traditional LSTM, called the Gated Recurrent
Unit (GRU). This model considers only two gates - the update gate 𝑧𝑡 and the reset gate 𝑟𝑡 . The GRU
does not have different cell states - for each time step the memory is completely exposed.
The mentioned RNN types only have access to the past context when dealing with an input
sequence. In some situations, it might be useful to access the future context or both the past and
future contexts. Graves & Schmidhuber (2005) introduced the Bidirectional Recurrent Neural
Networks. In this type of architecture, one RNN model processes the sequence forward, and another
independent RNN model processes the same sequence backward, merging the forward and
backward outputs for each time step to combine past and future information. Bidirectional LSTM
(BiLSTM) and Bidirectional GRU (BiGRU) follow the same idea, applying independent LSTMs or GRUs
that process the input sequence in opposite directions.
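The PyTorch sketch below (not part of the original text) shows a bidirectional LSTM reading a sequence of random vectors; the forward and backward hidden states are concatenated at each time step, which is why the output dimension doubles.

    import torch
    import torch.nn as nn

    seq_len, input_size, hidden_size = 10, 32, 64
    x = torch.randn(1, seq_len, input_size)        # batch containing one sequence

    bilstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
    output, (h_n, c_n) = bilstm(x)

    print(output.shape)   # (1, 10, 128): forward + backward states per time step
    print(h_n.shape)      # (2, 1, 64): final hidden state for each direction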
2.1.8. Attention Mechanism

Sutskever, Vinyals, & Le (2014) proposed the application of sequence-to-sequence models – models
that receive a sequence of items and output another sequence of items – for machine translation.
These models are composed of an encoder and a decoder phase. The encoder processes an input
sequence and outputs a context vector. The decoder receives the context vector outputted by the
encoder and produces the final output sequence item by item. Both the encoder and decoder tend
to be composed of multiple RNNs.
The context vector outputted by the encoder turned out to be a bottleneck as it makes it
difficult for the model to process long sentences. To address this issue, Bahdanau, Cho, & Bengio (2015)
introduced a technique called Attention, which allows the decoder phase of the model to focus on
the parts of the input sequence that are relevant to the word currently being generated.
c_t = \sum_{i=1}^{m} a_t(i)\, h_i^e \qquad (2.12)

From equation 2.12, the attention decoder hidden state ℎ𝑡𝑑 of the 𝑡𝑡ℎ decoding step looks at
the encoder hidden state ℎ𝑘𝑒 of every 𝑘 𝑡ℎ word in the source, computing a softmax distribution that
results in a weighted sum of the encoder hidden states capable of retrieving relevant information
from the source input. This can be seen in equation 2.13:
a_t(i) = \frac{\exp(\mathrm{score}(h_t^d, h_i^e))}{\sum_{j=1}^{m} \exp(\mathrm{score}(h_t^d, h_j^e))} \qquad (2.13)
The Attention mechanism was further developed by other studies which introduced
variations such as calculating the 𝑠𝑐𝑜𝑟𝑒(ℎ𝑡𝑑 , ℎ𝑖𝑒 ) by taking the dot product of ℎ𝑡𝑑 and ℎ𝑖𝑒 (Luong,
Pham, & Manning, 2015).
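For illustration only (not from the thesis), the following NumPy sketch computes the attention distribution of equation 2.13 with the dot-product score of Luong et al. (2015) and the resulting weighted sum of encoder states; the hidden states are random placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 5, 8                                   # source length, hidden size
    h_enc = rng.normal(size=(m, d))               # encoder hidden states h^e
    h_dec = rng.normal(size=d)                    # decoder hidden state h_t^d

    scores = h_enc @ h_dec                        # score(h_t^d, h_i^e) as a dot product
    a_t = np.exp(scores) / np.exp(scores).sum()   # softmax distribution (eq. 2.13)
    context = a_t @ h_enc                         # weighted sum of encoder states

    print(a_t.round(3), context.shape)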
2.1.9. The Transformer

Vaswani et al. (2017) proposed a novel architecture called the Transformer (Figure 2.3), an attention-
based model that attends to the entire input sequence at once and improves
performance in tasks such as machine translation.
Similar to the models proposed by Bahdanau et al. (2015) and Luong et al. (2015) the
Transformer makes use of an encoder and decoder architecture, but, in this case, the encoder and
decoder components are stacks of encoders and decoders, respectively.
Each encoder is composed of multi-head attention, which serves the same purpose of a simple
attention layer, to look at other words in a sentence while processing a single word, and a
feedforward neural network. The output of each encoder is sent upwards to the next encoder in the
stack of encoders. The last encoder sends its outputs to each decoder in the stack of decoders. Each
decoder is formed by the same two sub-layers present in the encoder, but modifies the
multi-head attention to prevent positions from attending to subsequent positions. Additionally, the
decoder has a third sub-layer that performs attention over the output of the encoder stack. The last
decoder outputs a vector that enters a FC neural network followed by a softmax layer that turns the
output vector of the FC network into values resembling probabilities where the vector cell with the
highest probability represents the output word for the current time step.
Figure 2.3 – The architecture of the Transformer Model. Taken from https://tinyurl.com/y2jvw3m3.
The Transformer architecture has since become popular in a wide range of NLP tasks, such as
sequence classification, question answering and named entity recognition (Wolf et al., 2020).
Bidirectional Encoder Representations from Transformers (BERT) (Devlin, Chang, Lee, & Toutanova,
2019) and its variations are some of the most notable types of transformers, having set state-of-the-
art performances in various NLP tasks.
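As an illustration of how such pretrained models are used in practice (not part of the original text), the sketch below extracts contextual sentence embeddings with the Hugging Face transformers library; the checkpoint (xlm-roberta-base) and the mean pooling over token vectors are assumptions made here, not the setup described later in this thesis.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModel.from_pretrained("xlm-roberta-base")

    sentences = ["Thanks for reaching out.", "Obrigado pelo contacto."]
    batch = tokenizer(sentences, padding=True, return_tensors="pt")

    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (2, seq_len, 768)

    # One embedding per sentence: average the token vectors, ignoring padding.
    mask = batch["attention_mask"].unsqueeze(-1)
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    print(embeddings.shape)                               # torch.Size([2, 768])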
2.2. TEXT REPRESENTATION MODELS

This section discusses how natural language texts can be converted into representations that
machine learning algorithms can interpret, starting with sparse models and then moving to dense models.

2.2.1. Sparse Models
One of the simplest approaches to represent a textual document is the Bag-of-Words (BoW). In the
BoW model, each document is represented as a word-count vector, describing the occurrences of a
word 𝑤 (besides words, other textual tokens, such as punctuation marks and numbers, can be
considered), belonging to vocabulary 𝛤 with a finite size 𝑛, within a document ⅆ𝑖 . Additionally, a feature
weight can be assigned to each word, showing the importance of a word in a document. This way,
each document ⅆ𝑖 , from a collection of documents 𝐷, is represented by a feature vector 𝑣𝑖 =
{𝑤1,𝑖 , 𝑤2,𝑖 , … , 𝑤𝑛,𝑖 } where 𝑤𝑗,𝑖 corresponds to the weight of the word 𝑤𝑗 ∈ 𝛤 in the document ⅆ𝑖 ∈
𝐷. In its simplest form, the process of representing the weights 𝑤𝑗 from a document, is to consider
binary representations. In this case, the value for 𝑤𝑗,𝑖 is 1 if 𝑤 ∈ 𝛤 is present in ⅆ𝑖 , and 0 if it is not.
Model sparsity is the idea that given a large set of features, many dimensions of model
parameters are not needed for the task at hand. In a BoW, the size of 𝑣𝑖 is equal to 𝑛. Thus, given a
large vocabulary size 𝑛 and a collection of documents that share few of the words present in each,
most of the feature vectors 𝑣𝑖 are likely to be sparse, as they contain several
𝑤𝑗,𝑖 with value 0, making the BoW feature matrix sparse as well. The BoW approach allows
comparing documents based on the presence of words, which is useful to calculate document
similarities. To do so, one can use distance metrics, such as the Cosine Similarity. Based on the
similarities, we can perform other tasks such as document classification.
Nevertheless, BoW representations do not consider word order. One way to address this
problem is to consider n-gram sequences instead of single words. An n-gram is a contiguous
sequence of 𝑛 items from a document ⅆ𝑖 . The discussed BoW model used n-grams of 𝑛 = 1, where
each vocabulary item was a single word. If two words are grouped and constitute a single unit in the
vocabulary 𝛤, we are considering a bigram; if 𝑛 = 3, each vocabulary item is a trigram. While
this approach may extract more meaningful representations from the documents as it considers, in a
simplified way, word interactions, it falls short of capturing similarities between documents that
use the same words in different arrangements or that use synonyms. Additionally, this
representation does not assign weights for the words, besides 1 and 0, giving the same importance
to all words that are present in a document, while in theory, some words may be more important
than others.
A commonly used technique to assign weights to each word based on their importance in the
scope of a given document is the Term Frequency-Inverse Document Frequency (TF-IDF). Term
Frequency (TF) is the output of a BoW model for a specific document ⅆ from a set of documents 𝐷,
counting the times a word 𝑤 appears in ⅆ ∈ 𝐷. The inverse document frequency (IDF), is the weight
of a word across all documents. The more rarely a word occurs in 𝐷, the higher its IDF score. The product
of these two concepts results in the TF-IDF score, used to measure how important a word is to a
document. Each word will have a higher score for a certain document if it appears multiple times in
that document and it rarely occurs in other documents. The TF-IDF approach is easy to compute and
gives a vector representation able to better capture similarities between two documents. On the
other hand, this model is still based on the BoW, not expressing, for example, complex semantic
features.
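For illustration, the scikit-learn sketch below (not from the thesis) builds TF-IDF vectors for three made-up customer-service messages and compares them with cosine similarity.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "my order has not arrived",
        "the order arrived damaged",
        "please reset my account password",
    ]
    vectorizer = TfidfVectorizer()            # builds the vocabulary and the weights
    X = vectorizer.fit_transform(docs)        # sparse document-term matrix

    print(X.shape)                            # (3, vocabulary size)
    print(cosine_similarity(X[0], X[1]))      # the two order-related documents
    print(cosine_similarity(X[0], X[2]))      # a less similar pair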
Like documents, words can also be represented as vectors. In this case, we can use one-hot
encoding to make one-hot vector representation of words, where a word 𝑤𝑖 ∈ 𝛤 is represented as a
vector 𝑣𝑖 , with value 1 for the position in the vector that represents 𝑤𝑖 and all the other entries,
which represent the other words present in the vocabulary, with value 0. Another way to represent
words as vectors is to create a term-document matrix, in which each row represents a word and each
column a document. Each cell of the matrix records how many times a word appears in the
document. This way, row vectors provide the word representations. A similar matrix that considers
term to term, instead of term to document, can also be achieved by considering the number of times
two words appear in the same context, which can be an entire document or a window of words. The
word representation extracted from such a matrix is the vector given by the row or column for that word. Mirroring the idea behind the term-document matrix, where we consider similar documents to have similar words, with this approach we can consider that similar words tend to appear in similar contexts.
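The following sketch builds a small term-term co-occurrence matrix with a symmetric window of one word, using plain Python and NumPy; the toy corpus is invented.

```python
import numpy as np

corpus = [["the", "order", "shipped"], ["the", "order", "arrived"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
window = 1
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[idx[w], idx[sent[j]]] += 1

# the row for "order" is its vector representation:
# it co-occurs with "the", "shipped" and "arrived"
print(vocab)
print(cooc[idx["order"]])
```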
Furthermore, instead of using words as the minimal unit for textual representation, one can
resort to subword units. Subwords are divisions of words, for example, the word “subword” can be
divided into the pair “sub” and “word”, each having different vector representations. Subword
representations help overcome out-of-vocabulary (OOV) words – if a word is not present in a vocabulary, its constituent subword units most likely are. Moreover, even when a word is OOV, subwords make it possible to guess its meaning through morphemes that reveal some characteristics of the word. Different algorithms for subword extraction have been proposed (Kudo, 2018; Mikolov et al., 2012; Nießen & Ney, 2000). One of the most popular, proposed by Sennrich, Haddow, & Birch (2016), uses byte-pair encoding (BPE) (Gage, 1994), a data compression technique that represents the most frequent pair of bytes in a sequence as a single byte. Applied to subword extraction, BPE works by counting all character pairs and replacing each occurrence of the most frequent pair with a new merged subword unit. This approach is used by Google in their unsupervised text tokenizer SentencePiece7.
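A minimal sketch of one BPE merge iteration over a toy word-frequency dictionary, following the counting-and-merging procedure described above (the vocabulary and frequencies are invented, and this is not the SentencePiece implementation):

```python
import re
from collections import Counter

# words are represented as sequences of symbols (characters), with corpus frequencies
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

def get_pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    pattern = re.escape(" ".join(pair))
    merged = "".join(pair)
    return {re.sub(pattern, merged, word): freq for word, freq in vocab.items()}

best = get_pair_counts(vocab).most_common(1)[0][0]   # most frequent pair, e.g. ('e', 's')
vocab = merge_pair(best, vocab)                      # 'e s' becomes the subword 'es'
print(best, vocab)
```

Repeating this merge step a fixed number of times yields the final subword vocabulary.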
The methods exposed so far generate sparse vectors that tend to become extremely high-dimensional depending on the size of the vocabulary. In this section, we discuss how words can instead be represented as dense, low-dimensional real-valued vectors, a concept known as word embeddings. In other words, a word embedding is a
learned representation of a textual instance in the form of feature vectors of a predefined length, in
which each vector entry stands for a hidden feature of the word. Word embeddings are useful
because they can reduce the dimensionality of word representations, while, at the same time,
providing better generalization and extracting semantic relations.
With the rise of deep learning algorithms, such as the ones mentioned in Section 2.1, these methods have also been applied to the computation of word embeddings, which are then used for NLP tasks such as sequence labelling (Collobert & Weston, 2008). The
neural networks are trained with the aim of minimizing the distance between words that occur in
7 https://github.com/google/sentencepiece
similar contexts. This process can be done jointly with the neural network model on a supervised task, in which the embeddings are parameters of the network (an embedding layer) adjusted to minimize the loss, resulting in vector representations in which similar words have similar representations.
Word embeddings can also be achieved through an unsupervised process upon a given text
corpus. Mikolov, Chen, Corrado, & Dean (2013) proposed word2vec8, an unsupervised method
capable of generating word embeddings through two different methods – the Continuous Bag-of-
Words (CBOW) and the Skip-Gram. The aim of CBOW is to predict a target word 𝑤𝑡 based on the 𝑖 words before and after 𝑤𝑡. The objective function of CBOW can be defined as:
$$ J_\theta = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-i}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+i}) \qquad (2.14) $$
where 𝑤𝑡 is the middle word in a context window defined by the 𝑖 preceding and 𝑖 following words, and 𝐽𝜃 maximizes the probability of 𝑤𝑡 appearing in the given context. The probability 𝑝(𝑤𝑡 | 𝑤𝑡−𝑖 , … , 𝑤𝑡−1 , 𝑤𝑡+1 , … , 𝑤𝑡+𝑖 ) is given by a softmax function over the surrounding words. The
Skip-Gram model, on the other hand, uses the center word to predict the surrounding words. It sums
the log probabilities of the surrounding words trying to maximize the following function:
$$ J_\theta = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-i \le j \le i \\ j \neq 0}} \log p(w_{t+j} \mid w_t) \qquad (2.16) $$
Again, the probability is given by a softmax function, which in this case uses 𝑤𝑡 to predict the probability of a surrounding word 𝑤𝑡+𝑗. Because the size of the softmax output is equal to the number of words in the vocabulary, this output can become a computational bottleneck. Despite this, word2vec has been shown to be more computationally efficient than methods that use fully connected hidden layers.
To reduce the computational weight of an output softmax layer, Mikolov, Sutskever, Chen, Corrado, & Dean (2013) proposed an architecture similar to the word2vec models, diverging in the use of a sigmoid function as the output layer. The task of generating word embeddings was redefined as a logistic regression, in which the model receives pairs of words and predicts whether they belong to the same context. This process is called negative sampling and reduces computational costs. Furthermore, Pennington, Socher, & Manning (2014) proposed GloVe9. Like word2vec with negative sampling, training takes advantage of word co-occurrence information, but in GloVe's case it uses the ratios of the co-occurrence probabilities of two words with a third one.
By exploiting vast amounts of textual data, word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) capture word semantics and word arithmetic, i.e. word meaning can be exposed through operations on word vectors, for example for words that have semantic relationships like country-capital, male-female, or singular-plural. This way, dense models are not only more
8 https://code.google.com/archive/p/word2vec/
9 https://github.com/stanfordnlp/GloVe
efficient with regard to their parameter size, but also in their ability to extract complex features from words based on their distribution (the distributional hypothesis). However, these representations fall short considering that words can have different meanings depending on the context (word polysemy). Because these representations are fixed, we ignore the different meanings a word can have depending on the context in which it is used, which can impact the performance of the downstream system receiving the embeddings.
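As an illustration of how such embeddings are typically trained and queried in practice, the sketch below uses the gensim library (assuming gensim 4.x; this is not necessarily the implementation used elsewhere in this work), with a placeholder corpus and hyperparameter values:

```python
from gensim.models import Word2Vec

# each training example is a tokenized sentence / email line
sentences = [["please", "cancel", "my", "order"],
             ["please", "confirm", "my", "order", "number"]]

# sg=1 selects the Skip-Gram objective (sg=0 would be CBOW);
# negative=5 enables negative sampling with 5 noise words
model = Word2Vec(sentences, vector_size=100, window=5, sg=1, negative=5, min_count=1)

vector = model.wv["order"]                 # dense 100-dimensional word vector
similar = model.wv.most_similar("order")   # nearest neighbours in embedding space
# word-arithmetic analogies (king - man + woman ≈ queen) need a much larger corpus:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])
```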
Peters et al. (2018) introduced a new type of deep contextualized word representation derived from pre-trained bidirectional models - Embeddings from Language Models (ELMo). ELMo embeddings convert each token (a textual instance) into a character representation using character embeddings. The use of character-level embeddings resembles the use of subwords, as character representations also allow extracting hidden morphological features and eliminate OOV words. Character embeddings are then pushed through a CNN layer, whose filters pick up n-gram features. The CNN over the characters computes word-level embeddings, which are passed through a BiLSTM model. The internal representations of this model, i.e. the states of the LSTM, capture context-dependent aspects of word meaning and are optimized with a bidirectional language model objective, making ELMo embeddings a function of the entire input sequence, instead of a fixed vector for each word.
BERT (Devlin et al., 2019) is another method able to produce contextual embeddings, that
relies on two previously developed ideas – Attention Mechanism (Bahdanau et al., 2015) and the
Transformer (Vaswani et al., 2017). The model can be understood as a trained Transformer encoder
stack, with 12 encoder layers for the BERT-Base variation or 24 encoder layers for BERT-Large, each
with feedforward neural networks with 768 and 1024 hidden-nodes, respectively (Figure 2.4).
Figure 2.4 – BERT-Base and BERT-Large Encoder Stack. Taken from https://tinyurl.com/y3werbz7
BERT's training can be divided into two tasks: 1) Masked Language Modeling (MLM), in which 15% of the tokens in the input sequence are selected; most of these are replaced by a special mask token and a small fraction by random tokens, and the model is asked to predict the original tokens. This learning task makes BERT a bidirectional language model; 2) Next Sentence Prediction, which consists of predicting whether a given sentence B is likely to follow another sentence A. Together, these tasks allow BERT to learn a language representation model. The pre-trained BERT can then be used to generate contextualized word embeddings for a given corpus, which can be fed to a downstream task, or it can be fine-tuned for a specific task such as question answering, language inference, or named entity recognition. Besides BERT-Base and BERT-Large, variations of BERT trained on different corpora have been released10.
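For illustration, contextualized embeddings can be extracted from a pre-trained BERT with the Hugging Face transformers library; this is a sketch assuming the publicly available bert-base-uncased checkpoint, not the model adopted later in this work:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# the same surface form ("bank") receives different vectors in different contexts
inputs = tokenizer(["he sat by the river bank", "she went to the bank for a loan"],
                   padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state   # (batch, tokens, 768) for BERT-Base
```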
In their research measuring the impact of key hyperparameters and training data size on BERT training, Liu et al. (2019) observed that BERT was undertrained. Among other changes, the authors removed the next sentence prediction objective, dynamically changed the masking pattern applied to the training data and used a novel dataset for pre-training. This new configuration of BERT was named Robustly optimized BERT approach (RoBERTa) and showed state-of-the-art results when tested on the same tasks used to test BERT and other Transformer-based models, as well as performance improvements for downstream tasks that use RoBERTa to create contextualized word embeddings.
Recent works with Transformer-like models have shifted their attention to learning multilingual (cross-lingual) representations of text. Among them is Multilingual BERT (Devlin et al., 2019), which is trained with corpora from multiple languages. Later, other cross-lingual Transformer-like models used more pre-training data or other training objectives. Lample & Conneau (2019) proposed Cross-lingual Language Models (XLM), introducing a new supervised task for learning cross-lingual representations, the Translation Language Model, in which words are masked both in a source sentence (e.g. in English) and in its translation (e.g. in French), and the model can attend to the surrounding English words or to the French words from the target sentence. This method encourages the model to align the English and French representations.
Moreover, Conneau et al. (2020) mixed the RoBERTa (Liu et al., 2019) and XLM (Lample & Conneau, 2019) approaches, introducing XLM-RoBERTa (XLM-R), a cross-lingual RoBERTa Transformer pretrained on text from 100 languages, which shows performance gains in a range of multilingual transfer tasks when compared to other multilingual Transformers. In conclusion, these cross-lingual Transformers can learn language models in multiple languages, boosting the performance of models on monolingual and cross-lingual classification and on unsupervised and supervised machine translation.
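The same interface used above can load a cross-lingual encoder; a brief sketch assuming the publicly released xlm-roberta-base checkpoint, showing that a single model and subword vocabulary cover sentences in different languages:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

# one shared subword vocabulary handles English and Portuguese alike
batch = tokenizer(["Where is my order?", "Onde está a minha encomenda?"],
                  padding=True, return_tensors="pt")
embeddings = model(**batch).last_hidden_state
```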
10 Original BERT repository is available from https://github.com/google-research/bert
3. RELATED WORK
In this chapter, we address the email zoning related work (section 3.1) and related work in the area
of text segmentation (section 3.2). Chronologically, we discuss the evolution of the email zoning
approaches, taxonomies and the different models applied, while also summarizing the existing email
zoning corpora (section 3.1.5).
Chen, Hu, & Sproat (1999) pioneered text segmentation applied to email text, but their work focused only on the identification of signature zones. Signature text blocks
are email parts that can usually be found at the end of the email, containing automatically inserted
personal data. Signature blocks are a major source of information, as they usually contain details
about the sender, such as email address, web address, telephone number, name, postal address,
etc., and can be used for tasks such as construction of a client database or message retrieval. The
authors looked at linguistic patterns and geometrical patterns inside previously identified signature zones. Linguistic patterns relate to the lexical constraints found in the text, which are common in email and web addresses, but less visible in other text passages such as names and postal addresses. As for the geometrical patterns, they indicate the reading sequence of a signature block. The researchers tested their model on 1,361 signature blocks collected from emails from the Department of Computer Science at Concordia University12 and from their own personal emails, claiming that the system achieves a Recall of 53% and a Precision of 90% in the task of signature identification.
3.1.1. JANGADA
Similarly to Chen, Hu, & Sproat (1999), Carvalho & Cohen (2004) developed JANGADA13, a system that
attempts to identify signature blocks and quoted text from previous emails. Figure 3.1 illustrates a
labeled email message, with reply (quoted text) and signature lines identified.
11 https://www.ldc.upenn.edu/
12 https://www.concordia.ca/ginacody/computer-science-software-eng
13 https://www.cs.cmu.edu/~vitor/codeAndData.html
Zone Line
other From: [email protected]
other To: Vitor Carvalho
other Subject: Re: Did you try to compile javadoc recently?
other Date: 25 Mar 2004 12:05:51 -0500
other Try cvs update –dP, this removes files & directories that have been deleted from cvs.
other -W
other
reply On Wed, 2004-03-24 at 19:58, Vitor Carvalho wrote:
reply > I just checked-out the baseline m3 code and
reply > "Ant dist" is working fine, but "ant javadoc" is not.
reply > Thanks
reply > Vitor
other
signature ------------------------------------------------------------------------------------------------------------------------------------
signature William W. Cohen “Would you drive a mime
signature [email protected] nuts if you played an
signature http://www.wcohen.com audio tape at full
signature Associate Research Professor blast?” ----
signature CALD, Carnegie-Mellon University S. Wright
Figure 3.1 – Example of a JANGADA labeled email message adapted from Carvalho & Cohen (2004).
The system starts by classifying if an email contains signatures or reply lines. Then, for the
selected emails, it classifies each line using CRF (Lafferty et al., 2001) and sequence-aware
perceptrons (Collins, 2002), based on each line's own features and features from surrounding lines, such as the presence of a quote sign “>”, the presence of the message sender's name, email address patterns, other relevant character and punctuation patterns, and the number of leading tabs. JANGADA was trained and tested with English emails from the 20 Newsgroups14 (Lang,
1995), and reported accuracy reaches 98.91%. Later, Lampert et al. (2009) tested the performance of
the JANGADA system in the Enron email corpus15 (Klimt & Yang, 2004), reporting that JANGADA detects
less than 10% of reply and forward lines. Nonetheless, Repke & Krestel (2018) showed that slight
modifications applied to JANGADA to fit different tasks lead to an accuracy similar to the one claimed
in the original research.
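To illustrate the kind of hand-crafted line features described above, the sketch below is an invented example of per-line feature extraction for line-based email zoning systems; it is not the actual JANGADA feature set or code.

```python
import re

def line_features(line: str, sender_name: str) -> dict:
    """Illustrative per-line features of the kind used by line-based email zoning systems."""
    return {
        "starts_with_quote": line.lstrip().startswith(">"),
        "contains_sender_name": sender_name.lower() in line.lower(),
        "contains_email_address": bool(re.search(r"\S+@\S+\.\S+", line)),
        "num_leading_tabs": len(line) - len(line.lstrip("\t")),
        "ends_with_punctuation": line.rstrip().endswith((".", "!", "?", ":")),
    }

print(line_features("> Thanks for your help.", sender_name="Vitor Carvalho"))
```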
After JANGADA, Tang, Li, Cao, & Tang (2005) proposed an email data cleansing system based on a Support Vector Machine (SVM) (Cortes & Vapnik, 1995) that aimed at filtering the non-textual noisy content from emails based on hand-coded features of the textual content of email zones. The authors collected emails from different newsgroups held by Google, Yahoo or Microsoft and extended previous email zone schemas, considering header, signature, quotation, program code, and table zones. Reported zone detection F1-scores reach 97.76% for header, 89.88% for signature, 95% for quotation, 81.26% for program code and 91.19% for paragraph.
14 http://people.csail.mit.edu/jrennie/20Newsgroups/
15 http://www.cs.cmu.edu/~enron/
In their author profiling and author identification task, Estival et al. (2007) resorted to email messages donated by recruited respondents16, parsing each email and classifying text segments into five categories: author text, signature, advertisement, quoted text, and reply lines. The segmentation of emails is a crucial part of their work, since features for author attribution and profiling can be found in the author text zones, which represent around 82% of the total number of words in their email corpus. They compared a range of ML algorithms together with feature selection to classify each line in an email, attaining improvements in the end task of author profiling. The authors also compared the performance of their model with JANGADA, reporting an accuracy of 88% with a three-zone classification (author text, signature, and reply lines), against 64% accuracy for JANGADA.
3.1.2. ZEBRA
Recognizing the lack of a common syntax and informal structure of emails, Lampert et al. (2009)
formally defined the email functional parts as email zones, describing the different zones inside email
messages based on graphic, orthographic, and lexical features.
Alongside their definition, the authors refined and extended Estival et al. (2007) classification
schema, considering three zones - sender, quoted conversation, and boilerplate -, extensible to nine
subzones. The sender zones contain text written by the current email sender and can be subdivided
into author, greeting, and signoff. The quoted zones contain reply zones that hold content quoted
from previous messages in the same thread, and forward zones, which consist of forward messages
from other conversations. Finally, the boilerplate zones contain content that is reused without
modification throughout different messages and can be subdivided into signature, advertising,
disclaimer, and attachment. An example labeled message is shown in Figure 3.2.
Furthermore, the authors also proposed ZEBRA17, an email zoning system based on a SVM.
The system was trained with 400 random English email messages18 from an Enron email corpus
(Klimt & Yang, 2004) database dump19.
Zone classification was made following two approaches – zone fragment classification and
line classification. For the zone fragment classification, ZEBRA considers email zones to be composed
of fragments: consecutive email lines that are divided by zone boundaries, such as white space-only
lines. As for the line classification method, it simply classifies lines one by one, an approach similar to the one seen in JANGADA.
Both fragments and lines are referred to as text fragments. The features used to classify each
text fragment are divided into Graphic Features, Orthographic Features, and Lexical Features.
Graphic features capture information about the layout of the email text considering, for example, the number of words in the text fragment or the average line length of a text fragment (equal to the line length in line classification). Orthographic features capture the use of distinctive characters and sequences of characters, such as punctuation, capital letters, and numbers. Examples of orthographic features are the percentage of capitalized words in a text fragment and whether the
16 Corpus available upon contact with the authors
17 http://zebra.thoughtlets.org/zoning.php
18 http://zebra.thoughtlets.org/data.php
19 https://bailando.berkeley.edu/enron_email
text contains an email address. Lastly, Lexical Features aim to encapsulate information about the
words used, resorting to unigrams for each word in the vocabulary, and bigrams to capture short-
range word sequence information. Lexical Features also consider if the sender’s name or recipient’s
name is present on the text fragment or in previous text fragments.
ZEBRA uses the mentioned features as input for the SVM classifier. Reported accuracy reaches
higher values with line classification, achieving 91.53% average accuracy with three zones and
87.01% for nine zones. ZEBRA’s three-zone classification outperforms Estival et al. (2007) system and
Lampert et al. (2009) JANGADA implementation. For a nine-zone classification, when comparing the
performance for each zone, author (89%), reply (91%) and forward (89%) achieve the highest F-
Measure score, while advertising (34%), signature (60%), and disclaimer (60%) get the worst results.
In their later work on detecting emails containing requests for action, Lampert et al. (2010) aimed at classifying emails based on the presence of action items. Using ZEBRA as an email preprocessing step, they were able to segment emails into different functional zones. Then, considering only the small number of zones that had relevant patterns for the classifier, they increased the accuracy of their request detection task from 72% to 84%.
After the introduction of ZEBRA, for almost a decade, little research was published on the topic of email zoning. Talon20, an open-source library for quotation and signature extraction, became a popular and easy-to-use method for email cleaning. The library provides functions able to extract quotations and signatures without machine learning, resorting to pattern matching techniques. For more complex emails in which the pattern system does not work, Talon also offers a machine learning solution, resorting to the scikit-learn21 library to build SVM classifiers based on email line features. The classifier was trained with emails from the Enron corpus and conversations from the personal email of Talon's creators. Additionally, Talon's machine learning algorithm can be retrained with the user's own dataset.
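A minimal usage sketch, based on the examples in the Talon project README (the exact calls may differ between versions, and the example message body is invented):

```python
import talon
from talon import quotations

talon.init()  # loads the library's built-in resources

msg_body = """Reply text written by the current sender.

On Wed, Mar 24, 2004 at 7:58 PM, Vitor Carvalho wrote:
> I just checked out the baseline code
"""

# strips the quoted conversation, keeping only the newly authored reply
reply_only = quotations.extract_from(msg_body, "text/plain")
print(reply_only)
```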
3.1.3. QUAGGA
As email zoning outgrew its original purpose of signature identification and text cleansing and became a more general task, Repke & Krestel (2018) extended its utility to thread reconstruction. The authors proposed QUAGGA, a recurrent neural network-based model that aims to recover conversation threads from single messages by segmenting and classifying email parts into two zones: header and body.
The header zone consists of blocks of metadata automatically inserted by the email program,
containing information about the sender, recipient, date, and subject of a quoted message. The body
zone considers the text that was written by the author of the current email. A conversation thread is
defined as being a sequence of headers and body blocks. This 2-zone division can be further
extended to a 5-zone classification based on Lampert et al. (2009) ZEBRA’s approach, by dividing the
body zone into greeting, body, signoff, and signature (see example in Figure 3.3).
Inspired by the work on character-aware neural language models (Kim, Jernite, Sontag, & Rush, 2016), QUAGGA uses a CNN to encode the characters of each line, using the output of this network as input for a bidirectional GRU network, followed by a CRF layer, which gives the final scores for each line. This process is illustrated in Figure 3.4. The authors report that training the CNN separately from the GRU-CRF gives the best results.
Like Lampert et al. (2009), Repke & Krestel (2018) resorted to the Enron corpus and emails
gathered from public mail archives of the Apache Software Foundation22 (ASF). The authors made
available the annotated dataset for the corpora23, as well as the code to implement QUAGGA 24. Their
annotated dataset considers 800 Enron messages and 500 ASF messages.
20 https://github.com/mailgun/talon
21 https://scikit-learn.org/stable/
22 http://mail-archives.apache.org/mod_mbox/
23 https://github.com/HPI-Information-Systems/Quagga/tree/master/Datasets
24 https://github.com/TimRepke/Quagga
2 Zone 5 Zone Line
body body Thank you for your help.
body signature ISC Horline
header 03/15/2001 10:32 AM
header Sent by: Randi Howard
header To: Jeff Skilling/Corp/Enron@ENRON
header cc:
header Subject: Re: My “P” Number
body greeting Mr. Skilling:
body body Your P number is P00500599. For your convenience, you can also go to
body body http://isc.enron.com/ under Site Highlights and reset your password or
body body find your “P” number.
body signoff Thanks,
body signoff Randi Howard
body signature ISC HOTLINE
body body
header From: Jeff Skilling 03/15/2001 10:01 AM
header To: ISC Hotline/Corp/Enron@Enron
header Subject: My “P” Number
body body
body body Could you please forward my “P” number. I am unable to get into the XMS
body body system and need this ASAP.
body signoff Thanks for your help.
Figure 3.3 – Example of a QUAGGA labeled email message with both two- and five-zone annotations,
adapted from Repke & Krestel (2018).
For the mentioned datasets, QUAGGA results are compared with Repke & Krestel (2018) implementations of JANGADA25 and ZEBRA26. For two-zone segmentation, QUAGGA shows an accuracy of 98% on both the Enron and ASF sets, with a decrease of three to five points when considering a five-zone segmentation. Reported performance values for QUAGGA are far superior to those of ZEBRA (accuracy of 25% for two zones and 24% for five zones), which does not reproduce the results of the original paper. The implementation of JANGADA gets close to the originally reported accuracies (88% for two zones and 85% for five zones), but it is still outperformed by QUAGGA.
JANGADA and ZEBRA were tested only on English emails and their authors do not claim the systems are multilingual. On the other hand, due to its character-level approach, Repke & Krestel (2018) claim that QUAGGA can segment zones in emails written in other languages and scripts, such as Arabic and Cyrillic. Nevertheless, no tests were presented to support that claim.
25 https://github.com/HPI-Information-Systems/Quagga/tree/master/Competitors/Jangada
26 https://github.com/HPI-Information-Systems/Quagga/tree/master/Competitors/Zebra
Figure 3.4 – QUAGGA model overview. Taken from Repke & Krestel (2018).
3.1.4. CHIPMUNK
Until very recently, email zoning resorted mostly to small samples of mailing list or newsgroup corpora and was limited to the English language. Bevendorff, Khatib, Potthast, & Stein (2020) were the first to crawl emails at scale, resorting to the Gmane email-to-newsgroup gateway27. The authors annotated 3,033 Gmane emails28 from 31 languages, although 90% of the emails are in English.
Due to the richness of Gmane's conversations in technical topics, the authors developed a more fine-grained classification schema than previous related work, considering the segmentation of blocks of code, log data and technical data. While also preserving most of the common zones introduced in previous works, they ended up with a total of 15 zones: paragraph (main content, equivalent to body or authored text), salutation (equivalent to greeting), closing (equivalent to signoff), quotation (forward and reply lines), quotation marker (author and date of a quotation), inline header (other details of a quotation, such as the subject), personal signature (equivalent to signature), mua signature (mostly advertising), raw code (blocks of code), patch (source code diffs), log data (text referring to logging information), technical (attachments and PGP signatures), tabular (content in table format), visual separator (sequences of dashes), and section heading (the title of a paragraph segment). Figure 3.5 shows an example of a Gmane labeled ticket.
Their email zoning model, dubbed CHIPMUNK, consists of a BiGRU-CNN model, as depicted in Figure 3.6. 1.5 million emails were extracted from the main corpus and used to train fastText embeddings (Joulin, Grave, Bojanowski, & Mikolov, 2017) of size 100. The BiGRU receives the embeddings for the current and the 𝑛 previous lines, with 𝑛 being up to 12 lines. In parallel, an embedding matrix of size (2𝑐 + 1, 𝑛, 100), where 𝑐 = 4 represents the context window for each line and 𝑛 is the maximum token count per line, is fed to a CNN with a 4 × 4 filter that performs 128 convolutions, followed by a 3 × 3 CNN layer performing the same number of convolutions. The output of the second CNN is fed into a max pooling layer and concatenated into a single vector with the output of the BiGRU for the current and previous lines. The concatenated vector receives a dropout of 0.25 and is passed through a softmax layer that generates the outputs.
27 https://news.gmane.io/
28 https://github.com/webis-de/acl20-crawling-mailing-lists/tree/master/annotations
Zone Line
salutations Hi Michael,
paragraph Thanx very much for your response to my question. I will keep a look
paragraph out on VITN for any updates. The artwork has been fantastic over the
paragraph years! Thanx so much for all the effort put in!!!
closing kind regards
closing LiveMiles
mua signature Sent from my iPhone
Figure 3.6 – CHIPMUNK model architecture. Taken from Bevendorff et al. (2020).
29 https://github.com/webis-de/acl20-crawling-mailing-lists/tree/master/annotations
3.1.5. Email Zoning Public Corpora
As seen throughout section 3.1, several corpora and zoning schemas have been proposed in the
literature under different contexts. Table 3.1 compiles the information of the existing email zoning
corpora.
To the best of our knowledge, Carvalho & Cohen (2004) released the first email zoning corpus. The corpus consists of 617 emails from the 20 Newsgroups corpus (Lang, 1995), annotated with two zones: signature and quotation. Despite the usefulness of identifying these zones for email cleansing, the level of detail is insufficient for a general email segmentation. Additionally, Lampert et al. (2009) noted that the 20 Newsgroups corpus contains emails that are 30 years old and much more homogeneous in their syntax than today's emails. They also point out that JANGADA emails conform to the RFC 3676 protocol (Gellens, 2004), which states that signature lines are separated from the rest of the email text with two or more dash marks “--”, a convention no longer observed in most email messages.
Estival et al. (2007) released a corpus of 9,836 email messages donated by recruited respondents42 and introduced a wider annotation schema focusing on more email parts (see section 3.1.1). However, the authors still did not divide the email text into some other relevant zones, such as greetings and closings, nor did they identify attachments and code lines.
30 http://people.csail.mit.edu/jrennie/20Newsgroups/
31 http://www.cs.cmu.edu/~vitor/codeAndData.html
32 Corpus available upon contact with the authors
33 https://bailando.berkeley.edu/enron_email
34 http://zebra.thoughtlets.org/data.php
35 https://github.com/mailgun/talon
36 https://github.com/mailgun/forge
37 https://github.com/HPI-Information-Systems/Quagga/tree/master/Datasets/Enron
38 http://mail-archives.apache.org/mod_mbox/
39 https://github.com/HPI-Information-Systems/Quagga/tree/master/Datasets/ASF
40 https://webis.de/data.html?q=Webis-Gmane-19
41 https://github.com/webis-de/acl20-crawling-mailing-lists/tree/master/annotations
Lampert et al. (2009) were the first to conceptualize the email zoning task and fully define
the characteristics of each identified zone, as well as dividing the authored text into different zones
(see section 3.1.2). They annotated 400 English emails from the Enron email corpus (Klimt & Yang,
2004) database dump. The Enron database dump contains over 600,000 emails from 158 employees of the Enron Corporation and was made public by the Federal Energy Regulatory Commission during its investigation of Enron, after the company's collapse. The database is organized by user and, for each user, the messages are divided into several folders according to email topic. Also resorting to the Enron database dump, the creators of the Talon library introduced an open-source corpus of 196 emails with identified signature lines. As with the Carvalho & Cohen (2004) corpus, Talon's annotated corpus is insufficient for a general email segmentation, since it identifies only one zone - signature.
Repke & Krestel (2018) resorted to the Enron database as well, annotating a total of 800
emails. Reconsidering the task as thread reconstruction, they produced a new annotation schema,
considering a 2-level and a 5-level approach (see sections 3.1.3 and 4.1.2.1). Repke & Krestel (2018)
also annotated 500 emails from the Apache Software Foundation (ASF) using both the 2-level and 5-
level taxonomies. The ASF provides an archive of messages posted to the public mailing lists of the
ASF projects. Mailing lists are used for each of the Apache projects to coordinate the development of
software and administration of the organization. ASF mailing lists are also archived by several other
organizations such as MarkMail43. The first email archive started in February 1995, and following the
description in MarkMail’s website, in 2018 there were 686 active lists, with a total of 22.879.859
messages, accumulating 5.830 messages per day. Repke & Krestel (2018) make use of the ASF
dataset by randomly selecting emails from the flink-user, spark-user, and lucene-solr-user projects
mailing list archives.
Lastly, Bevendorff et al. (2020) introduced the email zoning Gmane corpus. Gmane is an
email-to-newsgroup gateway that allows users to subscribe and participate in mailing lists, while
keeping an archive of email messages from the mailing lists present on the website. Messages date from the early 1990s to the present day and are organized by groups following the Usenet style44. While the newsgroup portal is still working, Gmane's website has been down for several years. Due to the uncertainty around the future of Gmane and the possibility of historic data loss, Bevendorff et al. (2020) crawled the 14,699 groups, extracting 153,310,330 usable emails, most belonging to groups on technical topics (e.g., the Linux Kernel Mailing List and the KDE bug tracking list). The majority of the emails are in English, and the remainder are predominantly in German, French and Spanish. From the crawled emails, Bevendorff et al. (2020) annotated 3,033 emails from 31 languages. The authors developed a more fine-grained classification schema, with a total of 15 zones (see sections 3.1.4 and 4.1.2.2). Following the same zone taxonomy, they released a set of 300 English emails from the Enron database dump47.
Overall, email zoning corpora show great variability in zone taxonomies, and most works have introduced new zones to suit the nature of each email source or downstream task. The Enron database dump has been the most used source for retrieving emails and building new corpora. On the other hand, the recent Gmane raw email dump is multilingual and Bevendorff et al. (2020) zone
43 http://apache.markmail.org/
44 https://www.ou.edu/research/electron/internet/use-writ.htm
classification schema contains various functional zones, opening the door to new challenges in email zoning and multilingual methodologies. Moreover, Bevendorff et al. (2020)'s contribution of making the Gmane multilingual raw corpus available leaves the opportunity for researchers to generate new multilingual email zoning corpora and to measure models' capacity for the multilingual email zoning task.
Recent work in text segmentation formulates the problem as a supervised learning task, an approach similar to the one seen for email zoning. This reflects the increasing number of publicly available annotated corpora released in recent years. Koshorek, Cohen, Mor, Rotman, & Berant (2018) introduced the Wiki-727 dataset45, a collection of 727,746 English Wikipedia articles46. The authors
modeled a hierarchy of two-level networks to predict for each sentence whether it ends a topic-
based segment. The low-level network is composed of a two-layer BiLSTM which, along with a max-
pooling over the LSTM outputs, generates sentence representations. The sentence representations
are fed to the higher-level network, also composed of a two-layer BiLSTM followed by a Softmax
layer. Besides the Wiki-727, the system results were also tested in the Choi dataset (Choi, 2000),
outperforming other models for the first dataset and achieving reasonable performances in the
second.
Similarly, Li, Sun, & Joty (2018) proposed SEGBOT, an end-to-end model that encodes the input sequence with a BiLSTM and uses another BiLSTM together with a pointer network (Vinyals,
dataset (Carlson, Marcu, & Okurowski, 2003) and the Choi dataset (Choi, 2000), achieving high
performances in both.
Also resorting to attention, Wang, Li, & Yang (2018) developed a text segmentation model
based on the BiLSTM-CRF framework (Huang, Xu, & Yu, 2015) and self-attention mechanism
45 https://github.com/koomri/text-segmentation
46 https://dumps.wikimedia.org/enwiki/latest/
(Vaswani et al., 2017), resorting to ELMo pre-trained word representations (Peters et al., 2018). The
model was tested on the RST-DT dataset (Carlson et al., 2003) outperforming SEGBOT (Li et al., 2018).
Likewise, Lukasik, Dadachev, Papineni, & Simões (2020) proposed three BERT-based architectures (Devlin et al., 2019) for text segmentation. Figure 3.7 illustrates the developed models.
The first model proposed is the Cross-segment BERT (Figure 3.7(a)). This model receives the input
sequence - composed of the sequences of subwords that come before and after a candidate segment
break - and feeds it through a Transformer encoder initialized with the BERT-Large model. Figure
3.7(b) illustrates the second model. Subword representations of each sentence are encoded with the
BERT-Large model. Based on the [CLS] token the model extracts sentence representations, that are
fed into a BiLSTM model that predicts if each sentence represents a segment break. By using a two-
level model, the BiLSTM can capture the dependencies between distant sentences.
Finally, Figure 3.7(c) depicts the third model implemented - the Hierarchical BERT. In this case,
the sentence level BiLSTM is replaced by a BERT-Base model, and the subword level Transformer
used is also initialized with BERT-Base. The Hierarchical BERT model is inspired by HIBERT (Zhang, Wei,
& Zhou, 2020), a model for document summarization in an unsupervised learning fashion. All three
models were compared with SEGBOT (Li et al., 2018), Wang et al. (2018) and Koshorek et al. (2018)
models, setting state-of-the-art performances in the Wiki-727 dataset (Koshorek et al., 2018), Choi
dataset (Choi, 2000) and RST-DT dataset (Carlson et al., 2003). In theory, all three models can
become multilingual if the word or sentence embeddings are generated with a multilingual variation
of the pre-trained BERT (Devlin et al., 2019), or any other cross-lingual transformer model, such as
the XLM-Roberta (Conneau et al., 2020).
4. METHODOLOGY
Following the concepts and work addressed in sections 2 and 3, this section describes the corpora used in this research (section 4.1), defines and explores the systems (models) we developed to perform the email zoning task (section 4.2), and describes and explains the metrics used to evaluate email zoning models on the different corpora and to test annotator agreement (section 4.3).
4.1. CORPORA
In this section, we describe and analyze the email zoning corpora used in the scope of this research.
The collected corpora can be divided based on their purpose as follows:
For each corpus, we analyze the overall statistics (e.g. number of emails, number of lines,
etc.) and zone distribution statistics (e.g. number of lines per zone, percentage of total lines per zone,
etc.).
We collected emails from 14 different accounts (CLEVERLY clients) in the English, Portuguese, Spanish,
French and Italian languages. Moreover, by analyzing the corpus together with CLEVERLY business
needs, we produced an email zoning classification schema that best fits the company’s client emails.
In view of the commonalities between the zone schemas presented in the related work (section 3.1), we analyzed CLEVERLY's corpus considering also the company's business requirements and the textual characteristics found in the emails. This way, 8 zones were defined:
• context details contains information that is generally used to contextualize the details of a certain transaction being addressed in the email message, typically in a tabular fashion.
• attachment contains automated text in place of attached documents.
• greeting contains the terms of address and recipient names at the beginning of a message (e.g., Good afternoon / Dear Mrs.).
• body contains new content from the current email sender.
• signoff contains the message closing the authored part of the email (e.g., Kind Regards, Mr.).
• signature contains contact or other information that is automatically inserted in a message. In contrast to signoff, signature content is usually templated content written once by the email author and automatically or semi-automatically included in email messages. Also contained under signature are automatically generated message lines that state from which device the email was sent or that advertise some brand (e.g., Sent from iPhone / Secured by Avast).
• disclaimer includes legal disclaimers, privacy statements, or other automatically generated texts advising the people that interact with the email (e.g., Think of the environment before printing this email).
• quoted conversation includes both content quoted in reply to previous messages in the same conversation thread and forwarded content from other conversations.
Zone Line
context details Order details: NUMBER
context details Email: [email protected]
context details Product Name: PRODUCT
attachment [image1.png]
greeting Hi.
body Made an order with you on WEEKDAY and received this email yesterday (see picture).
body Just want to double check that the order was processed?
body
signoff Thanks.
signoff NAME SURNAME.
signature Sent from my iPhone.
signature
disclaimer --------------------------------------------------------------------------------------------------------------------
disclaimer Important: The contents of this email and any attachments are confidential.
disclaimer They are intended for the named recipient(s) only.
disclaimer If you have received this email by mistake
disclaimer please notify the sender immediately and do not disclose the contents to anyone or make a
copy thereof.
quoted > On WEEKDAY,
quoted MONTH DAYNUMBER,
quoted YEAR at HH:MM ACCOUNT <[email protected]> wrote:
quoted > Hi NAME.
(…) (…)
Figure 4.1 – Excerpt from a CLEVERLY labeled email message. Personal and other sensitive instances were replaced by a default token that indicates their content.
Figure 4.1 shows an example email containing all 8 zones. The zone annotation was done by
4 CLEVERLY annotators resorting to the prodigy47 annotation tool, since this was the tool used by
CLEVERLY for all internal annotation processes. All annotators were native Portuguese speakers,
capable of speaking at least 3 of the corpus languages. Zones were defined at the token level (word,
punctuation mark, etc.) and then mapped to the line level. We consider lines as sequences of characters and tokens delimited by punctuation marks (except for email addresses, websites, etc.) or by a newline character “\n”. Single and consecutive blank lines were labeled with the same zone as the previous non-blank line. After a quoted line, all following lines are considered quoted. If we want to
47 https://prodi.gy/
extract more detailed zone information from quoted, this can be done by considering each quoted
segment (sequence of quoted lines) as a single email and passing it through the email zoning model.
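The two line-level conventions described above (blank lines inherit the previous zone; everything after the first quoted line is labeled quoted) can be sketched as follows. This is an illustration of the annotation rules, not CLEVERLY's actual tooling, and the example labels are invented.

```python
def propagate_line_labels(lines, labels):
    """Apply the blank-line and quoted-propagation conventions to per-line zone labels."""
    fixed, seen_quoted, previous = [], False, None
    for line, label in zip(lines, labels):
        if seen_quoted:
            label = "quoted"                 # everything after a quoted line stays quoted
        elif line.strip() == "" and previous is not None:
            label = previous                 # blank lines inherit the previous non-blank zone
        if label == "quoted":
            seen_quoted = True
        if line.strip() != "":
            previous = label
        fixed.append(label)
    return fixed

print(propagate_line_labels(
    ["Hi.", "", "Thanks.", "> On Monday X wrote:", "> Hello"],
    ["greeting", "body", "signoff", "quoted", "body"]))
# ['greeting', 'greeting', 'signoff', 'quoted', 'quoted']
```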
Table 4.1 characterizes each language present in the corpus. English, with 8,906 tickets, represents
more than half of the total number of tickets. Portuguese is the language with the second most tickets, with 4,869, roughly half of the English count. Spanish (756), French (642) and Italian (284) tickets summed together are still fewer than the Portuguese tickets. These numbers reveal an imbalance in the number of tickets available for each language.
Concerning the average number of lines per ticket, Italian tickets (37.5 lines) are, by far, the longest, while English tickets are the shortest (15.5 lines). The average number of zones per ticket is higher for Italian, Portuguese, and French - all with approximately 3 zones per ticket on average - and lowest for English tickets (2.7 zones).
Figure 4.2 – Percentage of total lines for email zone, for each language in CLEVERLY corpus.
Figure 4.2 depicts, for each language, the percentage of the total lines per zone. Clearly, quoted is the zone with the most lines in every language, with between 40% and 80% of the lines; body is the second most prominent zone in every language, ranging between 30% and 40%. All other zones have a similar distribution across languages, with attachment (less than 1%) being the zone with the fewest occurrences.
We note that the Italian corpus has a disproportionate amount of quoted lines, which may explain why it is the language with the highest average number of lines per ticket, since quoted zones come, on average, in sequences of 60 lines (see Table 4.2).
The degree of distinctness between the defined zones relies not only on the textual content but also on graphical features, such as a zone's position in the ticket or the number of consecutive sentences that constitute a zone. Furthermore, zones tend to precede and follow different zones. These phenomena may help us understand the typical constitution of CLEVERLY's tickets.
In Table 4.2 we characterize each email zone from the CLEVERLY corpus. In summary, most zones tend to have a similar average number of consecutive lines, with some exceptions, the most notable being quoted, with more than 59 consecutive lines on average. Nevertheless, the average number of tokens per quoted line is close to the values found in the other zones, with body being the zone with the most tokens per line on average (10.9 tokens). body is the most common preceding zone for half of the zones and the following zone for two of them. Regarding the zones' positions in the tickets, greeting and context details tend to appear as the first zone, body as the second, attachment, signoff, signature, and disclaimer as the third, and quoted as the fourth zone.
Finally, for every language, our train, validation and test splits resemble the ones from related work, namely the Repke & Krestel (2018) and Bevendorff et al. (2020) corpora. We split each corpus into train and test sets according to the following ratios: 80% for train and 20% for test. Then, we divided the train set into 80% for train and 20% for validation. These divisions can be found in Table 4.3.
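A sketch of the splitting procedure with scikit-learn; the variable names and the placeholder list are illustrative, and the actual splits are the ones reported in Table 4.3.

```python
from sklearn.model_selection import train_test_split

emails = [f"email_{i}" for i in range(100)]   # placeholder for one language's annotated emails

train_emails, test_emails = train_test_split(emails, test_size=0.20, random_state=42)
train_emails, val_emails = train_test_split(train_emails, test_size=0.20, random_state=42)
# roughly 64% train, 16% validation, 20% test of the original corpus
```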
Considering the number of available tickets, these splits ensure that we have enough representativeness to train, validate and test models in each language. Nevertheless, because we are in the presence of corpora of different sizes, we need to be cautious when comparing the results of models trained in English or Portuguese with models trained in the other three languages, as these have less training data. Likewise, the validation and test splits may also lead to wrong conclusions across languages, since a smaller corpus may not have the same variability as a larger one.
In this section we analyze the public corpora used in the scope of this research: Repke &
Krestel (2018) Enron and ASF corpora and Bevendorff et al. (2020) Gmane and Enron corpora.
As discussed in section 3.1.3, Repke & Krestel (2018)'s objective task is thread reconstruction, which can be seen as a subtask of email zoning. Thread reconstruction consists of extracting consecutive sequences of email header and body blocks. This task can be done via a 2-zone approach, where email headers are classified with the label header and body blocks are classified with the label body. The 2-zone approach can be extended to a 5-zone approach, where the body zone is divided into greeting, body, signoff, and signature (see Figure 3.3).
                     Enron                     ASF
                 Train   Test   Val       Train   Test   Val
# emails           500    200   100         222     90    45
# threads          221    106    53         147     50    29
thread length      2.7    2.4   2.4         3.3    3.7   3.2
Table 4.4 – Repke & Krestel (2018) Enron and ASF corpora in numbers, divided into train, test and validation sets.
Table 4.4 shows the division of the corpora into train, test, and validation sets and the respective statistics. In both corpora, the number of emails with threads is around half of the total number of emails. The ASF corpus has longer threads and its emails are roughly three times longer than those in the Enron corpus. This is reflected in the ASF corpus's greater number of zones per email. In both cases, tickets are, on average, longer and have more zones than CLEVERLY's ones.
Furthermore, the distribution of zone occurrences for both corpora, shown in Figure 4.3, reveals a similar pattern. For a two-zone classification, body comprises more than 80% of the lines, a pattern that differs from the CLEVERLY corpora, in which most of the lines were quoted. This phenomenon can be explained by the thread reconstruction zone taxonomy, which considers body lines after headers. For five zones, body is again the zone with the most occurrences in both corpora, making up 76% of the lines in the Enron corpus and 84% in ASF; header is the second zone with the most lines for both corpora, while greeting is the least occurring zone in the Enron corpus, representing 3% of the lines, and signature is the least occurring zone for the ASF corpus, with 2% of the lines.
Figure 4.3 – Percentage of the total lines per zone. Comparison between the Enron corpus
(green/left) and the ASF corpus (yellow/right).
Table 4.5 compiles the statistics for the corpora of Bevendorff et al. (2020).
                      Gmane              Enron
                  Train    Test       Train    Test
# emails           2733     300          60     236
# zones              15      15          12      13
# languages          31      14           1       1
% English           88%     87%        100%    100%
# lines          115041   11211         907    4527
# lines / email    42.1    37.4        15.0    19.2
# zones / email     6.4     6.6         4.4     5.9
Table 4.5 – Bevendorff et al. (2020) available Gmane and Enron corpus in numbers.
On average, Gmane emails are longer and have more zones than emails from the Enron corpus, which has a structure in line with the one observed in the Repke & Krestel (2018) Enron corpus. When it comes to zone distribution, while the technical nature of the Gmane emails leads to a larger number of lines corresponding to technical zones (e.g., log data, patch), those are non-existent in the Enron corpus. Furthermore, in the Enron corpus most lines are paragraphs and quoted zones have fewer occurrences than in the Gmane corpus. Even though the Gmane corpus is composed of 31 languages, the emails are mostly in English, and the test set contains only a residual number of non-English emails (38 emails covering 14 different languages), which is insufficient for a consistent multilingual evaluation.
From the 3,033 annotated Gmane emails, 2,733 are used for training, leaving 300 for testing. The line distribution is the following: quotation (50%), paragraph (17%), patch (10%), log data (5%), mua signature (4%), raw code (3%), visual separator (3%), tabular (2%), personal signature (2%), closing (2%), quotation marker (2%), salutation (1%), technical (0.5%), inline header (0.3%) and section heading (0.3%).
The Enron corpus contains a total of 300 tickets, of which 60 are used for fine-tuning the model and 236 for testing. The line distribution for this corpus is the following: paragraph (55%), quotation (12%), inline header (7%), quotation marker (7%), closing (6%), personal signature (3%), mua signature (2%), salutation (2%), visual separator (2%), section heading (1%), tabular (1%), technical (1%) and log data (0.5%). The Enron corpus does not have lines of raw code or patch.
We searched the Gmane raw corpus of emails and, following the zone classification schema
proposed by Bevendorff et al. (2020), produced a total of 625 annotated emails in Portuguese,
Spanish and French. This corpus is available at https://github.com/cleverly-ai/multilingual-email-zoning.
Table 4.6 compiles a brief description of the email statistics for each of the languages. While French is the language with the most emails, Portuguese emails tend to be longer, resulting in a greater number of lines and a higher average number of zones per email. The Spanish and French emails resemble the structure of the Bevendorff et al. (2020) Gmane emails. For the three languages, our corpus reveals a similar pattern regarding the number of lines per zone. Like the Bevendorff et al. (2020) Gmane corpus, our corpus also considers 15 zones, which can be described in detail as follows:
• paragraph: main content; new content from the current email sender.
• closing: content that closes the email (ex: Regards, Some Name).
• inline headers: contain lines from information that is used to contextualize the details of a
previous message in the thread, such as addresses, dates, recipients, etc.
• log data: contain information about configurations or events that have occurred within a
software application like error messages or loading of files.
• mua signature: include legal disclaimers, privacy statements, or other automatically
generated texts advising the people that interact with the email. Also contained under MUA
signature are automatically generated message lines that define from which device the email
was sent, mail user agent and mailing list details or advertisements (e.g. Sent from iPhone).
• patch: contain text with pieces of code used to make changes to a computer program or its
supporting data in order to update, fix, or improve it. This includes fixing security
vulnerabilities and other bugs (source code diffs).
• personal signature: content containing contact or other personal information such as name,
job title, company, phone number, etc., that is automatically inserted at the end of an email
message.
• quotation: include contents quoted from previous messages in the same conversation thread
and forwarded content from other conversations. Each line typically starts with a “>”.
• quotation marker: initiate quotation zones stating the quotation author and date of a
quotation.
• raw code: contain text regarding code being developed in any programming languages
(source code).
• salutation: contain the terms of address and recipient names at the beginning of a message.
• section heading: lines that are headers of other zones, mainly of paragraph zone.
• tabular: text in a semi-structured tabular or matrix format.
• technical: contain lines of automatic messages generated by Gmane regarding initiation,
ending or elimination of email parts; inline attachments or PGP signatures.
• visual separator: lines with no alphanumeric characters, used to divide the text between
multiple zones.
The distribution of zones is similar across the three languages, as detailed in Table 4.7. The most common zones are quotation, paragraph and mua signature, while technical, patch and section heading are the zones with the fewest occurrences. Moreover, the corpus shows a less technical nature when compared to the Bevendorff et al. (2020) Gmane corpus, since lines of log data, raw code, technical or patch are less common.
When creating the corpus, we wanted to ensure that the annotation process, i.e. the process
of manually defining the zones present in the emails, led to a correct identification of those email
zones. One common way to increase the robustness of the annotations is to have more than one
person annotating the corpus (Yan, Rosales, Fung, Subramanian, & Dy, 2014). Accordingly, the annotation was carried out by two annotators: the first a native Portuguese speaker and the second a native Spanish speaker, both with an academic background in French and fluent in the third language. Each email was annotated by both annotators using the tagtog48 annotation tool. We chose this option instead of the previously used Prodigy, since these annotations were developed outside CLEVERLY's domain and tagtog was the most efficient cloud and on-premises text annotation tool we found that does not require a premium account to access the basic annotation features.
48 https://www.tagtog.net/
Zone Portuguese (%) Spanish (%) French (%)
quotation 52.43 59.02 46.20
paragraph 16.33 17.36 27.61
mua signature 12.04 3.84 9.04
personal signature 3.93 4.47 2.00
visual separator 2.94 2.29 2.60
quotation marker 2.72 1.54 2.10
closing 2.63 2.00 3.73
log data 1.04 3.79 1.82
raw code 1.28 2.45 2.07
inline headers 2.96 0.82 1.33
salutation 0.96 0.81 1.35
tabular 0.32 0.42 0.27
technical 0.30 1.00 0.38
patch 0.02 0.20 0.02
section heading 0.15 0.04 0.03
Table 4.7 – Distribution, for each language, of the number of lines per zone in our Multilingual corpus. The distribution was obtained by averaging the values for both annotators.
4.2. MODELS
In this section, we propose five systems based on neural network architectures trained in a
supervised fashion. To face the multilingual character of email corpora, we test different embedding
methods with the objective of increasing model capacity to generalize learnings to unseen languages.
The generated embeddings are fed into a BiLSTM sentence encoder. Since email zoning can be postulated as a sequential task, in which we process sequences of tokens or sequences of email lines as input and the classification of a zone depends on the previous and posterior sequences of zones, we based the architecture of our systems on a BiLSTM network. Moreover, previous work in email zoning, namely QUAGGA (Repke & Krestel, 2018) and CHIPMUNK (Bevendorff et al., 2020), has used RNN-based networks such as GRUs or LSTMs, which are also seen in state-of-the-art text segmentation models (Badjatiya et al., 2018; Koshorek et al., 2018; Li et al., 2018; Lukasik et al., 2020). We also consider two different approaches for the system's output layer: one using a softmax layer and another using a CRF layer, which is typically used for sequential tasks.
4.2.1. Word Embeddings + BiLSTM (W-BiLSTM)
Our proposed baseline model is based on word-level embeddings and a BiLSTM. We first divide each email line 𝑠𝑗 into a set of word-level tokens, which are fed into a word2vec49 (Mikolov et al., 2013) embedding generator to train word representations 𝑤𝑛(𝑗) for each token 𝑥𝑛(𝑗), ending up with a feature matrix for each line 𝑠𝑗 = [𝑤0 , 𝑤1 , . . . , 𝑤𝑛 ].
49 https://code.google.com/archive/p/word2vec/
Each email 𝑡, composed of line-level word embeddings, is passed through a BiLSTM network that returns the last hidden state of the forward and backward networks, resulting in a matrix of line feature representations 𝑡 = [𝑓0 , 𝑓1 , . . . , 𝑓𝑗 ], which goes through a softmax layer that assigns a weight to each zone type for each sentence. The zone with the highest weight is used to classify the sentence. Figure 4.4 presents a simplified overview of the W-BiLSTM architecture, showing a single sentence entering the BiLSTM module.
Figure 4.4 – W-BiLSTM model overview. The model is divided in two parts: 1) a sentence-level
word2vec word encoder; and 2) a segmentation module that uses a BiLSTM and a softmax output
layer to classify each sentence into an email zone. Although the BiLSTM receives the sequence of
sentences in an email, for simplicity, we illustrate the process for a single sentence.
This model will hardly be able to generalize its learning to languages other than the ones used in training, since the number of out-of-vocabulary (OOV) words will be large. Nevertheless, it serves as a simple baseline against which to compare further models that use different embedding techniques.
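As an illustration, the following PyTorch sketch shows one plausible reading of the W-BiLSTM baseline: a BiLSTM encodes the word2vec token vectors of each line into a line representation, which a linear layer followed by a softmax maps to a zone. Dimensions, the number of zones and the random stand-in embeddings are assumptions for illustration only, not the exact implementation used in this work.

import torch
import torch.nn as nn

class WBiLSTMZoner(nn.Module):
    """Illustrative sketch of the W-BiLSTM baseline (not the exact thesis code)."""

    def __init__(self, emb_dim=300, hidden=128, num_zones=8):
        super().__init__()
        # BiLSTM that encodes the word2vec token vectors of one email line
        self.line_encoder = nn.LSTM(emb_dim, hidden, num_layers=1,
                                    batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.25)
        self.classifier = nn.Linear(2 * hidden, num_zones)

    def forward(self, token_embs):
        # token_embs: (num_lines, max_tokens, emb_dim) word2vec vectors of one email
        _, (h_n, _) = self.line_encoder(token_embs)
        # concatenate the final forward and backward hidden states of each line
        line_vecs = torch.cat([h_n[0], h_n[1]], dim=-1)     # (num_lines, 2 * hidden)
        return self.classifier(self.dropout(line_vecs))     # zone logits per line

# Usage with random stand-in embeddings (real word2vec vectors in practice):
logits = WBiLSTMZoner()(torch.randn(20, 30, 300))   # 20 lines, 30 tokens each
predicted_zones = logits.argmax(dim=-1)             # predicted zone index per line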
Based on the idea that many words can be translated via subword units (Sennrich et al., 2016), subwords can allow us to overcome the problem of OOV words and to retrieve meaning for words that share the same subword units. Thus, to test the effectiveness of these representations in a multilingual context, we use SentencePiece50 to divide sentences into subword units. Then, similarly to what we do in our baseline, we use word2vec to produce the subword embeddings for each sentence. Figure 4.5 illustrates this model.
50 https://github.com/google/sentencepiece
Figure 4.5 – Sw-BiLSTM model overview. The model uses the same architecture as the W-BiLSTM but
encodes sentences at the subword-level.
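For illustration, the sketch below shows how SentencePiece can be used to learn a subword vocabulary and segment email lines, with word2vec (here via gensim) trained on the resulting subword sequences. The file names, vocabulary size and library versions are assumptions, not the exact setup used in this work.

import sentencepiece as spm
from gensim.models import Word2Vec

# Learn a subword vocabulary over the training emails (hypothetical file path).
spm.SentencePieceTrainer.train(
    input="train_emails.txt",      # one email line per row
    model_prefix="email_sp",
    vocab_size=8000,
)
sp = spm.SentencePieceProcessor(model_file="email_sp.model")

# Segment each email line into subword units ("▁" marks word boundaries).
with open("train_emails.txt", encoding="utf-8") as f:
    subword_lines = [sp.encode(line, out_type=str) for line in f.read().splitlines()]

# Subword-level word2vec embeddings, later fed to the BiLSTM segmentation module.
sw_w2v = Word2Vec(sentences=subword_lines, vector_size=300, window=5, min_count=1)
print(sw_w2v.wv.most_similar(subword_lines[0][0], topn=3))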
4.2.2. Word and Subword Embeddings + BiLSTM (WSw-BiLSTM)
The usefulness of subword embeddings in multilingual tasks may come with the drawback of lower-quality representations for in-vocabulary words. Hence, in this model we use both word-level and subword-level sentence representations in the hope of benefiting from both approaches. Word and subword sentence representations are generated with word2vec and fed into parallel BiLSTMs.
Figure 4.6 – WSw-BiLSTM model overview. The model produces parallel sentence representations at
word and subword level. The representations are concatenated for each sentence and fed into the
output layer.
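A minimal sketch of this idea follows, with assumed dimensions and random stand-in embeddings: two parallel BiLSTMs encode the word-level and subword-level views of each line, and their final states are concatenated before classification.

import torch
import torch.nn as nn

# Parallel line encoders for the word-level and subword-level views of each line.
word_encoder = nn.LSTM(300, 128, batch_first=True, bidirectional=True)
subword_encoder = nn.LSTM(300, 128, batch_first=True, bidirectional=True)
classifier = nn.Linear(512, 8)                # 512 = 2 encoders x 2 directions x 128

word_embs = torch.randn(20, 30, 300)          # (lines, word tokens, dim), stand-in data
subword_embs = torch.randn(20, 45, 300)       # (lines, subword tokens, dim)

_, (hw, _) = word_encoder(word_embs)
_, (hs, _) = subword_encoder(subword_embs)
line_vecs = torch.cat([hw[0], hw[1], hs[0], hs[1]], dim=-1)   # (20, 512)
zone_logits = classifier(line_vecs)                           # zone logits per line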
4.2.3. XLM-RoBERTa Embeddings + BiLSTM (XLMR-BiLSTM)
In this model, line embeddings are derived from the multilingual pre-trained XLM-RoBERTa encoder (Conneau et al., 2020), as illustrated in Figure 4.7.
Figure 4.7 – Overview of XLM-RoBERTa embedding extraction steps.
Next, we pass all line embeddings of an email 𝑡 = [𝑠0 , 𝑠1 , . . . , 𝑠𝑘 ] into a BiLSTM, as in BERT-
BiLSTM (Lukasik et al., 2020), to derive compact line representations that encompass information
from the entire structure of the email. Our BiLSTM consists of 1 layer and 64 hidden units. Like in the
previous models, we use a softmax output layer to predict the zone class of each line in the
document. Figure 4.8 illustrates the complete system.
Figure 4.8 – XLMR-BILSTM is composed of two building blocks: 1) a multilingual sentence encoder
(XLM-RoBERTa) to derive sentence embeddings; and 2) a segmentation module that uses a BiLSTM
and a softmax output layer to classify each sentence into an email zone.
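The following sketch illustrates how XLM-RoBERTa line embeddings can be derived with the Hugging Face transformers library; the mean-pooling over token states is an assumption made for illustration, not necessarily the exact pooling strategy used in this work.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
encoder.eval()  # the pre-trained weights are kept frozen, as in XLMR-BiLSTM / OKAPI

lines = ["Hola Ana,", "Gracias por tu mensaje.", "Un saludo,"]
batch = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_states = encoder(**batch).last_hidden_state         # (lines, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)               # ignore padding tokens
    line_embs = (token_states * mask).sum(1) / mask.sum(1)     # (lines, 768)

# line_embs is the sequence [s_0, s_1, ..., s_k] fed to the BiLSTM segmentation module.
print(line_embs.shape)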
4.2.4. XLM-RoBERTa Embeddings + BiLSTM + CRF (XLMR-BiLSTM-CRF or OKAPI)
A common architecture used in sequential tasks to efficiently improve model performance is to add a CRF layer after the BiLSTM layer (Huang et al., 2015). By using a CRF layer as the output layer instead of a softmax or another activation function, the model complements the past and future input-sequence information captured by the BiLSTM layer with sequential sentence-level zone transition information modeled by the CRF layer. To test the efficiency of this method, we changed the output layer of the XLMR-BiLSTM from a softmax to a CRF layer. This architecture is shown in Figure 4.9.
Figure 4.9 – OKAPI model overview. OKAPI follows the same architecture as XLMR-BiLSTM model
except for the output layer, in which it uses CRF to classify each sentence into an email zone.
We decided to dub this model OKAPI, since most recent email zoning works from the literature have named their systems after animals, and we use this model to benchmark our results on public corpora.
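As an illustration of the output-layer change, the sketch below adds a linear-chain CRF (here via the pytorch-crf package, one possible implementation) on top of BiLSTM emissions; the dimensions and toy inputs are assumptions for illustration only.

import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf

num_zones, hidden = 15, 64
bilstm = nn.LSTM(768, hidden, batch_first=True, bidirectional=True)
emitter = nn.Linear(2 * hidden, num_zones)
crf = CRF(num_zones, batch_first=True)

line_embs = torch.randn(1, 20, 768)             # one email, 20 XLM-R line embeddings
gold_zones = torch.randint(0, num_zones, (1, 20))

states, _ = bilstm(line_embs)
emissions = emitter(states)                     # (1, 20, num_zones)

loss = -crf(emissions, gold_zones)              # negative log-likelihood for training
predicted_zones = crf.decode(emissions)         # Viterbi-decoded zone sequence per email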
We resort to metrics that have been used to evaluate other systems from the related work, since
these enable us to compare their performance against our systems. We also use other useful metrics
present in the related literature, such as the f1-score, to help us understand in more detail the
efficiency of our models. Table 4.8 compiles the evaluation metrics used to measure the
performance of our models for each corpus.
Evaluation Metrics
Corpora Overall Each zone
CLEVERLY precision/recall/ f1-score * precision/recall/f1-score
Enron & ASF (Repke & Krestel, 2018) precision/recall/ f1-score -
Gmane & Enron (Bevendorff et al., 2020) accuracy recall
New Multilingual Corpora accuracy recall
Table 4.8 – Evaluation metrics used to measure model performance for each corpus. *For CLEVERLY,
we only use overall precision and overall recall when analyzing the results of the best model for each
zone.
When evaluating model predictions, we consider the annotated zones described in section
4.1 as the ground-truth, as they have been handled by subject-matter experts and thus we are much
more certain of the correctness of the zone structure in any given email.
To evaluate the performance for each zone individually, Repke & Krestel (2018) and Bevendorff et al. (2020) use recall. Formally, recall is the ratio between true positive predictions and all observations that are actually positive:

Recall = True Positives / (True Positives + False Negatives)    (5.1)
Since most email zoning schemas consider multiple zones, the metrics for each zone are calculated via a pseudo-binary classification, where the current zone is considered the positive class and all the other zones are considered the negative class. This way, a true positive is an outcome
where the prediction for a positive class matches the ground-truth positive class, i.e. a model
correctly predicts a positive class. Similarly, true negative is an outcome where a model correctly
predicts a negative class. On the other hand, a false positive is an outcome where the model predicts
a negative class as positive, while a false negative occurs when a model predicts a positive class as
negative.
Repke & Krestel (2018) also consider precision to evaluate performance for each zone. Precision is the ratio between true positive predictions and the total number of predicted positive observations (equation 5.2). It shows how often a model is correct when it predicts a given label.

Precision = True Positives / (True Positives + False Positives)    (5.2)
Finally, to evaluate the overall performance of email zoning systems, Repke & Krestel (2018) and Bevendorff et al. (2020) use accuracy. Formally, accuracy can be defined as the ratio between the sum of true positives and true negatives, and the sum of true positives, true negatives, false positives, and false negatives:

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)    (5.3)

Since for the calculation of overall performance we are considering all classes at the same time, equation 5.3 can be simplified as the ratio between correctly predicted observations and all observations.
Nevertheless, because we are dealing with a multiclass classification51 with a large class
imbalance, i.e., some zones are not represented equally, accuracy may be a misleading measure. For
example, a system that correctly predicts the value for the majority class will achieve high accuracy
values, even if it completely fails to predict less represented classes.
For this reason, to evaluate our models in CLEVERLY corpus, instead of considering accuracy as
the overall performance metric, we will use f1-score. The f1-score is a harmonic mean between
precision and recall (see equation 5.4) and it is one of the most common evaluation metrics found in
multiclass sequential tasks, such as sequence labeling (Huang et al., 2015) or text segmentation
(Lukasik et al., 2020).
F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (5.4)
To find the overall values for precision, recall and f1-score, the value of each metric for each zone z is multiplied by the number of lines n_z of that zone, and the resulting values are then summed and divided by the total number of lines N:

Overall Precision = (1/N) × Σ_z (Precision_z × n_z)    (5.5)

Overall Recall = (1/N) × Σ_z (Recall_z × n_z)    (5.6)

Overall F1-score = (1/N) × Σ_z (F1-score_z × n_z)    (5.7)
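For reference, the support-weighted averages of equations 5.5-5.7 correspond to scikit-learn's "weighted" averaging over line-level zone labels, as in the following sketch with toy labels.

from sklearn.metrics import precision_recall_fscore_support

# Toy line-level zone labels (illustrative only).
y_true = ["body", "body", "quoted", "quoted", "quoted", "signature"]
y_pred = ["body", "quoted", "quoted", "quoted", "quoted", "body"]

overall_p, overall_r, overall_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"precision={overall_p:.2f} recall={overall_r:.2f} f1={overall_f1:.2f}")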
To assess the validity of any corpus annotations and ensure their reproducibility, it is crucial to measure the level of uniformity between each annotator's choices (inter-annotator agreement). This way, to
measure the Inter-annotator agreement regarding the annotations made for our Multilingual corpus
(section 4.1.3) we resort to accuracy, as it is the metric used to measure the overall performance of
models on the corpus, and to f1-score, which has been used in other scenarios to measure inter-
annotator agreement (Almeida, Pinto, Figueira, Mendes, & Martins, 2015). f1-score is calculated
considering one annotator (A1) versus the other (A2). In other words, first, we consider A1 as the
ground-truth and A2 as the predictions and then, the other way around.
Moreover, we also use Cohen's kappa (k) (Cohen, 1960) to measure inter-annotator agreement. This metric is generally considered more robust than accuracy and f1-score, as it accounts for the possibility of annotator agreement occurring by chance. It can be defined as follows:
51 Problem of classifying instances into one of three or more classes.
𝑘 = (𝑝0 − 𝑝𝑒) / (1 − 𝑝𝑒)    (5.8)
In equation 5.8, 𝑝0 represents the relative observed agreement between the annotators, the same as accuracy. 𝑝𝑒 is the hypothetical probability of agreement occurring by chance:
𝑝𝑒 = (1/𝑁²) × ∑𝑘 𝑛𝑘1 𝑛𝑘2    (5.9)
In equation 5.9, 𝑛𝑘𝑖 is the number of times annotator 𝑖 assigned category 𝑘 and 𝑁 is the number of observations. Total agreement between the annotators results in 𝑘 = 1, while agreement no better than the chance agreement given by 𝑝𝑒 results in 𝑘 = 0.
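The following sketch computes Cohen's kappa for two toy annotation sequences following equations 5.8 and 5.9, cross-checked against scikit-learn's implementation.

from collections import Counter
from sklearn.metrics import cohen_kappa_score

# Toy annotations from two annotators (illustrative only).
a1 = ["quotation", "paragraph", "paragraph", "closing", "quotation"]
a2 = ["quotation", "paragraph", "closing", "closing", "quotation"]

n = len(a1)
p0 = sum(x == y for x, y in zip(a1, a2)) / n                    # observed agreement
c1, c2 = Counter(a1), Counter(a2)
pe = sum(c1[k] * c2[k] for k in set(a1) | set(a2)) / n ** 2     # chance agreement
kappa = (p0 - pe) / (1 - pe)

print(round(kappa, 3), round(cohen_kappa_score(a1, a2), 3))     # both values should match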
5. RESULTS AND DISCUSSION
This chapter presents and discusses the performance of the models developed within the scope of this research. The effectiveness of our systems is evaluated by measuring their capability of segmenting CLEVERLY's emails (section 5.2.1), assessing the impact of architecture variations such as different embedding methods and output layers. Moreover, we analyze in more detail the results of the best performing model (section 5.2.2) and the impact of implementing this model in CLEVERLY's classification pipeline (section 5.2.3), by comparing the results of the downstream classification model before and after its implementation.
In section 5.3, the model with the best results on CLEVERLY corpus, OKAPI, is benchmarked in
public corpora and we compare its results with other email zoning systems from the literature
(section 5.3.1). Finally, we test OKAPI in our new multilingual corpus (section 5.3.2).
For the W-BiLSTM, Sw-BiLSTM and WSw-BiLSTM, we resort to the training corpus of each
experiment to train and extract word2vec embeddings for each token. This process is done
separately from the segmentation module and the embedding values are kept fixed throughout the
model’s training process. We experimented with a BiLSTM with 16, 32, 64, 128, 256 and 512 hidden units and with 1 and 2 layers. We tested dropout values up to 0.50, including no dropout. We also tested the training optimizers Adam (Kingma & Ba, 2015) and RMSprop, and different values for the learning rate. In all three models, the best performances were achieved using a single BiLSTM with 128 hidden units and 1 layer, with a dropout of 0.25, using the RMSprop optimizer with a fixed learning rate of 0.001 and the sparse categorical cross-entropy loss function. In the case of the WSw-BiLSTM model, we use two BiLSTMs of this kind, one to encode words and another to encode subwords.
As for the XLMR-BiLSTM and OKAPI, we use the XLM-RoBERTa pre-trained weights, which are kept frozen during training, and only the segmentation module layers are updated. Similarly to the previous models, we experimented with BiLSTMs with the same numbers of hidden units and with 1 and 2 layers but, in the end, a small segmentation module, with 64 hidden units and 1 layer, generally yielded the best performances on the validation splits. We used a dropout layer of 0.25 between the BiLSTM and the output layer, the RMSprop optimizer with a fixed learning rate of 0.001, and the sparse categorical cross-entropy loss function for the XLMR-BiLSTM and the negative log-likelihood loss function for OKAPI.
For each training language in the CLEVERLY corpus, the ideal number of epochs to train the models varies. Nonetheless, for a batch size of 32, the W-BiLSTM model is optimally trained with 8 to 12 epochs, and the Sw-BiLSTM and the WSw-BiLSTM with 12 to 16 epochs. XLMR-BiLSTM and OKAPI are optimally trained with a batch size of 16 for 5 to 10 epochs.
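For reference, the sketch below expresses the reported configuration in PyTorch terms; the stand-in module and the equivalence between CrossEntropyLoss and the sparse categorical cross-entropy are assumptions for illustration, and the actual framework and training loop are not reproduced here.

import torch
import torch.nn as nn

# Stand-in segmentation module; only used here to expose trainable parameters.
segmentation_module = nn.LSTM(768, 64, num_layers=1, batch_first=True, bidirectional=True)
dropout = nn.Dropout(p=0.25)                 # between the BiLSTM and the output layer
optimizer = torch.optim.RMSprop(segmentation_module.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()            # PyTorch counterpart of sparse categorical cross-entropy
batch_size = 16                              # 32 for the word2vec-based models
max_epochs = 10                              # 5 to 10 epochs for XLMR-BiLSTM / OKAPI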
For the public corpora experiments, OKAPI maintains the described parameter values. Nonetheless, the number of epochs used changes depending on the training corpus (see section 5.3).
We start by measuring the impact of the proposed embedding approaches on model performance, since these are the first variations we introduce in the architecture of the proposed models (section 4.2). Table 5.1 compiles these results, showing the evaluation for same-language prediction and for cross-lingual zero-shot prediction52. Overall, all five models achieve high f1-scores when tested in the
language used for training, but generally show a deterioration in the results when dealing with
unseen languages.
When comparing the word-level embedding approach of the W-BiLSTM model against the
subword-level approach of the Sw-BiLSTM model, in general, the latter seems to achieve similar or
higher f1-score, except when the models are trained in the English language. These results support
the hypothesis that subword units are especially helpful when models must deal with OOV words.
Nevertheless, since English tickets are overrepresented in CLEVERLY tickets and a myriad of
English expressions are used in emails worldwide, the OOV problem seems to be mitigated, as
models trained in English generally achieve better performances in all test scenarios compared to
when using other languages for training.
Joining both approaches in the WSw-BiLSTM model seems to have a positive impact in the
cases where the W-BiLSTM performances are superior to the Sw-BiLSTM ones. In the remaining
cases, the pattern of improvement is not clear, or performances deteriorate.
The results show that the best embedding solution is the XLM-RoBERTa pre-trained embeddings (Conneau et al., 2020). The XLMR-BiLSTM achieves a superior f1-score for same-language experiments, while remaining competitive for cross-lingual zero-shot email zoning, demonstrating the effectiveness of pre-trained contextual multilingual embeddings when compared to the simpler neural word embeddings of word2vec (Mikolov et al., 2013), both for monolingual and multilingual tasks.
52 Task of testing a model in languages not seen during training.
Test Language
Train Language   Model          English  Portuguese  Spanish  French  Italian
English          W-BiLSTM       0.89     0.56        0.65     0.79    0.82
English          Sw-BiLSTM      0.88     0.57        0.73     0.76    0.78
English          WSw-BiLSTM     0.89     0.56        0.70     0.79    0.83
English          XLMR-BiLSTM    0.92     0.81        0.91     0.93    0.93
Portuguese       W-BiLSTM       0.33     0.76        0.56     0.49    0.32
Portuguese       Sw-BiLSTM      0.41     0.71        0.61     0.56    0.49
Portuguese       WSw-BiLSTM     0.42     0.76        0.59     0.57    0.53
Portuguese       XLMR-BiLSTM    0.82     0.93        0.88     0.92    0.93
Spanish          W-BiLSTM       0.50     0.40        0.85     0.58    0.52
Spanish          Sw-BiLSTM      0.51     0.42        0.82     0.60    0.72
Spanish          WSw-BiLSTM     0.45     0.41        0.85     0.67    0.59
Spanish          XLMR-BiLSTM    0.67     0.63        0.90     0.85    0.86
French           W-BiLSTM       0.35     0.35        0.56     0.87    0.51
French           Sw-BiLSTM      0.36     0.36        0.68     0.82    0.68
French           WSw-BiLSTM     0.30     0.31        0.51     0.89    0.45
French           XLMR-BiLSTM    0.81     0.61        0.86     0.95    0.91
Italian          W-BiLSTM       0.55     0.35        0.56     0.56    0.75
Italian          Sw-BiLSTM      0.60     0.33        0.62     0.60    0.76
Italian          WSw-BiLSTM     0.60     0.38        0.66     0.63    0.77
Italian          XLMR-BiLSTM    0.77     0.54        0.85     0.87    0.94
Table 5.1 – f1-score comparison of embedding method impact for CLEVERLY corpus. Models
were trained and tested for each language. Best result is highlighted for each train language.
Following the same evaluation approach, we also test the impact of using a CRF output layer instead of a standard softmax output layer. These results are compiled in Table 5.2.
Overall, OKAPI achieves an equal or slightly superior f1-score compared to XLMR-BiLSTM. Even though having CRF as the output layer does not show significant performance improvements when the training is done in English or Portuguese (the two languages with the most emails), the positive impact of the CRF becomes more evident for smaller training corpora. This pattern is more perceptible for cross-lingual zero-shot prediction, with Spanish and Italian being the two training languages where the f1-score improvement of OKAPI is more noticeable. Among the test languages, English is the only one where there is a clear pattern of one model outperforming the other, with OKAPI achieving a higher f1-score.
In conclusion, the evaluation results indicate that using XLM-RoBERTa pre-trained embeddings and having CRF as the output layer generally leads to the best results on the CLEVERLY corpus. This way, in the following sections, we resort to OKAPI for the remaining tests regarding CLEVERLY's case study and the public corpora.
Test Language
Train Language   Model                      English  Portuguese  Spanish  French  Italian
English          XLMR-BiLSTM                0.92     0.81        0.91     0.93    0.93
English          OKAPI (XLMR-BiLSTM-CRF)    0.92     0.86        0.91     0.93    0.90
Portuguese       XLMR-BiLSTM                0.82     0.93        0.88     0.92    0.93
Portuguese       OKAPI (XLMR-BiLSTM-CRF)    0.83     0.92        0.88     0.90    0.85
Spanish          XLMR-BiLSTM                0.67     0.63        0.90     0.85    0.86
Spanish          OKAPI (XLMR-BiLSTM-CRF)    0.81     0.76        0.91     0.93    0.92
French           XLMR-BiLSTM                0.81     0.61        0.86     0.95    0.91
French           OKAPI (XLMR-BiLSTM-CRF)    0.82     0.61        0.86     0.95    0.91
Italian          XLMR-BiLSTM                0.77     0.54        0.85     0.87    0.94
Italian          OKAPI (XLMR-BiLSTM-CRF)    0.80     0.63        0.85     0.89    0.95
Table 5.2 – f1-score comparison of output layer impact for the model with higher performing
embedding method. The models were trained and tested for each language. Best result is highlighted
for each train-test language combination.
To understand in more depth OKAPI’s email zoning effectiveness, we also detail the model’s
performance at the zone level. Table 5.3 presents OKAPI’s precision, recall and f1-score results
discriminated by zone. To train the model, we compile every language training set into a single
multilingual corpus (compiled multilingual training set). An equivalent process was followed to obtain
the validation and test corpus for this experiment.
OKAPI achieves an overall precision of 93%, overall recall of 94% and overall f1-score of 93%.
Half of the zones have precision, recall and f1-score values above 90%. The model is better at classifying zones with more representation – quoted (49% of the lines) and body (26% of the lines) are the two zones with the highest f1-score values, 98% and 94% respectively; while for attachment (only 0.3% of the lines) the model achieves an f1-score of only 21%.
There is also a clear pattern showing that OKAPI typically achieves recall values that are higher than precision values for zones with more representation in the corpus, and the opposite for zones with less representation. In other words, the model typically predicts more false positives than false negatives for zones with more occurrences, and fewer false positives than false negatives for zones with fewer occurrences.
To better understand the previous pattern, Figure 5.1 presents the confusion matrix for the
zone-level evaluation. The most notable prediction error OKAPI makes is to mistake attachment lines
for body lines, with 73% of the attachment lines being classified as body. Context details (5%),
greeting (6%) and signoff (10%) are most commonly confused with body, while disclaimer (9%) and signature (8%) are more often confused with quoted. In both scenarios, zones are confused with the surrounding zone with more representation (see Table 4.2). When body and quoted lines are wrongly predicted, they tend to be confused with each other, with body being predicted as quoted 2% of the time and quoted predicted as body less than 1% of the time.
Figure 5.1 – Confusion matrix for OKAPI’s email zoning results on CLEVERLY’s multilingual corpus. On
the left the true labels and on the bottom the predicted labels. The darker the square, the more lines
are predicted with the column’s label.
Additionally, we experiment with the hypothesis that only body lines should be retained for
the downstream classification model. Hence, we reduce CLEVERLY’s zoning schema to a two-level
schema, in which, in addition to the body zone, we consider all other zones as the zone other. Table
5.4 compiles the results for this experiment. OKAPI achieves an overall accuracy of 96% and f1-score of
97%. In this case, for the body zone, recall is perhaps the most important measure as it accounts for
the number of false negative predictions – lines that actually belong to body but are predicted as
other. A large number of false negative predictions results in a decrease in the amount of useful
training text provided to the classifier and, consequently, can lead to a deterioration in its
effectiveness. Although only 5% of the body lines are filtered out (recall is 95%), ideally no body instances should be wrongly removed.
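As an illustration, the following sketch collapses per-line zone predictions into the two-level body/other schema and keeps only the predicted body text for the downstream classifier; the zone names and ticket content are invented for illustration.

# Zone names and ticket content are invented for illustration.
def reduce_to_two_level(lines_with_zones):
    """Collapse the full zoning schema into body vs. other."""
    return [(line, "body" if zone == "body" else "other")
            for line, zone in lines_with_zones]

def keep_body_text(lines_with_zones):
    """Keep only the lines predicted as body for the downstream classifier."""
    return "\n".join(line for line, zone in lines_with_zones if zone == "body")

ticket = [("Hi team,", "greeting"),
          ("My order never arrived.", "body"),
          ("Thanks,", "signoff"),
          ("Sent from my iPhone", "signature")]
print(keep_body_text(reduce_to_two_level(ticket)))   # -> "My order never arrived."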
To measure the impact of OKAPI when implemented in CLEVERLY’s ticket classification pipeline, we
analyze the time taken by OKAPI in the main steps of its email zoning process, and compare the
accuracy of CLEVERLY’s downstream ticket classification model when using OKAPI against when using
CLEVERLY’s current solution. Thus, we used OKAPI trained with the compiled multilingual training set
from the CLEVERLY corpus to zone 84,929 other tickets mostly in English and Portuguese, but also in
French, Indonesian, Latin, Kinyarwanda, and other unidentified languages.
In Table 5.5 we measure the time taken by OKAPI to produce the embeddings for each ticket and ticket line, the segmentation time per ticket (time taken by the segmentation module), and the overall zoning time – the sum of the time taken to produce the embeddings and the time taken to segment a ticket into zones.
With an average sentence size of 14.5 XLM-R subword units and an average of 20.3 lines per
ticket, OKAPI takes, on average, 0.06 seconds to generate a XLM-RoBERTa line embedding and 1.21
seconds to generate all line embeddings in a ticket. The segmentation process is much quicker than
the embedding generation, taking on average 0.01 seconds per ticket. These results indicate that the
embedding layer, despite its contribution to the improvement of email zoning performance, can be a
bottleneck regarding ticket processing time. Comparing the time statistics of our model with those of CLEVERLY's current solution would allow us to better contextualize OKAPI's times. Nevertheless, those time statistics are not available.
Furthermore, we compare CLEVERLY’s classifier accuracy using OKAPI against the classifier’s
accuracy using the current solution. Table 5.6 compiles both accuracies. The results show that the
introduction of OKAPI in the classification pipeline does not have an impact on the classifier’s
performance. The fact that both solutions lead to the same classification accuracy (71%), indicates
that after a simple filtering from the current preprocessing method, the classifier, probably due to its
own effectiveness, is already able to distinguish relevant content from noisy content in a ticket.
Resorting to the numbers reported in the email zoning literature, we compared OKAPI with existing
monolingual methods using different English corpora and zoning schemas.
Table 5.7 compares OKAPI with three other zoning systems on the corpora annotated by
Repke & Krestel (2018) with 2 and 5 zones. To ensure comparability between the results, we strictly
follow the train, validation and test division done by Repke & Krestel (2018) and presented in section
4.1.2.1. We trained OKAPI with a batch size of 16, for 9 epochs on the Enron corpus training set and 7
epochs on the ASF corpus training set.
JANGADA (Carvalho & Cohen, 2004) achieves comparable results to the ones found in its
original implementation, in particular for a 2-zone classification (97% for the ASF corpus). ZEBRA
(Lampert et al., 2009) does not reproduce the same performance with accuracies below 25% for a 5-
zone classification. From the three literature models, QUAGGA (Repke & Krestel, 2018) achieves the
highest accuracies, with 98% for both corpora with a 2-zone classification, and, for a 5-zone
classification, 93% and 95% accuracy, for the Enron and ASF corpus, respectively.
OKAPI shows excellent performances and, in general, surpasses the results of QUAGGA. It
produces an almost perfect 2-zone classification for both corpora, with 99% accuracy; for a 5-zone
classification our model outperforms QUAGGA with 96% accuracy in the Enron corpus and achieves
the same 95% accuracy as Repke & Krestel (2018) model in the ASF corpus.
Model Zones Enron ASF
JANGADA 2 0.89/0.88/0.88 0.97/0.97/0.97
ZEBRA 2 0.66/0.25/0.25 0.88/0.18/0.18
QUAGGA 2 0.98/0.98/0.98 0.98/0.98/0.98
OKAPI 2 0.99/0.99/0.99 0.99/0.99/0.99
JANGADA 5 0.82/0.85/0.85 0.92/0.90/0.91
ZEBRA 5 0.60/0.25/0.24 0.81/0.20/0.20
QUAGGA 5 0.93/0.93/0.93 0.95/0.95/0.95
OKAPI 5 0.96/0.96/0.96 0.95/0.95/0.95
Table 5.7 – OKAPI email zoning performance (precision/recall/accuracy) compared to various models
from the literature, for Repke & Krestel (2018) corpora in a 2-zone and 5-zone schema.
Table 5.8 shows the results obtained by OKAPI in Bevendorff et al. (2020) corpora and zoning
schema. To train our model, we follow Bevendorff et al. (2020) original train and test division.
Nevertheless, since the authors did not provide a set for model validation, we used 10% of the train
English Gmane corpus to validate our model. We choose a small split so that we maintain, as much as
possible, similar conditions to the ones followed by Bevendorff et al. (2020). OKAPI was trained with a
batch size of 16, for 11 epochs on Gmane corpus training set and 4 epochs on the Enron training set
to fine tune the model and test it on the Enron test set.
Gmane Enron
OKAPI CHIPMUNK QUAGGA Tang OKAPI CHIPMUNK QUAGGA Tang
all zones 0.96 0.96 0.94 0.80 0.88 0.88 0.83 0.72
quotation 0.98 0.99 0.99 0.99 0.94 0.99 0.88 0.85
patch 0.98 0.95 0.95 0.46 - - - -
paragraph 0.94 0.93 0.90 0.90 0.94 0.95 0.91 0.89
log data 0.83 0.84 0.77 - 0.00 0.24 0.74 -
mua signature 0.92 0.91 0.93 0.45 0.65 0.51
0.40 0.21
personal signature 0.86 0.77 0.85 0.73 0.85 0.78
Table 5.8 – OKAPI email zoning overall accuracy and 6 most common zones recall, compared to
various models, under the 15-level zoning schema and corpora of Bevendorff et al. (2020).
OKAPI and CHIPMUNK (Bevendorff et al., 2020) achieve the highest performance for the Gmane corpus, with an all-zones accuracy of 96%. Although OKAPI, CHIPMUNK and QUAGGA (94%) perform almost equally well, OKAPI seems to achieve the most consistent performance across different zones, reaching the highest recall in 3 zones – patch (98%), paragraph (94%) and personal signature (86%). On the other hand, CHIPMUNK achieves the highest recall for quotation (99%) and log data (84%), QUAGGA also achieves the highest recall for quotation (99%) and mua signature (93%), and, while Tang et al. (2005) present the lowest all-zones accuracy (80%), it also reaches 99% recall for quotation.
Regarding the Enron corpus, OKAPI reaches 88% accuracy for all zones, matching, once again,
CHIPMUNK’s performance. QUAGGA reaches an all zones accuracy of 83% and Tang presents the lowest
all zones accuracy with 72%. Even though our model matches CHIPMUNK’s overall accuracy, when
looking at each one of the six most common zones, OKAPI achieves a lower recall. This apparent
discrepancy between the overall performance and the zone-level performance results from the high recall values OKAPI achieves for zones which are not among the six most common zones but are also substantially represented – salutation (94% recall), visual separator (92%), closing (84%) and inline headers (82%).
We further analyze how OKAPI adapts to new domains by comparing its performance against QUAGGA (Repke & Krestel, 2018) in a domain adaptation task, i.e. when evaluated on a different corpus than the one used for training. Table 5.9 compiles these results.
OKAPI clearly outperforms QUAGGA, for both two-zone and five-zone schema, indicating a
superior ability to generalize to unseen domains. Moreover, the results indicate that training the
models with the Enron corpus (98% and 93% accuracy) leads to a better generalization than the ASF
corpus (97% and 88% accuracy).
Model    Training Corpus    Test Corpus    accuracy (2 zones)    accuracy (5 zones)
QUAGGA   Enron              ASF            0.94                  0.86
OKAPI    Enron              ASF            0.98                  0.93
QUAGGA   ASF                Enron          0.86                  0.80
OKAPI    ASF                Enron          0.97                  0.88
Table 5.9 – Comparison of OKAPI and QUAGGA capacity to generalize learnings, for Repke & Krestel
(2018) Enron and ASF corpora.
We evaluate the Inter-annotator agreement for our multilingual corpus. Table 5.10 shows the inter-
annotator agreement scores for each language using Cohen’s kappa coefficient (k), accuracy and f1-
score of one annotator versus the other.
The inter-annotator agreement scores reveal a high consensus between the annotators (Landis & Koch, 1977). French is the language that achieves the highest scores in all metrics, followed by Portuguese and Spanish. While the values for accuracy and f1-score are generally the same, Cohen's kappa (k) naturally shows lower scores, since this metric discounts agreement occurring by chance.
Compared with previously seen results on other email zoning corpora and zoning schemas, in
particular for the English Gmane corpus and Enron corpus (see Table 5.8), OKAPI achieves competitive
performances for the multilingual corpus, confirming its cross-lingual capacity. For the French and
Spanish sets, OKAPI achieves an overall accuracy of 93%, and 91% for the Portuguese corpus.
Corpus Language
Zone Portuguese Spanish French
All zones 0.91 0.92 0.93
quotation 0.99 0.99 0.99
paragraph 0.91 0.96 0.92
mua signature 0.95 0.82 0.91
personal signature 0.81 0.87 0.79
visual separator 0.92 0.90 0.96
quotation marker 0.55 0.97 0.97
closing 0.59 0.58 0.69
log data 0.56 0.53 0.57
raw code 0.54 0.74 0.84
inline header 0.78 0.77 0.58
salutation 0.65 0.69 0.89
tabular 0.30 0.00 0.60
technical 0.67 0.56 0.48
patch 0.00 0.00 0.00
section header 0.34 0.00 0.00
Table 5.11 – Multilingual zero-shot evaluation of OKAPI trained with Bevendorff et al. (2020) English
Gmane Corpus. Global accuracy (all zones) and each zone recall computed by averaging the scores
for both sets of annotations.
Based on the corpus statistics (see Table 4.6), we believe that the difference in overall accuracy between languages may be explained by the greater number of lines in the Portuguese corpus, when compared to the other two, and by the longer Portuguese emails, which may lead to more prediction errors on this set.
Furthermore, in general, the model achieves higher recalls for the most represented zones, such as quotation and paragraph, while failing to predict patch lines, since these rarely appear in the corpus. This pattern becomes more obvious when we look at specific zones; for example, mua signature appears less in Spanish emails (4%), thus having a lower recall (82%), but appears more in Portuguese (12%) and French (9%) emails, reaching higher recalls of 95% and 91%, respectively.
6. CONCLUSIONS
With this research, we developed a full multilingual email zoning solution for a Customer Service
company and enriched the email zoning literature with a new multilingual email zoning corpus and
with the first multilingual email zoning system, competitive with current state-of-the-art solutions for
English emails and that effectively segments emails from unseen languages.
Motivated by CLEVERLY's need to improve the efficiency and performance of their email classification pipeline through the implementation of an effective multilingual email zoner, we collected and annotated a total of 15,547 tickets, from 14 of CLEVERLY's clients, in 5 different languages, following an eight-zone classification schema defined based on the company's business needs and the structure of their clients' tickets.
We developed five email zoning models based on neural-network architectures and used the
collected tickets to train and test these models for same language prediction and cross-lingual zero-
shot prediction. Our experiments showed that using pre-trained cross-lingual contextual embeddings, namely XLM-RoBERTa (Conneau et al., 2020) embeddings, leads to performance improvements, most notably for cross-lingual zero-shot tasks, when compared with the results obtained using word2vec (Mikolov et al., 2013). Likewise, we demonstrated that a CRF output layer
generally results in better model performance compared to a softmax. This way, the best email
zoning results were achieved with the combination of XLM-RoBERTa embeddings and a BiLSTM-CRF
segmentation module, a system we dubbed OKAPI.
OKAPI's email zoning times were recorded, and the system was tested as an upstream email processor in CLEVERLY's classification pipeline. Our tests showed that the performance of the classification model did not change, possibly due to its own ability to distinguish relevant content from noise.
Moreover, we compared OKAPI with other email zoning models from the literature. For
English email zoning, OKAPI attained state-of-the-art performance in Repke & Krestel (2018) corpora,
including in a domain adaptation task. Likewise, for Bevendorff et al. (2020) corpora, our model
outperformed or was competitive with existing approaches. Finally, OKAPI also achieved high
performances in our new multilingual corpus, proving to be able to perform equally well in different
languages.
7. LIMITATIONS AND RECOMMENDATIONS FOR FUTURE WORKS
One clear limitation of this work, and of the email zoning task in general, is that the formal layout of emails and, consequently, the email zones present can vary depending on the corpus and domain. This absence of a fixed email zoning schema can be especially challenging for a startup company like CLEVERLY, which deals with an increasing client base, and will inevitably lead to continual updates of the 8 zones proposed by us.
Another limitation of this research is that we were not able to compare OKAPI against
CLEVERLY’s current email preprocessing solution in the annotated multilingual corpus we collected for
CLEVERLY. Comparing both solutions would allow us to measure which one is more effective in the email zoning task and, eventually, to understand why both approaches result in the same classifier accuracy. Likewise, we were also not able to analyze CLEVERLY's current time statistics and compare them with OKAPI's.
For future work, it would be useful to explore other pre-trained embedding types, namely
other multilingual and monolingual transformer-based embeddings, like XLM (Lample & Conneau,
2019) and BERT (Devlin et al., 2019), respectively. This way we could evaluate, for example, how
performance improvement in cross-lingual tasks is related with the cross-lingual characteristics of
these language models or if other captured features are more important.
Another experiment that may leverage model performance is to train the XLM-RoBERTa
(Conneau et al., 2020) embeddings module together with the rest of the network, updating its
weights based on the training data to improve OKAPI’s capacity to extract specific corpus features.
Also, we would like to experiment changing the segmentation module base layer from a BiLSTM to
another network, such as a Transformer (Vaswani et al., 2017), as shown by Lukasik et al. (2020).
Moreover, it would be interesting to extend the discussion of monolingual and cross-lingual zero-shot performance broken down by zone. This would be a great way to further understand why model performance in some zones is much lower than in others and why there is little or no transfer between some languages for particular zones. Even though we know that this
can be partially explained by the fact that some zones are underrepresented, it would be of interest
to know, for example, if language or special indicator characters are used differently between
languages and zones.
Lastly, we would like to keep contributing to the email zoning literature by collecting more emails for our multilingual corpus of Portuguese, Spanish and French emails and by releasing new annotated corpora in more languages, resorting, for example, to the Gmane repository, which still remains almost completely unexplored.
8. BIBLIOGRAPHY
Almeida, M. S. C., Pinto, C., Figueira, H., Mendes, P., & Martins, A. F. T. (2015). Aligning Opinions:
Cross-Lingual Opinion Mining with Dependencies. Proceedings of the 53rd Annual Meeting of
the Association for Computational Linguistics and the 7th International Joint Conference on
Natural Language Processing, 1, 408–418. https://doi.org/10.3115/v1/P15-1040
Badjatiya, P., Kurisinkel, L. J., Gupta, M., & Varma, V. (2018). Attention-Based Neural Text
Segmentation. Lecture Notes in Computer Science, 10772 LNCS, 180–193.
https://doi.org/10.1007/978-3-319-76941-7_14
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align
and Translate. 3rd International Conference on Learning Representations, ICLR 2015.
Baum, L. E., & Petrie, T. (1966). Statistical Inference for Probabilistic Functions of Finite State Markov
Chains. Ann. Math. Statist., 37(6), 1554–1563. https://doi.org/10.1214/aoms/1177699147
Beeferman, D., Berger, A., & Lafferty, J. (1999). Statistical Models for Text Segmentation. Machine
Learning, 34(1), 177–210. https://doi.org/10.1023/A:1007506220214
Bernardo, J., Bayarri, M., Berger, J., Dawid, A., Heckerman, D., Smith, A., … Lasserre, J. (2007).
Generative or Discriminative? Getting the Best of Both Worlds. BAYESIAN STATISTICS, 8, 3–24.
Bettenburg, N., Adams, B., Hassan, A. E., & Smidt, M. (2011). A Lightweight Approach to Uncover
Technical Artifacts in Unstructured Data. International Conference on Program Comprehension,
185–188. https://doi.org/10.1109/ICPC.2011.36
Bevendorff, J., Khatib, K. Al, Potthast, M., & Stein, B. (2020). Crawling and Preprocessing Mailing Lists
At Scale for Dialog Analysis. Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics - ACL 2020, 1151–1158. https://doi.org/10.18653/v1/2020.acl-
main.108
Carlson, L., Marcu, D., & Okurowski, M. E. (2003). Building a Discourse-Tagged Corpus in the
Framework of Rhetorical Structure Theory. Current and New Directions in Discourse and
Dialogue, 85–112. https://doi.org/10.1007/978-94-010-0019-2_5
Carvalho, V. R., & Cohen, W. W. (2004). Learning to Extract Signature and Reply Lines from Email.
First Conference on Email and Anti-Spam (CEAS). Retrieved from
https://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
Cauchy, A.-L. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées. 25(2), 536–538.
Chen, H., Hu, J., & Sproat, R. W. (1999). Integrating geometrical and linguistic analysis for email
signature block parsing. ACM Transactions on Information Systems, 17(4), 343–366.
https://doi.org/10.1145/326440.326442
Chen, M. X., Cao, Y., Tsay, J., Chen, Z., Lee, B. N., Zhang, S., … Wu, Y. (2019). Gmail smart compose:
Real-time assisted writing. Proceedings of the ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2287–2295. https://doi.org/10.1145/3292500.3330723
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y.
(2014). Learning phrase representations using RNN encoder-decoder for statistical machine
translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing ({EMNLP}), 1724–1734. https://doi.org/10.3115/v1/D14-1179
Choi, F. Y. Y. (2000). Advances in domain independent linear text segmentation. 6th Applied Natural
Language Processing Conference - ANLP 2000, 26–33. Retrieved from
https://www.aclweb.org/anthology/A00-2004/
Christidis, P., & Losada, Á. G. (2019). Email Based Institutional Network Analysis: Applications and
Risks. The Social Sciences, 8(11), 306. https://doi.org/10.3390/socsci8110306
Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological
Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104
Collins, M. (2002). Discriminative Training Methods for Hidden Markov Models: Theory and
Experiments with Perceptron Algorithms. Proceedings of the 2002 Conference on Empirical
Methods in Natural Language Processing - EMNLP 2002, 1–8.
https://doi.org/10.3115/1118693.1118694
Collobert, R., & Weston, J. (2008). A Unified Architecture for Natural Language Processing: Deep
Neural Networks with Multitask Learning. Proceedings of the 25th International Conference on
Machine Learning - ICML ’08, 160–167. https://doi.org/10.1145/1390156.1390177
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., … Stoyanov, V.
(2020). Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, 8440–8451.
https://doi.org/10.18653/v1/2020.acl-main.747
Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20, 273–297.
https://doi.org/10.1007/BF00994018
Coussement, K., & den Poel, D. Van. (2008). Improving customer complaint management by
automatic email classification using linguistic style features as predictors. Decision Support
Systems, 44(4), 870–882. https://doi.org/https://doi.org/10.1016/j.dss.2007.10.010
Curry, H. B. (1944). The method of steepest descent for non-linear minimization problems. Quarterly
of Applied Mathematics, 2(3), 258–261. Retrieved from http://www.jstor.org/stable/43633461
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics, 1, 4171–4186.
https://doi.org/10.18653/v1/n19-1423
Estival, D., Gaustad, T., Pham, S. B., Radford, W., & Hutchinson, B. (2007). Author profiling for English
emails. Proceedings of the 10th Conference of the Pacific Association for Computational
Linguistics, 263–272.
Fix, E., & Hodges, J. L. (1989). Discriminatory Analysis. Nonparametric Discrimination: Consistency
Properties. International Statistical Review / Revue Internationale de Statistique, 57(3), 238–
247. Retrieved from http://www.jstor.org/stable/1403797
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of
pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202.
https://doi.org/10.1007/BF00344251
Gage, P. (1994). A New Algorithm for Data Compression. C Users J., 12(2), 23–38.
Glavaš, G., & Somasundaran, S. (2020). Two-level transformer and auxiliary coherence modeling for
improved text segmentation. The ThirtyFourth AAAI Conference on Artificial Intelligence (AAAI-
20), 2306–2315. https://doi.org/10.1609/aaai.v34i05.6284
Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and
other neural network architectures. Neural Networks, 18(5), 602–610.
https://doi.org/10.1016/j.neunet.2005.06.042
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–
1780. https://doi.org/10.1162/neco.1997.9.8.1735
Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv prep.
Retrieved from http://arxiv.org/abs/1508.01991
Jardim, B., Rei, R., & Almeida, M. S. C. (2021). Multilingual Email Zoning. EACL 2021 Student Research
Workshop. Retrieved from https://arxiv.org/abs/2102.00461
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of Tricks for Efficient Text Classification.
Proceedings of the 15th Conference of the European Chapter of the Association for
Computational Linguistics., 2, 427–431. Retrieved from
https://www.aclweb.org/anthology/E17-2068
Kiefer, J., & Wolfowitz, J. (1952). Stochastic Estimation of the Maximum of a Regression Function. The
Annals of Mathematical Statistics, 23(3), 462–466. Retrieved from
http://www.jstor.org/stable/2236690
Kim, Y., Jernite, Y., Sontag, D. A., & Rush, A. M. (2016). Character-Aware Neural Language Models.
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2741–2749. Retrieved
from http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12489
Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. In Y. Bengio & Y. LeCun
(Eds.), 3rd International Conference on Learning Representations. Retrieved from
http://arxiv.org/abs/1412.6980
Klimt, B., & Yang, Y. (2004). The Enron Corpus: A New Dataset for Email Classification Research.
Lecture Notes in Artificial Intelligence, 217–226. https://doi.org/10.1007/978-3-540-30115-8_22
Kocayusufoglu, F., Sheng, Y., Vo, N., Wendt, J., Zhao, Q., Tata, S., & Najork, M. (2019). RiSER: Learning
Better Representations for Richly Structured Emails. The World Wide Web Conference, 886–895.
https://doi.org/10.1145/3308558.3313720
Koehler, J., Fux, E., Herzog, F. A., Lötscher, D., Waelti, K., Imoberdorf, R., & Budke, D. (2018). Towards
intelligent process support for customer service desks: Extracting problem descriptions from
noisy and multi-lingual texts. Lecture Notes in Business Information Processing, 308, 36–52.
https://doi.org/10.1007/978-3-319-74030-0_3
Koshorek, O., Cohen, A., Mor, N., Rotman, M., & Berant, J. (2018). Text Segmentation as a Supervised
Learning Task. NAACL HLT 2018 - 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies - Proceedings of the
Conference, 2, 469–473. https://doi.org/10.18653/v1/n18-2075
Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with
Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), 66–75. https://doi.org/10.18653/v1/P18-
1007
Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). Conditional Random Fields: Probabilistic
Models for Segmenting and Labeling Sequence Data. Proceedings of the Eighteenth
International Conference on Machine Learning, 282–289. San Francisco, CA, USA: Morgan
Kaufmann Publishers Inc.
Lampert, A., Dale, R., & Paris, C. (2009). Segmenting email message text into zones. Proceedings of
the 2009 Conference on Empirical Methods in Natural Language Processing, 919–928.
Lampert, A., Dale, R., & Paris, C. (2010). Detecting Emails Containing Requests for Action. Human
Language Technologies: The 2010 Annual Conference of the North American Chapter of the
Association for Computational Linguistics, 984–992. Retrieved from
https://www.aclweb.org/anthology/N10-1142
Lample, G., & Conneau, A. (2019). Cross-lingual Language Model Pretraining. ArXiv. Retrieved from
http://arxiv.org/abs/1901.07291
Landis, J. R., & Koch, G. G. (1977). The Measurement of Observer Agreement for Categorical Data.
Biometrics, 33(1), 159–174. Retrieved from http://www.jstor.org/stable/2529310
Lang, K. (1995). NewsWeeder: Learning to Filter Netnews. Proceedings of the Twelfth International
Conference on International Conference on Machine Learning, 331–339.
https://doi.org/https://doi.org/10.1016/B978-1-55860-377-6.50048-7
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989).
Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541–
551. https://doi.org/10.1162/neco.1989.1.4.541
Li, J., Sun, A., & Joty, S. R. (2018). SegBot: A Generic Neural Text Segmentation Model with Pointer
Network. Proceedings of the Twenty-Seventh International Joint Conference on Artificial
Intelligence, 4166–4172. https://doi.org/10.24963/ijcai.2018/579
Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Numerical
Mathematics, 16(2), 146–160. https://doi.org/10.1007/BF01931367
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). RoBERTa: A Robustly
Optimized BERT Pretraining Approach. CoRR, abs/1907.1. Retrieved from
http://arxiv.org/abs/1907.11692
Lukasik, M., Dadachev, B., Papineni, K., & Simões, G. (2020). Text Segmentation by Cross Segment
Attention. Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), 4707–4716. https://doi.org/10.18653/v1/2020.emnlp-main.380
Luong, T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural
Machine Translation. Proceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing, 1412–1421. https://doi.org/10.18653/v1/D15-1166
Markov, A. A. (1953). The Theory of Algorithms. Journal of Symbolic Logic, 18(4), 340–341.
https://doi.org/10.2307/2266585
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The
Bulletin of Mathematical Biophysics, 5(4), 115–133. https://doi.org/10.1007/BF02478259
Mikolov, Tomas, Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word
Representations in Vector Space. In Y. Bengio & Y. LeCun (Eds.), 1st International Conference on
Learning Representations. Retrieved from http://arxiv.org/abs/1301.3781
Mikolov, Tomas, Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations
of Words and Phrases and their Compositionality. Proceedings of the 26th International
Conference on Neural Information Processing Systems, 2, 3111–3119.
Mikolov, Tomáš, Sutskever, I., Deoras, A., Le, H.-S., Kombrink, S., & Cernocky, J. (2012). Subword
language modeling with neural networks. In Unpublished.
Mosteller, F., & Tukey, J. W. (1968). Data analysis, including statistics. Handbook of Social Psychology,
2, 80–203.
Nießen, S., & Ney, H. (2000). Improving SMT Quality with Morpho-Syntactic Analysis. Proceedings of
the 18th Conference on Computational Linguistics - Volume 2, 1081–1085.
https://doi.org/10.3115/992730.992809
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation.
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing,
1532–1543. https://doi.org/10.3115/v1/d14-1162
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep
Contextualized Word Representations. Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies, 2227–2237. https://doi.org/10.18653/v1/n18-1202
Proskurnia, J., Cartright, M.-A., Garcia-Pueyo, L., Krka, I., Wendt, J. B., Kaufmann, T., & Miklos, B.
(2017). Template Induction over Unstructured Email Corpora. Proceedings of the 26th
International Conference on World Wide Web, 1521–1530.
https://doi.org/10.1145/3038912.3052631
Qaroush, A., Khater, I. M., & Washaha, M. (2012). Identifying spam e-mail based-on statistical header
features and sender behavior. ACM International Conference Proceeding Series, 771–778.
https://doi.org/10.1145/2381716.2381863
Reimers, N., & Gurevych, I. (2020). Sentence-BERT: Sentence embeddings using siamese BERT-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982–3992. https://doi.org/10.18653/v1/d19-1410
Repke, T., & Krestel, R. (2018). Bringing back structure to free text email conversations with recurrent
neural networks. Advances in Information Retrieval, 114–126. https://doi.org/10.1007/978-3-
319-76941-7_9
Robbins, H., & Monro, S. (1951). A Stochastic Approximation Method. The Annals of Mathematical
Statistics, 22(3), 400–407. https://doi.org/10.1214/aoms/1177729586
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating
errors. Nature, 323, 533–536.
Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with
Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics, Volume 1: Long Papers, 1715–1725. https://doi.org/10.18653/v1/p16-1162
Sneiders, E. (2016). Review of the main approaches to automated email answering. Advances in
Intelligent Systems and Computing, 444, 135–144. https://doi.org/10.1007/978-3-319-31232-
3_13
Søgaard, A., Ruder, S., & Vulić, I. (2018). On the limitations of unsupervised bilingual dictionary induction. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 778–788. https://doi.org/10.18653/v1/p18-1072
Srivastava, N., Hinton, G., Krizhevsky, A., & Salakhutdinov, R. (2014). Dropout: A Simple Way to
Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1), 1929–
1958.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks.
Proceedings of the 27th International Conference on Neural Information Processing Systems -
Volume 2, 3104–3112. Cambridge, MA, USA: MIT Press.
Tang, J., Li, H., Cao, Y., & Tang, Z. (2005). Email data cleaning. Proceedings of the Eleventh ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, 489–498.
https://doi.org/10.1145/1081870.1081926
Tenney, I., Das, D., & Pavlick, E. (2019). BERT rediscovers the classical NLP pipeline. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4593–4601. https://doi.org/10.18653/v1/p19-1452
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017).
Attention is All You Need. Advances in Neural Information Processing Systems, 30, 5998–6008.
Vinyals, O., Fortunato, M., & Jaitly, N. (2015). Pointer Networks. Proceedings of the 28th International
Conference on Neural Information Processing Systems, 2, 2692–2700.
Wang, Y., Li, S., & Yang, J. (2018). Toward fast and accurate neural discourse segmentation.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,
EMNLP 2018, 962–967. https://doi.org/10.18653/v1/d18-1116
Werbos, P. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences (Doctoral dissertation). Harvard University.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., … Rush, A. (2020). Transformers:
State-of-the-Art Natural Language Processing. Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing: System Demonstrations, 38–45.
https://doi.org/10.18653/v1/2020.emnlp-demos.6
Yan, Y., Rosales, R., Fung, G., Subramanian, R., & Dy, J. (2014). Learning from multiple annotators
with varying expertise. Machine Learning, 95(3), 291–327. https://doi.org/10.1007/s10994-013-
5412-1
Zhang, X., Wei, F., & Zhou, M. (2020). HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 5059–5069. https://doi.org/10.18653/v1/p19-1499
Zhou, M., Duan, N., Liu, S., & Shum, H.-Y. (2020). Progress in Neural NLP: Modeling, Learning, and
Reasoning. Engineering, 6(3), 275–290.
https://doi.org/10.1016/j.eng.2019.12.014
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735