
Applied Intelligence (2023) 53:5060–5071

https://doi.org/10.1007/s10489-022-03825-5

Hierarchical attention network for multivariate time series long-term forecasting

Hongjing Bi¹ · Lilei Lu¹ · Yizhen Meng¹

Accepted: 27 May 2022 / Published online: 17 June 2022


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022

Abstract
Multivariate time series long-term forecasting has long been a subject of research in fields such as economics, finance, and traffic. In recent years, attention-based recurrent neural networks (RNNs) have received attention due to their ability to reduce error accumulation. However, existing attention-based RNNs fail to eliminate the negative influence of irrelevant factors on prediction, and they ignore the conflict between exogenous factors and the target factor. To tackle these problems, we propose a novel Hierarchical Attention Network (HANet) for multivariate time series long-term forecasting. First, HANet designs a factor-aware attention network (FAN) and uses it as the first component of the encoder. FAN weakens the negative impact of irrelevant exogenous factors on predictions by assigning small weights to them. Then HANet proposes a multi-modal fusion network (MFN) as the second component of the encoder. MFN employs a specially designed multi-modal fusion gate to adaptively select how much of the representation at the current time comes from the target and exogenous factors. Experiments on two real-world datasets reveal that HANet not only outperforms state-of-the-art methods, but also provides interpretability for prediction.

Keywords Multivariate time series · Hierarchical attention · Deep neural network · Long-term forecasting · Multi-modal fusion

1 Introduction

Multivariate time series (MTS) forecasting aims to study how to predict future target values based on historical target and exogenous series [1]. In recent years, MTS forecasting has been widely used in many fields, such as traffic flow forecasting [2], pedestrian behavior prediction [3], time series anomaly detection [4], and natural disaster forecasting [5]. Undoubtedly, accurate prediction can help society operate effectively in many respects. Many scholars have devoted themselves to developing time series prediction methods, based especially on classical statistical methods such as the autoregressive integrated moving average (ARIMA) [6] and Holt-Winters [7], as well as typical machine learning models like support vector regression (SVR) [8], the deep belief network (DBN) [9], and random forests (RF) [10]. However, the above methods lack effective processing of the temporal dependence among the input series and are hard pressed to cope with the evolution of long-horizon time series [11]. In addition, most of these methods solve the one-step-ahead forecasting problem, which has limitations in practical applications. By contrast, long-term forecasting is more meaningful in actual applications, because one-step forecasting can hardly provide a decision basis for the situation after multiple steps [12].

Recently, recurrent neural networks (RNNs) and their variants, including the long short-term memory network (LSTM) [13] and the gated recurrent unit (GRU) [14], have been widely used for modeling complex time series data, such as stock price prediction [15] and cyber-physical systems [16]. Among these applications, attention-based RNNs are particularly attractive for time series forecasting [17]. Attention-based RNNs utilize two independent RNNs to encode sequential inputs into context vectors and to decode these contexts into the desired interpretations [18]. Attention is the information selection strategy proposed by Bahdanau et al. [19], which allows items from the encoder hidden states to be selected according to their importance to the decoder. Qin et al. [20] added an input attention to the encoder for capturing variable interactions. Liu et al. [21] developed a dual-stage two-phase attention-based RNN model for multivariate time series long-term forecasting.

✉ Hongjing Bi, [email protected]
¹ Department of Computer Science, Tangshan Normal University, Tangshan, Hebei 063000, People's Republic of China

Although attention-based RNNs have achieved encouraging performance in MTS long-term prediction, they have several general drawbacks. Firstly, multivariate time series data provide a target factor and a variety of exogenous factors, but some exogenous factors are only weakly correlated with the prediction task. For instance, crude oil prices are strongly correlated with gasoline prices, but contribute little to wood prices [22]. The RNN-based encoder nevertheless blends the information of all factors into a hidden state for prediction. Secondly, these methods ignore the information conflict between the target factor and the exogenous factors, which damages model accuracy. Li et al. [15] found that linking some of the factors together did not help stock price prediction, and even reduced model accuracy.

To tackle these challenges, we propose a novel Hierarchical Attention Network (HANet) for multivariate time series long-term forecasting. In particular, HANet is a hierarchically structured encoder-decoder architecture, which learns both the importance of the factors and the long-distance temporal dependence. Meanwhile, HANet employs a specially designed multi-modal fusion network (MFN) to eliminate the information conflict between the target factor and the exogenous factors. Specifically, we design a factor-aware attention network (FAN) as the first component of the encoder to eliminate the negative influence of irrelevant exogenous factors. FAN assigns appropriate weights to each factor in the exogenous series and converts the sequence into high-level semantics, thereby limiting the contribution of irrelevant factors to those semantics. To address the second challenge, we carefully design a multi-modal fusion network (MFN) as the second component of the encoder. In the encoding stage, MFN trades off how much new information the network takes from the target and exogenous factors through the specially designed multi-modal fusion gate. In addition, we introduce a temporal attention between the encoder and decoder networks, which can adaptively select relevant encoder input items across time steps to improve forecasting accuracy. The main contributions of this study are as follows:

1. We propose a novel Hierarchical Attention Network for multivariate time series long-term forecasting. As a hierarchically structured neural network, HANet learns both the importance of the factors and the long-distance temporal dependence. Meanwhile, HANet can alleviate the information conflict between target and exogenous factors.
2. We design a factor-aware attention network (FAN) to eliminate the negative influence of irrelevant exogenous factors. FAN limits the contribution of irrelevant factors to prediction by assigning small weights to them.
3. We introduce a multi-modal fusion network (MFN). MFN can alleviate the information conflict between target and exogenous factors through a specially designed multi-modal fusion gate.
4. Experiments on two datasets from the air quality and ecological fields show that HANet is very effective in time series long-term prediction and outperforms the state-of-the-art methods.

2 Related work

Time series forecasting is an important field of academic research and forms part of applications such as natural disaster forecasting [5], medical diagnosis [23], traffic flow prediction [24], and stock market analysis [25]. Statistical methods such as the autoregressive integrated moving average (ARIMA) [6] and Holt-Winters [7] are two widely used models for time series forecasting. However, these methods focus only on univariate time series and assume the series is stationary, while practical data normally do not meet this constraint [26]. Machine learning methods, such as support vector regression (SVR) [8] and random forests [10], are also important components of time series prediction models. Although these methods capture the interaction among features better, they struggle to cope with the evolution of long-horizon time series because they ignore temporal dependence [11]. In recent years, deep neural networks have been successfully applied to time series forecasting. For example, Qin et al. [27] combined ARIMA and a deep belief network (DBN) to predict the occurrence of red tides. Shin et al. [28] showed that deep neural networks have better generalization ability and higher prediction accuracy than traditional shallow neural networks. Moreover, recurrent neural networks (RNNs), especially the variants LSTM [29] and gated recurrent unit (GRU) [30], are widely used in various time series forecasting tasks, such as traffic flow forecasting [2], stock price forecasting [15], and COVID-19 prediction [23].

Taieb et al. [31] reviewed the existing long-term forecasting strategies, namely the recursive, direct, and multilevel strategies. The encoder-decoder is a seq2seq model proposed by Sutskever et al. [32], which utilizes two independent RNNs to encode sequential inputs into context vectors and decode these contexts into the desired sequence. Many scholars have devoted themselves to developing multivariate time series long-term forecasting methods with various encoder-decoder architectures [33]. For instance, Kao et al. [5] used the encoder-decoder model for multi-step flood forecasting and achieved good results. However, the performance of the encoder-decoder decreases as the input sequence grows, because the encoder compresses the input into a fixed-length vector. Fortunately, the attention mechanism can address this problem. Attention is a soft information selection strategy, which allows items from the encoder hidden states to be selected according to their importance to the decoder [34]. Attention-based RNNs have therefore further stimulated research on time series long-term prediction. Liu et al. [21] developed a dual-stage two-phase attention-based RNN (DSTP-RNN) for multivariate time series long-term forecasting. The model took into account not only the factors' spatial correlation, but also the time dependencies among the series.

3 Problem definition and notations

Multivariate time series (MTS) long-term forecasting aims to study how to predict the target series at multiple future time steps based on the historical target and exogenous series. Given n (n ≥ 1) exogenous series and one target series, we use the symbol x^k = (x^k_1, x^k_2, …, x^k_T) ∈ ℝ^T to represent the k-th exogenous series within a window of size T, and we use the symbols {x_t}_{t=1}^T = {x_1, x_2, …, x_T} and {y_t}_{t=1}^T = {y_1, y_2, …, y_T} to denote all historical exogenous and target series over the T time slices. The symbol x_t (1 ≤ t ≤ T) is a vector, where x_t = {x^1_t, x^2_t, …, x^n_t} ∈ ℝ^n and n is the number of exogenous factors at time t. The symbol y_t ∈ ℝ is the target factor at time t. The output is an estimate of the target factor for the Δ time steps after T, denoted as {ŷ_t}_{t=T+1}^{T+Δ} = {ŷ_{T+1}, ŷ_{T+2}, …, ŷ_{T+Δ}}, where Δ (Δ ≥ 1) is set according to the demands of the task. To sum up, HANet learns a nonlinear mapping from the historical exogenous series {x_t}_{t=1}^T and target series {y_t}_{t=1}^T to the estimate of the future values {ŷ_t}_{t=T+1}^{T+Δ}:

    ({x_t}_{t=1}^T, {y_t}_{t=1}^T)  --f(·)-->  {ŷ_t}_{t=T+1}^{T+Δ}        (1)

where f(·) is the nonlinear mapping function.
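To make the mapping of Eq. (1) concrete, the following sketch (ours, not the authors' released code) builds supervised windows from raw series with NumPy; the array names exo and target and the helper make_windows are illustrative.

```python
import numpy as np

def make_windows(exo, target, T=24, delta=24):
    """Slice series into (exogenous window, target window, future targets).

    exo:    array of shape (L, n) -- n exogenous factors over L time steps
    target: array of shape (L,)   -- the target factor
    Returns X (num, T, n), Y_hist (num, T), Y_future (num, delta).
    """
    X, Y_hist, Y_future = [], [], []
    for s in range(len(target) - T - delta + 1):
        X.append(exo[s:s + T])                          # {x_t}, t = 1..T
        Y_hist.append(target[s:s + T])                  # {y_t}, t = 1..T
        Y_future.append(target[s + T:s + T + delta])    # {y_t}, t = T+1..T+delta
    return np.stack(X), np.stack(Y_hist), np.stack(Y_future)
```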
4 Proposed HANet model

In this section, we introduce the proposed HANet model for the multivariate time series long-term forecasting problem. We first present an overview of the model and then detail all of its components.

4.1 An overview of HANet

As mentioned in Section 1, there are two problems to be solved in the long-term forecasting of multivariate time series: 1) eliminating the negative influence of irrelevant exogenous factors, and 2) balancing the information conflict between the target and exogenous factors. Moreover, a successful time series long-term forecasting method should be able to capture the long-term dependence between sequences. Therefore, we propose a Hierarchical Attention Network (HANet) for multivariate time series long-term forecasting. The architecture of HANet is shown in Fig. 1.

As shown in Fig. 1, HANet is a hierarchically structured encoder-decoder model. The encoder itself has a hierarchical structure that consists of a factor-aware attention network (FAN), a multi-modal fusion network (MFN), and LSTM. FAN operates on the exogenous series with the purpose of eliminating the negative effects of irrelevant factors. Specifically, FAN assigns a weight to each x^k_t ∈ x^k (1 ≤ t ≤ T) according to its importance to the prediction, and summarizes the exogenous series x_t = {x^1_t, x^2_t, …, x^n_t} into a high-level semantic x̃_t. Thus, FAN limits the contribution of irrelevant factors to the high-level semantics by assigning small weights to them. Subsequently, the high-level semantics x̃_t are fed into LSTM, which generates a hidden state h^x_t. The hidden state h^x_t is entered into the MFN along with the target factor y_t. MFN trades off how much new information the network takes from the target factor y_t and the hidden state h^x_t through a specially designed multi-modal fusion gate. The decoder is composed of LSTM and a single-layer multilayer perceptron, and is located on top of the encoder. Furthermore, we design a temporal attention (TA) and place it between the encoder and the decoder. TA acts as a bridge between decoder unit i and the encoder; its function is to select the most relevant information in the encoder for prediction. Finally, HANet leverages a single-layer perceptron to convert the state d_i into the predicted value ŷ_i (1 ≤ i ≤ Δ).

4.2 Encoding process

As shown in Fig. 1, the encoder is a hierarchically structured network which consists of a factor-aware attention network (FAN), a multi-modal fusion network (MFN), and a long short-term memory network (LSTM). In the encoding phase, FAN and MFN play different roles: the former eliminates the negative effects of irrelevant factors, while the latter balances the information conflict between the target and exogenous factors. Next, we describe the encoding process in detail.

Fig. 1 A graphical illustration of the Hierarchical Attention Network (HANet). The encoder of HANet consists of three components, i.e., a factor-aware attention network (FAN), a multi-modal fusion network (MFN), and LSTM. Here, x^k = (x^k_1, x^k_2, …, x^k_T) ∈ ℝ^T is the k-th exogenous series within a window of size T, y_t is the target factor at time step t, and x̃_t is the high-level semantic representation of x_t, where x_t = {x^1_t, x^2_t, …, x^n_t} ∈ ℝ^n is a vector with n exogenous factors at time t. MFN is the multi-modal fusion network, which generates a hidden representation z_t by fusing the target factor y_t and the hidden state h^x_t. d_i is the hidden state of decoder unit i, co_i is the context vector generated by temporal attention, and ŷ_i is the predicted value.

Standard LSTM LSTM has become an extensible and useful model for learning from sequential data [13]. LSTM contains a memory cell c_t and three gates, i.e., a forget gate f_t, an input gate i_t, and an output gate o_t. The three gates adjust the information flow into and out of the cell. In this work, we use LSTM as the first layer of FAN. Given the input series {x_t}_{t=1}^T = {x_1, x_2, …, x_T}, where x_t = {x^1_t, x^2_t, …, x^n_t} ∈ ℝ^n is a vector with n exogenous factors at time t, LSTM is applied to learn a mapping from x_t to the hidden state h_t. The calculation is defined in Eq. (2):

    i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
    f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
    o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c)
    h_t = o_t ⊙ tanh(c_t)        (2)

where the symbol ⊙ is the element-wise product and σ(x) = 1/(1 + e^{−x}) is the sigmoid function. h_t ∈ ℝ^m is the hidden state of LSTM at time t, and m is the size of the LSTM hidden unit. The parameters W_* ∈ ℝ^{m×m}, U_* ∈ ℝ^{m×m}, and b_* ∈ ℝ^m are learned during the training process. The output of each of the three gates lies between 0 and 1 after the sigmoid function. Hence, if f_t is approximately 1 and i_t approaches 0, the previous memory cell c_{t−1} is saved and passed to the current time step. Similarly, if o_t is approximately 1, the information of c_t is passed to h_t. This means the hidden state h_t captures and retains the input sequence's historical information up to the current time step.
information to the current time step.
Factor-aware attention As shown in Eq. (2), LSTM blindly blends the information of all factors into a hidden state for prediction. Therefore, the hidden state incorporates negative information about irrelevant factors. However, real-life experience shows that each factor contributes differently to the prediction. Hence, we propose a factor-aware attention network, composed of a two-layer feed-forward neural network. In the first layer, we assign appropriate weights to each factor in the exogenous series at time t. In the second layer, we aggregate these hidden features to generate a high-level semantic x̃_t ∈ ℝ^m corresponding to x_t. Typically, given the k-th attribute vector of any exogenous series at time t (i.e., x^k), we can employ the following attention mechanism:

    e^k_t = v⊤ tanh(W_e h^x_{t−1} + U_e x^k)
    α^k_t = exp(e^k_t) / ∑_{m=1}^{n} exp(e^m_t)        (3)

where v ∈ ℝ^T, W_e ∈ ℝ^{T×2m}, and U_e ∈ ℝ^{T×T} are learnable parameters, and (·)⊤ stands for matrix transpose. Here, h^x_{t−1} ∈ ℝ^m is the hidden state of the LSTM at time step t − 1, and m is the size of the LSTM unit. The attention weights are determined by the historical hidden state and the current input, and represent the impact of each exogenous factor on forecasting. With these attention weights, we can adaptively extract the important exogenous series:

    x̃_t = (α^1_t x^1_t, α^2_t x^2_t, …, α^n_t x^n_t)        (4)

Then the hidden state at time t can be updated as:

    h^x_t = LSTM(h^x_{t−1}, x̃_t)        (5)

where LSTM(·) is an LSTM unit that can be computed according to Eq. (2), with x_t replaced by the newly computed x̃_t. The symbol h^x_t ∈ ℝ^m is the output of factor-aware attention at time step t.
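Below is a sketch of the factor-aware attention step of Eqs. (3)-(5) in PyTorch. The module and tensor names are ours; following the related input attention of DA-RNN [20], we feed the concatenation [h; c] to W_e to match its 2m input dimension, which the paper's stated shapes suggest but do not spell out.

```python
import torch
import torch.nn as nn

class FactorAwareAttention(nn.Module):
    """Weights the n exogenous factors per Eqs. (3)-(4), then updates h per Eq. (5)."""
    def __init__(self, n_factors, T, m):
        super().__init__()
        self.W_e = nn.Linear(2 * m, T, bias=False)   # W_e in R^{T x 2m}
        self.U_e = nn.Linear(T, T, bias=False)       # U_e in R^{T x T}
        self.v = nn.Linear(T, 1, bias=False)         # v in R^T
        self.lstm = nn.LSTMCell(n_factors, m)

    def forward(self, x_cols, x_t, h_prev, c_prev):
        # x_cols: (batch, n, T) -- row k is the series x^k; x_t: (batch, n)
        hc = torch.cat([h_prev, c_prev], dim=-1)     # (batch, 2m), assumption as in [20]
        e = self.v(torch.tanh(self.W_e(hc).unsqueeze(1) + self.U_e(x_cols))).squeeze(-1)
        alpha = torch.softmax(e, dim=-1)             # (batch, n), Eq. (3)
        x_tilde = alpha * x_t                        # Eq. (4)
        h, c = self.lstm(x_tilde, (h_prev, c_prev))  # Eq. (5)
        return x_tilde, h, c
```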

Multi-modal fusion network (MFN) To alleviate the information conflict between target and exogenous factors, we design a multi-modal fusion network, as shown in Fig. 1. In our view, the target factor information is the most important feature and cannot be ignored; the exogenous factors are auxiliary information for understanding the dynamics of the target factor. Hence, in our model, we obtain the hidden representation of the target factor y_t by fusing it with the high-level semantics. Because the hidden state h^x_t more or less contains noisy information, we use a multi-modal fusion gate to combine features from the different signals and thus better represent the information needed to solve a particular problem. Each element of the multi-modal fusion gate lies in the range [0, 1]: it is 1 when the hidden state h^x_t helps improve the representation of the target factor y_t; otherwise, its value is 0. The fusion process can be calculated by Eq. (6):

    s_t = σ([W_y y_t : W_x h^x_t])
    u_t = s_t ⊙ tanh(U_x h^x_t)
    z_t = W_s [W_y y_t : u_t]        (6)

where [:] denotes concatenation and the symbols W_x, W_y ∈ ℝ^{m×m} are learnable parameters. The information matrix U_x ∈ ℝ^{2m×m} converts the vector h^x_t ∈ ℝ^m into a 2m-dimensional vector. The value of s_t (s_t ∈ ℝ^{2m}) is mapped to the interval [0, 1] by the logistic sigmoid function, and ⊙ is element-wise multiplication. The multi-modal fusion gate thus ignores the i-th dimension of U_x h^x_t when s^i_t = 0 (1 ≤ i ≤ 2m). The symbol W_s ∈ ℝ^{m×3m} is a parameter matrix. In the end, we fuse the hidden state h^x_t and y_t into the representation z_t ∈ ℝ^m.

To model the temporal dependency of the fused feature sequence {z_t}_{t=1}^T = (z_1, z_2, …, z_T), we utilize LSTM via Eq. (7):

    i_t = σ(W_i z_t + U_i h_{t−1} + b_i)
    f_t = σ(W_f z_t + U_f h_{t−1} + b_f)
    o_t = σ(W_o z_t + U_o h_{t−1} + b_o)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c z_t + U_c h_{t−1} + b_c)
    h_t = o_t ⊙ tanh(c_t)        (7)

where h_t ∈ ℝ^m is the LSTM hidden state at time step t, an m-dimensional vector. W_* ∈ ℝ^{m×m}, U_* ∈ ℝ^{m×m}, and b_* ∈ ℝ^m are the learnable parameters.
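A sketch of the fusion gate of Eq. (6), under the shapes stated above (s_t ∈ ℝ^{2m}, U_x ∈ ℝ^{2m×m}, W_s ∈ ℝ^{m×3m}); since y_t is a scalar, we treat W_y as an embedding of a 1-dimensional input, which is our assumption, and the layer names are ours.

```python
import torch
import torch.nn as nn

class MultiModalFusionGate(nn.Module):
    """Fuses target y_t and exogenous hidden state h_t^x into z_t, per Eq. (6)."""
    def __init__(self, m):
        super().__init__()
        self.W_y = nn.Linear(1, m, bias=False)       # embeds the scalar target y_t (assumption)
        self.W_x = nn.Linear(m, m, bias=False)
        self.U_x = nn.Linear(m, 2 * m, bias=False)   # U_x in R^{2m x m}
        self.W_s = nn.Linear(3 * m, m, bias=False)   # W_s in R^{m x 3m}

    def forward(self, y_t, h_x):
        # y_t: (batch, 1), h_x: (batch, m)
        y_emb = self.W_y(y_t)                                           # (batch, m)
        s_t = torch.sigmoid(torch.cat([y_emb, self.W_x(h_x)], dim=-1))  # gate, (batch, 2m)
        u_t = s_t * torch.tanh(self.U_x(h_x))                           # gated exogenous features
        z_t = self.W_s(torch.cat([y_emb, u_t], dim=-1))                 # fused representation, (batch, m)
        return z_t
```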

4.3 Decoding process

The decoding process is divided into two stages. In the first phase, it establishes the temporal correlation among the encoder, the previous prediction value, and the current decoder unit through temporal attention and LSTM. In the second phase, the decoder unit maps the hidden state d_i (1 ≤ i ≤ Δ) into the final prediction result ŷ_i.

Temporal attention The core component of temporal attention is the attention layer. The input of the attention layer consists of the encoder hidden state sequence {h_t}_{t=1}^T = (h_1, h_2, …, h_T) and the hidden state of the decoder. For convenience of description, we use the symbol d_i ∈ ℝ^m to represent the hidden state of decoder unit i. For each decoder unit i, attention returns the corresponding context vector co_i. The context vector co_i is the weighted sum of all hidden states of the encoder and their corresponding attention weights, as illustrated in Fig. 2.

Fig. 2 Temporal attention

In particular, given the encoder hidden state sequence {h_t}_{t=1}^T = (h_1, h_2, …, h_T) and the previous decoder hidden state d_{i−1}, the attention returns an importance score for each h_t (1 ≤ t ≤ T). The softmax function then transforms the importance score into an attention weight, which measures the similarity between h_t and d_{i−1}, i.e., the importance of h_t to d_{i−1}. The attention mechanism's output is the weighted sum of all hidden states in {h_t}_{t=1}^T, where the weights are the attention weights. The above process can be expressed by Eq. (8):

    e_{ti} = v⊤ tanh(W h_t + U d_{i−1})
    β_{ti} = exp(e_{ti}) / ∑_{k=1}^{T} exp(e_{ki})
    co_i = ∑_{t=1}^{T} β_{ti} h_t        (8)

where (·)⊤ denotes matrix transpose, d_{i−1} ∈ ℝ^m is the hidden state of decoder unit i − 1, e_{ti} is the importance score, β_{ti} is the attention weight, and co_i ∈ ℝ^m is the context vector.
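A sketch of Eq. (8) in PyTorch; the internal score dimension m and the layer names are our assumptions, as the paper does not give the shapes of W, U, and v here.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Context vector co_i from encoder states and d_{i-1}, per Eq. (8)."""
    def __init__(self, m):
        super().__init__()
        self.W = nn.Linear(m, m, bias=False)
        self.U = nn.Linear(m, m, bias=False)
        self.v = nn.Linear(m, 1, bias=False)

    def forward(self, H, d_prev):
        # H: (batch, T, m) encoder hidden states; d_prev: (batch, m)
        scores = self.v(torch.tanh(self.W(H) + self.U(d_prev).unsqueeze(1)))  # (batch, T, 1)
        beta = torch.softmax(scores, dim=1)          # attention weights over the T time steps
        co = (beta * H).sum(dim=1)                   # (batch, m) context vector
        return co, beta.squeeze(-1)
```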

Subsequently, we use the context vector co_i to update the hidden state of the LSTM. Specifically, for decoder unit i, LSTM obtains its hidden state d_i by combining the vector co_i, the previous hidden state d_{i−1}, and the predicted value ŷ_{i−1}. The above process can be expressed by Eq. (9):

    i_i = σ(W_i [ŷ_{i−1} : co_i] + U_i d_{i−1} + b_i)
    f_i = σ(W_f [ŷ_{i−1} : co_i] + U_f d_{i−1} + b_f)
    o_i = σ(W_o [ŷ_{i−1} : co_i] + U_o d_{i−1} + b_o)
    c̃_i = tanh(W_c [ŷ_{i−1} : co_i] + U_c d_{i−1} + b_c)
    c_i = f_i ⊙ c_{i−1} + i_i ⊙ c̃_i
    d_i = o_i ⊙ tanh(c_i)        (9)

where [:] is the concatenation operation and ŷ_{i−1} is the predicted value of decoder unit i − 1.

Task learning In this paper, we use a multilayer perceptron as the task learning layer of the model, with the purpose of calculating the predicted result of decoder unit i. Concretely, the predicted value ŷ_i is a transformation of d_i, which can be calculated by the following equation:

    ŷ_i = W_p d_i        (10)

where W_p ∈ ℝ^m is a learnable parameter.
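Eqs. (9)-(10) amount to one LSTM cell step on the concatenated input [ŷ_{i−1} : co_i] followed by a linear read-out; nn.LSTMCell implements the same gate algebra as Eq. (9). A sketch with our own module name:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoding step: LSTM update of Eq. (9), then the output layer of Eq. (10)."""
    def __init__(self, m):
        super().__init__()
        self.cell = nn.LSTMCell(1 + m, m)        # input is [y_hat_{i-1} : co_i]
        self.W_p = nn.Linear(m, 1, bias=False)   # W_p in R^m

    def forward(self, y_prev, co, d_prev, c_prev):
        # y_prev: (batch, 1); co, d_prev, c_prev: (batch, m)
        d, c = self.cell(torch.cat([y_prev, co], dim=-1), (d_prev, c_prev))  # Eq. (9)
        y_hat = self.W_p(d)                                                  # Eq. (10)
        return y_hat, d, c
```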
5 Experimental results and analyses

5.1 Datasets

To compare the performance of different models on various types of MTS long-term forecasting problems, we use two real-world datasets to evaluate our proposed model and the baseline methods. The datasets used in our experiments are described as follows:

Beijing PM2.5 data The dataset contains PM2.5 data and the corresponding meteorological data. We select 17,520 time series from January 1, 2013 to December 31, 2014. Each time series has 8 factors: dew point, air temperature, standard atmospheric pressure, wind direction, wind speed, hours of snow, hours of rain, and PM2.5 concentration. In this paper, PM2.5 concentration is considered the target factor. The dataset is split into a training set (80%) and a test set (20%) in chronological order.

Chlorophyll data The dataset is taken from Tongan Bay (118°12′E, 24°43′N). The monitoring period is from January 2009 to July 2017, comprising 8733 time series. Each time series has 11 factors: chlorophyll (Chl), sea surface temperature (Temp), dissolved oxygen (DO), saturated dissolved oxygen (SDO), tide, air temperature (Air_temp), standard atmospheric pressure (Press), turbidity, pH, and two meteorological wind components, denoted Wind_u and Wind_v. In this paper, chlorophyll concentration is considered the target factor. The dataset is split into a training set (90%) and a test set (10%) in chronological order.
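Both datasets are split chronologically rather than randomly, which avoids leaking future information into training. A sketch of such a split (function name ours):

```python
def chronological_split(series, train_frac=0.8):
    """Split a series into train/test by time order (no shuffling),
    e.g. train_frac=0.8 for the PM2.5 data and 0.9 for the chlorophyll data."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]
```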
5.2 Baseline approaches

Our experiments are divided into two parts. The first part compares our model with previous state-of-the-art models. The second part is an ablation experiment, which compares our model with a degraded version of our model. The specific descriptions are as follows:

Seq2Seq: A model based on the encoder-decoder framework. Kao et al. [5] applied the model to multi-step-ahead flood forecasting and achieved good accuracy. Therefore, the experiment uses it as one of the benchmark methods.
GED: An attention-based multivariate time series long-term forecasting model proposed by Xie et al. [17]. Experimental results on Bohai Sea and South China Sea surface temperature datasets showed that its performance was better than traditional methods such as SVR.
STANet: A multivariate time series forecasting method proposed by He et al. [12]. They employed the model for multistep-ahead forecasting of chlorophyll. In this work, we implement it as a baseline method.
DA-RNN: A one-step-ahead time series forecasting method proposed by Qin et al. [20]. The model introduced an attention mechanism for both the encoder and the decoder, and assumes the inputs are correlated across time. Since DA-RNN is a one-step-ahead forecasting method, we implement it for long-term prediction using a direct strategy.
DSTP-RNN: A long-term forecasting model for multivariate time series proposed by Liu et al. [21]. DSTP-RNN took into account the spatial correlation between factors and the time dependencies of the series.
HCA-GRU: A hierarchical attention-based network for sequential recommendation proposed by Cui et al. [34]. The model combined long-term dependency and the user's short-term interest. In this work, we set the time frame interval to 1 and modify the output layer by mapping the learned nonlinear combination into a scalar.
MsANet: A multivariate time series forecasting model proposed by Hu et al. [37]. MsANet employed influential attention and temporal attention to extract local dependency patterns among factors and to discover long-term patterns of time series trends.

Table 1 Performance of different methods on the Chlorophyll dataset for different prediction horizons

Methods     τ=1             τ=6             τ=12            τ=24
            MAE     RMSE    MAE     RMSE    MAE     RMSE    MAE     RMSE
Seq2Seq     0.4823  0.7764  0.6439  0.9819  0.7745  1.1408  0.9086  1.2646
DA-RNN      0.4853  0.6659  0.6297  0.8388  0.6183  0.7855  0.8285  1.1767
GED         0.4267  0.6011  0.4977  0.7035  0.5498  0.7663  0.8845  1.1904
HCA-GRU     0.4478  0.6482  0.5082  0.7083  0.5560  0.7805  0.7038  1.1523
DSTP-RNN    0.4311  0.6121  0.5874  0.7425  0.5673  0.8113  0.7221  1.1452
STANet      0.4107  0.5823  0.5077  0.6988  0.5782  0.7694  0.7577  1.1232
MsANet      0.4415  0.6333  0.4986  0.7056  0.5604  0.7739  0.7346  1.1656
HANet       0.3678  0.5689  0.4819  0.6929  0.5565  0.7569  0.6819  1.1064

Besides, we implement one degraded version of our proposed model, which is used for the ablation experiments:

HA-LSTM: In this version, we remove the MFN component. That is to say, we blend the information of all factors (including target and exogenous factors) into a high-level semantic through factor-aware attention.

5.3 Parameters and experimental settings

In this work, we set the learning rate of all methods to 0.0001. To maintain consistency, we use the same size for all LSTMs' hidden units, and conduct a grid search for the size of the LSTM hidden state over {15, 25, 35, 45}. We set the size of the time window to 24, i.e., T = 24; for each sample, the first 24 time steps are the input of the models. We compare our model with previous state-of-the-art methods on the two datasets for long-term forecasting tasks, testing all methods with horizon τ ∈ {1, 6, 12, 24} to show their effectiveness.
set the size of time window to 24, i.e., T=24. For each Where N is the number of samples, yt is the real value, and byt is
sample, the first 24 time series is the input of models. We the corresponding predicted value. The closer to 0 they are,
compare our model with previous state-of-the-art methods the higher the algorithm accuracy the model has.
on two datasets for long-term forecasting tasks. We test all
methods with horizon τ∈{1, 6, 12, 24} to show their Performance analysis In Tables 1 and 2, we show the perfor-
effectiveness. mance of different methods in different prediction horizons.
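Both metrics are one-liners; a reference sketch in NumPy:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, Eq. (11)."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (12)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```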

Table 2 Performance of different methods on the PM2.5 dataset for different prediction horizons

Methods     τ=1              τ=6              τ=12             τ=24
            MAE      RMSE    MAE      RMSE    MAE      RMSE    MAE      RMSE
Seq2Seq     14.3048  24.7627 29.0523  48.6893 45.4879  66.6200 58.7161  81.0068
DA-RNN      12.0539  21.3099 26.3209  40.0097 37.0001  54.4784 51.2154  88.6673
GED         13.6901  22.6873 25.4548  40.3872 39.8709  62.1063 56.3646  78.0019
HCA-GRU     11.1151  18.9156 25.1317  39.2720 37.9419  55.9917 50.0695  70.4730
DSTP-RNN    11.9881  20.4029 26.7392  42.0621 38.4374  56.7373 50.3837  71.3543
STANet      12.0001  19.6192 27.2872  49.8296 38.6140  58.6099 53.2927  74.0792
MsANet      13.0729  20.2910 25.1451  39.2832 37.7276  56.5410 50.3315  73.7386
HANet       11.0054  18.1005 24.6212  39.0167 36.4204  54.8148 48.3800  71.7957


Fig. 3 Performance comparisons among different methods and different datasets when the horizon is 24

Performance analysis In Tables 1 and 2, we show the performance of the different methods for different prediction horizons. The best performance is highlighted in bold. According to Tables 1 and 2, the HANet model is the most suitable among the mentioned models for the long-term forecasting task of multivariate time series (13 out of 16 cases). However, the performance of all methods decreases as the horizon increases; that is, the growing horizon affects each model to a different degree. DA-RNN is a one-step-ahead forecasting model, which needs to train a separate model for each future time step. In fact, a time series is a time-varying process, and its uncertainty increases as the time interval grows. Because of the negative effect of this uncertainty, the performance of DA-RNN decreases as the horizon increases. By contrast, the other models train only one model to implement multi-step-ahead forecasting. However, these models take the previous predicted value as the current input, so the error gradually accumulates as the horizon increases. The performance of HANet is better than that of the baseline approaches according to Tables 1 and 2; evidently, the increase of horizon has less negative impact on HANet than on the baseline methods. The experimental results also show that the performance of GED is lower than that of the other attention-based models. This is because the other models analyze time series from additional perspectives. Specifically, HA-LSTM and HCA-GRU attenuate the negative effects of irrelevant factors by distinguishing the contribution of each factor. STANet, DA-RNN, and MsANet consider the correlation among exogenous factors. DSTP-RNN not only considers the correlation among exogenous factors, but also focuses on the correlation between exogenous and target factors. For the same purpose, HANet adaptively filters out untrustworthy exogenous factors through factor-aware attention. According to Tables 1 and 2, we find that the attention-based models outperform Seq2Seq at all prediction horizons. Therefore, we conclude that the attention mechanism can effectively establish long-range temporal dependencies, thereby improving a model's long-term prediction performance.

Visualization analysis To further illustrate the performance of HANet, we visualize the experimental results. For reasons of space, we only show the situation with horizon = 24 in Fig. 3, where we plot the MAE of HANet and the other state-of-the-art models at different time steps. The x-axis indicates the time step, and the y-axis is the corresponding MAE value. Intuitively, compared with the other baseline approaches, the visualization results also support the superiority of HANet. Furthermore, we see that the performance of Seq2Seq degrades faster than that of the attention-based methods as the prediction horizon increases. This result demonstrates the effectiveness of introducing an attention mechanism.

Table 3 Ablation experiments on the Chlorophyll dataset for different prediction horizons

Methods     τ=1             τ=6             τ=12            τ=24
            MAE     RMSE    MAE     RMSE    MAE     RMSE    MAE     RMSE
Seq2Seq     0.4823  0.7764  0.6439  0.9819  0.7745  1.1408  0.9086  1.2646
GED         0.4267  0.6011  0.4977  0.7035  0.5498  0.7663  0.8845  1.1904
HA-LSTM     0.4112  0.6065  0.4961  0.7440  0.5684  0.7622  0.7070  1.1314
HANet       0.3678  0.5689  0.4819  0.6929  0.5565  0.7569  0.6819  1.1064


Table 4 Ablation experiments on the PM2.5 dataset for different prediction horizons

Methods     τ=1              τ=6              τ=12             τ=24
            MAE      RMSE    MAE      RMSE    MAE      RMSE    MAE      RMSE
Seq2Seq     14.3048  24.7627 29.0523  48.6893 45.4879  66.6200 58.7161  81.0068
GED         13.6901  22.6873 25.4548  40.3872 39.8709  62.1063 56.3646  78.0019
HA-LSTM     12.4656  21.8553 25.2919  40.7350 37.0144  55.4472 51.5809  72.2052
HANet       11.0054  18.1005 24.6212  39.0167 36.4204  54.8148 48.3800  71.7957

Ablation experiments analysis To study the performance gain of HANet's components, we conduct an ablation study by implementing one degraded version of HANet, i.e., HA-LSTM. In HA-LSTM, we remove the MFN component; therefore, HA-LSTM can be seen as GED with a FAN module added before its encoder. In Tables 3 and 4, we show the results of the ablation experiments, with the best performance highlighted in bold. According to Tables 3 and 4, the evaluation results on the two real datasets show that the prediction performance of HA-LSTM is better than that of GED, which proves the effectiveness of FAN. Besides, the performance of HANet is better than that of HA-LSTM, which shows the effectiveness of MFN.

Attention weight analysis To further analyze HANet, we visualize the weight distribution of factor-aware attention, as shown in Fig. 4. For reasons of space, we only show the {1, 6, 12, 24}-th time steps when the horizon is 24 on the chlorophyll dataset. In Fig. 4, the x-axis indicates different time steps, and the y-axis is the corresponding attention weight of each exogenous factor. The experimental results show that each exogenous factor contributes differently at a given time step (i.e., x-axis value). In other words, the HANet model not only distinguishes the importance of each factor, but also captures its dynamic changes. Moreover, the experimental results show that sea surface temperature (Temp), air temperature (Air_temp), and pH are the most important factors, which is consistent with the studies in [35]. Meanwhile, we also found that the factor-aware attention weights of irrelevant factors, such as standard atmospheric pressure (Press), are approximately 0. Evidently, our model is credible and can provide interpretability for the research object.

Fig. 4 Weight distribution of factor-aware attention


Table 5 Paired two-tailed t-tests with HANet. Confidence level = 0.05

            PM2.5                               Chlorophyll
Methods     p value   t-statistic  avg. RMSE    p value   t-statistic  avg. RMSE
Seq2Seq     0.0000    −4.5627      55.2697      0.0002    −5.8847      1.0409
DA-RNN      0.0053    −2.7922      51.1163      0.0005    −3.0309      0.8667
GED         0.0004    −3.5140      50.7957      0.0000    −4.6342      0.8153
HCA-GRU     0.0498    −1.9619      46.1631      0.0128    −2.4871      0.8223
DSTP-RNN    0.0432    −2.0220      47.6367      0.1268    −1.5271      0.8277
STANet      0.0030    −3.1048      50.5345      0.0010    −3.2949      0.7934
MsANet      0.0024    −3.0429      47.4635      0.0272    −2.5780      0.8196
HA-LSTM     0.0053    −2.7877      47.5607      0.1064    −1.6150      0.8410
HANet       –         –            45.9319      –         –            0.7813

Statistical analysis Statistical tests are used to examine the difference in forecasting performance between HANet and the baseline methods. In Table 5, we show the paired two-tailed t-test results for all methods. In addition, we calculate and compare the average RMSE over the different prediction horizons, because t-test results are easily affected by the sample size. The results prove that HANet is superior to the other state-of-the-art methods at the 5% statistical significance level on the PM2.5 data. The paired two-tailed t-tests show that certain models (i.e., DSTP-RNN and HA-LSTM) are as accurate as HANet on the chlorophyll dataset. However, the proposed HANet has a smaller average RMSE according to Table 5, and therefore better predictive performance. In summary, HANet provides better predictive performance than the other state-of-the-art forecasting methods.
hierarchically structured encoder-decoder architecture, which
5.5 Limitations of HANet

Although HANet can handle the multivariate time series long-term forecasting problem and has a certain interpretability, as demonstrated in the 'Performance analysis', 'Attention weight analysis', and 'Statistical analysis' parts, it still has some limitations:

1) To effectively improve multivariate time series long-term prediction performance, HANet consumes more computing resources than the other baseline approaches.
2) Inspired by human attention mechanisms, including the dual-stage two-phase (DSTP) model and the impact mechanism of exogenous and target factors [36], we design the FAN as a component of HANet. However, DSTP assumes that the input of the model is time-dependent, and this assumption may not always hold in practical applications.
6 Conclusion

Multivariate time series long-term forecasting has long been a subject of research in fields such as economics, finance, and traffic. In recent years, attention-based recurrent neural networks (RNNs) have received attention due to their ability to reduce error accumulation. Since RNNs blindly blend the information of the target and non-predictive variables into a hidden state for prediction, the existing attention-based RNNs cannot eliminate the negative influence of irrelevant factors. Meanwhile, these models ignore the conflict between target and exogenous factors. In this work, we propose a Hierarchical Attention Network (HANet) for multivariate time series long-term forecasting. HANet is a hierarchically structured encoder-decoder architecture, which learns both the importance of the factors and the long-distance temporal dependence. Specifically, we design a factor-aware attention network (FAN) as the first component of the encoder to eliminate the negative influence of irrelevant exogenous factors. To address the second challenge, we carefully develop a multi-modal fusion network (MFN) as the second component of the encoder. In the encoding stage, MFN trades off how much new information the network takes from the target and exogenous factors through the specially designed multi-modal fusion gate. Besides, we introduce a temporal attention between the encoder and decoder networks, which can adaptively select relevant encoder input items across time steps for accurate forecasting. Experimental results show that HANet is very effective and outperforms the state-of-the-art methods. We also visualize the weight distribution of factor-aware attention; the visualization of the attention weights shows that our model has good interpretability.

Another challenge for multivariate time series long-term forecasting is how to maintain the trend consistency between the predicted series and the original real series. In future work, we will further focus on maintaining this trend consistency at a lower computational cost.

Acknowledgments This work is supported by the Science and Technology Project of Hebei Education Department (Grant No. ZC2021030) and the Doctoral Foundation Project of Tangshan Normal University (Grant No. 2022B02).

References

1. Chen T, Yin H, Chen H, Wu L, Wang H, Zhou X, Li X (2018) TADA: trend alignment with dual-attention multi-task recurrent neural networks for sales prediction. 2018 IEEE International Conference on Data Mining (ICDM), 49–58
2. Qu L, Li W, Li W, Ma D, Wang Y (2019) Daily long-term traffic flow forecasting based on a deep neural network. Expert Syst Appl 121:304–312
3. Chen K, Song X, Han D, Sun J, Cui Y, Ren X (2020) Pedestrian behavior prediction model with a convolutional LSTM encoder–decoder. Phys A: Stat Mech Appl 560:125132
4. Shen L, Li Z, Kwok J (2020) Time series anomaly detection using temporal hierarchical one-class network. Adv Neural Inf Proces Syst 33:13016–13026
5. Kao IF, Zhou Y, Chang LC, Chang FJ (2020) Exploring a long short-term memory based encoder-decoder framework for multi-step-ahead flood forecasting. J Hydrol 583:124631
6. Hernandez-Matamoros A, Fujita H, Hayashi T, Perez-Meana H (2020) Forecasting of COVID19 per regions using ARIMA models and polynomial functions. Appl Soft Comput 96:106610
7. Syafei AD, Ramadhan N, Hermana J et al (2018) Application of exponential smoothing Holt-Winters and ARIMA models for predicting air pollutant concentrations. EnvironmentAsia 11(3)
8. Chen Y, Xu P, Chu Y, Li W, Wu Y, Ni L, Bao Y, Wang K (2017) Short-term electrical load forecasting using the support vector regression (SVR) model to calculate the demand response baseline for office buildings. Appl Energy 195:659–670
9. Kuremoto T, Kimura S, Kobayashi K, Obayashi M (2014) Time series forecasting using a deep belief network with restricted Boltzmann machines. Neurocomputing 137:47–56
10. Lahouar A, Slama JBH (2017) Hour-ahead wind power forecast based on random forests. Renew Energy 109:529–541
11. Yin C, Dai Q (2021) A deep multivariate time series multistep forecasting network. Appl Intell 52:1–19
12. He X, Shi S, Geng X, Xu L, Zhang X (2021) Spatial-temporal attention network for multistep-ahead forecasting of chlorophyll. Appl Intell 51:1–13
13. Yu Y, Si X, Hu C, Zhang J (2019) A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput 31(7):1235–1270
14. Shen G, Tan Q, Zhang H, Zeng P, Xu J (2018) Deep learning with gated recurrent unit networks for financial sequence predictions. Procedia Computer Science 131:895–903
15. Li H, Shen Y, Zhu Y (2018) Stock price prediction using attention-based multi-input LSTM. Asian Conference on Machine Learning, PMLR, 454–469
16. Muralidhar N, Muthiah S, Ramakrishnan N (2019) DyAt Nets: dynamic attention networks for state forecasting in cyber-physical systems. IJCAI, 3180–3186
17. Xie J, Zhang J, Yu J, Xu L (2019) An adaptive scale sea surface temperature predicting method based on deep learning with attention mechanism. IEEE Geosci Remote Sens Lett 17(5):740–744
18. Lu E, Hu X (2021) Image super-resolution via channel attention and spatial attention. Appl Intell 9:1–9
19. Bahdanau D, Cho KH, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. 3rd International Conference on Learning Representations, ICLR 2015
20. Qin Y, Song D, Chen H et al (2017) A dual-stage attention-based recurrent neural network for time series prediction. IJCAI
21. Liu Y, Gong C, Yang L, Chen Y (2020) DSTP-RNN: a dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. Expert Syst Appl 143:113082
22. Shih SY, Sun FK, Lee H (2019) Temporal pattern attention for multivariate time series forecasting. Mach Learn 108(8):1421–1441
23. Marques G, Agarwal D, de la Torre Díez I (2020) Automated medical diagnosis of COVID-19 through EfficientNet convolutional neural network. Appl Soft Comput 96:106691
24. Huang X, Ye Y, Wang C, Yang X, Xiong L (2021) A multi-mode traffic flow prediction method with clustering based attention convolution LSTM. Appl Intell:1–14
25. Chatzis SP, Siakoulis V, Petropoulos A, Stavroulakis E, Vlachogiannakis N (2018) Forecasting stock market crisis events using deep and statistical machine learning techniques. Expert Syst Appl 112:353–371
26. Yin J, Rao W, Yuan M et al (2019) Experimental study of multivariate time series forecasting models. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2833–2839
27. Qin M, Li Z, Du Z (2017) Red tide time series forecasting by combining ARIMA and deep belief network. Knowl-Based Syst 125:39–52
28. Shin Y, Kim T, Hong S, Lee S, Lee EJ, Hong SW, Lee CS, Kim TY, Park MS, Park J, Heo TY (2020) Prediction of chlorophyll-a concentrations in the Nakdong River using machine learning methods. Water 12(6):1822
29. Sagheer A, Kotb M (2019) Time series forecasting of petroleum production using deep LSTM recurrent networks. Neurocomputing 323:203–213
30. Xue X, Gao Y, Liu M, Sun X, Zhang W, Feng J (2021) GRU-based capsule network with an improved loss for personnel performance prediction. Appl Intell 51(7):4730–4743
31. Taieb SB, Atiya AF (2015) A bias and variance analysis for multistep-ahead time series forecasting. IEEE Trans Neural Netw Learn Syst 27(1):62–76
32. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 3104–3112
33. Ma X, He K, Zhang D, Li D (2021) PIEED: position information enhanced encoder-decoder framework for scene text recognition. Appl Intell 51(10):6698–6707
34. Cui Q, Wu S, Huang Y, Wang L (2019) A hierarchical contextual attention-based GRU network for sequential recommendation. Neurocomputing 358:141–149
35. Liu X, Feng J, Wang Y (2019) Chlorophyll a predictability and relative importance of factors governing lake phytoplankton at different timescales. Sci Total Environ 648:472–480
36. Hübner R, Steinhauser M, Lehle C (2010) A dual-stage two-phase model of selective attention. Psychol Rev 117(3):759–784

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Hongjing Bi received the Master of Engineering degree in Computer Applications Technology from Inner Mongolia University of Science and Technology, China, in 2010. She is currently a lecturer in the Department of Computer Science, Tangshan Normal University. Her research interests include privacy protection, data mining, and time series forecasting.

Lilei Lu received the Ph.D. in Software Engineering from Beijing University of Posts and Telecommunications, Beijing, China, and the M.S. in Software Engineering from Peking University, Beijing, China. She is currently an Associate Professor at the Department of Computer Science, Tangshan Normal University, Hebei, China. Her research interests include trustworthy service, cloud computing, and information security.

Yizhen Meng received a B.E. degree from North China University of Science and Technology, Tangshan, China, in 2003, and the M.S. degree in software engineering from Tongji University, Shanghai, China, in 2009. She is now a Lecturer in the Computer Science Department of Tangshan Normal University, Tangshan, China. Her research interests include data mining and image processing.
