Hierarchical Attention Network for Multivariate Time Series Long-term Forecasting
https://doi.org/10.1007/s10489-022-03825-5
Abstract
Multivariate time series long-term forecasting has always been a subject of research in fields such as economics, finance, and traffic. In recent years, attention-based recurrent neural networks (RNNs) have received attention due to their ability to reduce error accumulation. However, the existing attention-based RNNs fail to eliminate the negative influence of irrelevant factors on prediction, and they ignore the conflict between exogenous factors and the target factor. To tackle these problems, we propose a novel Hierarchical Attention Network (HANet) for multivariate time series long-term forecasting. First, HANet introduces a factor-aware attention network (FAN) as the first component of the encoder. FAN weakens the negative impact of irrelevant exogenous factors on predictions by assigning small weights to them. HANet then adds a multi-modal fusion network (MFN) as the second component of the encoder. MFN employs a specially designed multi-modal fusion gate to adaptively select how much of the representation of the current time step comes from the target factor and how much from the exogenous factors. Experiments on two real-world datasets reveal that HANet not only outperforms state-of-the-art methods, but also provides interpretability for its predictions.
Keywords  Multivariate time series · Hierarchical attention · Deep neural network · Long-term forecasting · Multi-modal fusion
several general drawbacks. Firstly, multivariate time series data provide a target factor and a variety of exogenous factors, but some exogenous factors are only weakly correlated with the prediction task. For instance, crude oil prices correlate strongly with gasoline prices but contribute little to wood prices [22]. However, the RNN-based encoder blends the information of all factors into a hidden state for prediction. Secondly, these methods ignore the information conflict between the target factor and the exogenous factors, which damages the accuracy of the model. Li et al. [15] found that linking some of the factors together did not help stock price prediction and even reduced model accuracy.

To tackle these challenges, we propose a novel Hierarchical Attention Network (HANet) for multivariate time series long-term forecasting. In particular, HANet is a hierarchically structured encoder-decoder architecture, which learns both the importance of factors and the long-distance temporal dependence. Meanwhile, HANet employs a specially designed multi-modal fusion network (MFN) to eliminate the information conflict between the target factor and the exogenous factors. Specifically, we design a factor-aware attention network (FAN) as the first component of the encoder to eliminate the negative influence of irrelevant exogenous factors. FAN assigns appropriate weights to each factor in the exogenous series and converts the sequence into high-level semantics. Thus, FAN limits the contribution of irrelevant factors to the high-level semantics by assigning small weights to them. To address the second challenge, we carefully design a multi-modal fusion network (MFN) as the second component of the encoder. In the encoding stage, MFN trades off how much new information the network takes from the target and exogenous factors through a specially designed multi-modal fusion gate. In addition, we introduce a temporal attention between the encoder and decoder networks, which can adaptively select relevant encoder input items across time steps to improve forecasting accuracy. The main contributions of this study are as follows:

1. We propose a novel Hierarchical Attention Network for multivariate time series long-term forecasting. As a hierarchically structured neural network, HANet learns both the importance of factors and the long-distance temporal dependence. Meanwhile, HANet can alleviate the information conflict between target and exogenous factors.
2. We design a factor-aware attention network (FAN) to eliminate the negative influence of irrelevant exogenous factors. FAN limits the contribution of irrelevant factors to prediction by assigning small weights to them.
3. We introduce a multi-modal fusion network (MFN). MFN can alleviate the information conflict between target and exogenous factors through a specially designed multi-modal fusion gate.
4. Experiments on datasets from two fields, air quality and ecology, show that HANet is very effective in time series long-term prediction and outperforms the state-of-the-art methods.

2 Related work

Time series forecasting is an important field of academic research and forms part of applications such as natural disaster forecasting [5], medical diagnosis [23], traffic flow prediction [24], and stock market analysis [25]. Statistical methods such as the autoregressive integrated moving average (ARIMA) [6] and Holt-Winters [7] are two widely used models for time series forecasting. However, these methods only handle univariate time series and assume the series is stationary, while practical data normally do not meet this constraint [26]. Machine learning methods, such as support vector regression (SVR) [8] and random forest [10], are also important components of time series prediction models. Although these methods capture the interaction among features better, they struggle to cope with the evolution of long-horizon time series because they ignore temporal dependence [11]. In recent years, deep neural networks have been successfully applied to time series forecasting. For example, Qin et al. [27] combined ARIMA and a deep belief network (DBN) to predict the occurrence of red tides. Shin et al. [28] showed that deep neural networks have better generalization ability and higher prediction accuracy than traditional shallow neural networks. Moreover, recurrent neural networks (RNNs), especially their variants LSTM [29] and the gated recurrent unit (GRU) [30], are widely used in various time series forecasting tasks, such as traffic flow forecasting [2], stock price forecasting [15], and COVID-19 prediction [23].

Taieb et al. [31] reviewed the existing long-term forecasting strategies, namely the recursive, direct, and multilevel strategies. The encoder-decoder is a seq2seq model proposed by Sutskever et al. [32], which utilizes two independent RNNs to encode sequential inputs into context vectors and decode these contexts into the desired sequence. Many scholars have devoted themselves to developing multivariate time series long-term forecasting methods with various encoder-decoder architectures [33]. For instance, Kao et al. [5] used the encoder-decoder model for multi-step flood forecasting and achieved good results. However, the performance of the encoder-decoder decreases as the input sequence grows, because the encoder compresses the input into a fixed vector. Fortunately, the attention mechanism can address this challenge. Attention is a soft selection strategy over information, which allows items from the encoder hidden states to be selected according to their importance to the decoder [34]. Therefore, attention-based RNNs have further stimulated research on time series long-term prediction. Liu et al. [21] developed a dual-stage two-phase attention-based RNN (DSTP-RNN) for multivariate time series long-term forecasting. The model not only took into account the factors' spatial correlation, but also the time dependencies among the series.
3 Problem definition and notations

Multivariate time series (MTS) long-term forecasting studies how to predict the target series at multiple future time steps based on the historical target and exogenous series. Given n (n ≥ 1) exogenous series and one target series, we use the symbol x^k = (x_1^k, x_2^k, …, x_T^k) ∈ ℝ^T to represent the k-th exogenous series within a window of size T, and the symbols {x_t}_{t=1}^T = {x_1, x_2, …, x_T} and {y_t}_{t=1}^T = {y_1, y_2, …, y_T} to denote all historical exogenous and target series over the T time slices. The symbol x_t (1 ≤ t ≤ T) is a vector, where x_t = {x_t^1, x_t^2, …, x_t^n} ∈ ℝ^n and n is the number of exogenous factors at time t. The symbol y_t ∈ ℝ is the target factor at time t. The output is an estimate of the target factor for the Δ time steps after T, denoted {ŷ_t}_{t=T+1}^{T+Δ} = {ŷ_{T+1}, ŷ_{T+2}, …, ŷ_{T+Δ}}, where Δ (Δ ≥ 1) is set according to the demands of the task. To sum up, HANet learns a nonlinear mapping from the historical exogenous series {x_t}_{t=1}^T and target series {y_t}_{t=1}^T to the estimate of the future values:

    f(·): ({x_t}_{t=1}^T, {y_t}_{t=1}^T) → {ŷ_t}_{t=T+1}^{T+Δ}    (1)

where f(·) is the non-linear mapping function.
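To make the mapping of Eq. (1) concrete, the sketch below shows how a raw multivariate series can be sliced into training samples of the form ({x_t}_{t=1}^T, {y_t}_{t=1}^T) → {ŷ_t}_{t=T+1}^{T+Δ}. This is an illustrative reconstruction rather than code from the paper; the function name make_samples and the toy data shapes are our own.

```python
import numpy as np

def make_samples(X, y, T=24, delta=12):
    """Slice a multivariate series into (history window, future horizon) pairs.

    X: (num_steps, n) exogenous factors; y: (num_steps,) target factor.
    Returns exogenous windows, target histories, and the future targets.
    """
    xs, ys, targets = [], [], []
    for start in range(len(y) - T - delta + 1):
        xs.append(X[start:start + T])                    # {x_t}, shape (T, n)
        ys.append(y[start:start + T])                    # {y_t}, shape (T,)
        targets.append(y[start + T:start + T + delta])   # {y_t} for t = T+1..T+delta
    return np.stack(xs), np.stack(ys), np.stack(targets)

# Toy usage: n = 8 exogenous factors, window T = 24, horizon delta = 12.
X, y = np.random.randn(1000, 8), np.random.randn(1000)
xs, ys, targets = make_samples(X, y)
print(xs.shape, ys.shape, targets.shape)  # (965, 24, 8) (965, 24) (965, 12)
```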
4 Proposed HANet model

In this section, we introduce the proposed HANet model for the multivariate time series long-term forecasting problem. We first present an overview of the model and subsequently detail all of its components.

4.1 An overview of HANet

As mentioned in Section 1, two problems remain to be solved in the long-term forecasting of multivariate time series: 1) eliminating the negative influence of irrelevant exogenous factors and 2) balancing the information conflict between target and exogenous factors. Moreover, a successful time series long-term forecasting method should be able to capture the long-term dependence between sequences. Therefore, we propose a Hierarchical Attention Network (HANet) for multivariate time series long-term forecasting. The architecture of HANet is shown in Fig. 1.

As shown in Fig. 1, HANet is a hierarchically structured encoder-decoder model. The encoder is itself a hierarchical structure that consists of the factor-aware attention network (FAN), the multi-modal fusion network (MFN), and an LSTM. FAN is executed on the exogenous series with the purpose of eliminating the negative effects of irrelevant factors. Specifically, FAN assigns different weights to x_t^k ∈ x^k (1 ≤ t ≤ T) according to its importance to the prediction, and summarizes the exogenous series x_t = {x_t^1, x_t^2, …, x_t^n} into a high-level semantic x̃_t. Thus, FAN limits the contribution of irrelevant factors to the high-level semantics by assigning small weights to them. Subsequently, the high-level semantics x̃_t are fed into the LSTM, which generates a hidden state h_t^x. The hidden state h_t^x enters the MFN along with the target factor y_t. MFN trades off how much new information the network takes from the target factor y_t and the hidden state h_t^x through a specially designed multi-modal fusion gate. The decoder is composed of an LSTM and a single-layer multilayer perceptron, and is located on top of the encoder. Furthermore, we design a temporal attention (TA) and place it between the encoder and the decoder. TA acts as a bridge between decoder unit i and the encoder; its function is to select the most relevant information in the encoder for prediction. Finally, HANet leverages a single-layer multilayer perceptron to convert the state d_i into the predicted value ŷ_i (1 ≤ i ≤ Δ).

4.2 Encoding process

As shown in Fig. 1, the encoder is a hierarchically structured network which consists of a factor-aware attention network (FAN), a multi-modal fusion network (MFN), and a long short-term memory network (LSTM). In the coding phase, FAN and MFN play different roles: the former eliminates the negative effects of irrelevant factors, while the latter balances the information conflict between the target and exogenous factors. Next, we describe the encoding process in detail.

Standard LSTM  LSTM has become an extensible and useful model for learning sequential data [13]. LSTM contains a memory cell c_t and three gates, i.e., the forget gate f_t, the input gate i_t, and the output gate o_t. The three gates adjust the information flow into and out of the cell. In this work, we use LSTM as the first layer of FAN. Given the input series {x_t}_{t=1}^T = {x_1, x_2, …, x_T}, where x_t = {x_t^1, x_t^2, …, x_t^n} ∈ ℝ^n is a vector with n exogenous factors at time t, LSTM is applied to learn a mapping from x_t to the hidden state h_t. The calculation is defined in Eq. (2):

    i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
    f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
    o_t = σ(W_o x_t + U_o h_{t−1} + b_o)                               (2)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c)
    h_t = o_t ⊙ tanh(c_t)
Fig. 1  A graphical illustration of the Hierarchical Attention Network (HANet). The encoder of HANet consists of three components, i.e., the factor-aware attention network (FAN), the multi-modal fusion network (MFN), and LSTM. Here, x^k = (x_1^k, x_2^k, …, x_T^k) ∈ ℝ^T is the k-th exogenous series within a window of size T. y_t is the target factor at time step t. x̃_t is the high-level semantic representation of x_t, where x_t = {x_t^1, x_t^2, …, x_t^n} ∈ ℝ^n is a vector with n exogenous factors at time t. MFN is the multi-modal fusion network, which generates a hidden representation z_t by fusing the target factor y_t and the hidden state h_t^x. d_i is the hidden state of decoder unit i, co_i is the context vector generated by temporal attention, and ŷ_i is the predicted value.
where the symbol ⊙ is the element-wise product and σ(x) = 1/(1 + e^{−x}) is the sigmoid function. h_t ∈ ℝ^m is the hidden state of the LSTM at time t, and m is the size of the LSTM hidden unit. The parameters W_∗ ∈ ℝ^{m×n}, U_∗ ∈ ℝ^{m×m}, and b_∗ ∈ ℝ^m are learned during training. The output of all three gates lies between 0 and 1 after the sigmoid function. Hence, if f_t is approximately 1 and i_t approaches 0, the previous memory cell c_{t−1} is preserved and passed to the current time step. Similarly, if o_t is approximately 1, the information of c_t is passed to h_t. This means the hidden state h_t captures and retains the input sequence's historical information up to the current time step.
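For readers who prefer code to notation, the following is a direct transcription of Eq. (2) as a single recurrence step. It is a minimal sketch, not the paper's implementation; we give the W_∗ matrices shape m × n so they can act on the n-dimensional input x_t, and the random weights are placeholders for trained parameters.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, p):
    """One recurrence of Eq. (2). x_t: (n,), h_prev/c_prev: (m,).
    p is a dict of weights W_* (m x n), U_* (m x m), b_* (m,)."""
    i_t = torch.sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])  # input gate
    f_t = torch.sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])  # forget gate
    o_t = torch.sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])  # output gate
    # New cell state: keep part of the old memory, write part of the candidate.
    c_t = f_t * c_prev + i_t * torch.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    h_t = o_t * torch.tanh(c_t)  # expose the gated memory as the hidden state
    return h_t, c_t

n, m = 8, 25  # factors, hidden size (25 is one value from the paper's grid search)
p = {f"{k}_{g}": torch.randn(*s) for g in "ifoc"
     for k, s in (("W", (m, n)), ("U", (m, m)), ("b", (m,)))}
h = c = torch.zeros(m)
for x_t in torch.randn(24, n):   # a window of T = 24 steps
    h, c = lstm_step(x_t, h, c, p)
```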
Factor-aware attention  As shown in Eq. (2), LSTM blindly blends the information of all factors into a hidden state for prediction. Therefore, the hidden state incorporates negative information about irrelevant factors. However, real-life experience shows that each factor contributes differently to prediction. Hence, we propose a factor-aware attention network, composed of a two-layer feed-forward neural network. In the first layer, we assign appropriate weights to each factor in the exogenous series at time t. In the second layer, we aggregate these hidden features to generate a high-level semantic x̃_t ∈ ℝ^m corresponding to x_t. Typically, given the k-th exogenous series x^k, we can employ the following attention mechanism:

    e_t^k = v^T tanh(W_e h_{t−1}^x + U_e x^k)
    α_t^k = exp(e_t^k) / Σ_{j=1}^n exp(e_t^j)            (3)

where v ∈ ℝ^T, W_e ∈ ℝ^{T×2m}, and U_e ∈ ℝ^{T×T} are learnable parameters, and (·)^T stands for the matrix transpose. Here, h_{t−1}^x ∈ ℝ^m is the hidden state of the LSTM at time step t − 1, and m is the size of the LSTM unit. The attention weights are determined by the historical hidden state and the current input, and represent the impact of each exogenous factor on forecasting. With these attention weights, we can adaptively extract the important exogenous series:

    x̃_t = (α_t^1 x_t^1, α_t^2 x_t^2, …, α_t^n x_t^n)      (4)

Then the hidden state at time t can be updated as:

    h_t^x = LSTM(h_{t−1}^x, x̃_t)                          (5)
where LSTM(·) is an LSTM unit computed according to Eq. (2), with x_t replaced by the newly computed x̃_t. The symbol h_t^x ∈ ℝ^m is the output of the factor-aware attention at time step t.
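The sketch below implements the attention of Eqs. (3)-(5) under one simplifying assumption: the printed shape W_e ∈ ℝ^{T×2m} suggests that, as in DA-RNN, the LSTM cell state may be concatenated with h_{t−1}^x; here we score against h_{t−1}^x alone and size W_e as T × m accordingly. Function and variable names are ours.

```python
import torch

def factor_attention(xk_all, h_prev, v, W_e, U_e):
    """Eq. (3): score each exogenous series x^k against the previous hidden state.

    xk_all: (n, T) holds the n exogenous series over the window.
    h_prev: (m,) is the previous FAN-LSTM hidden state h^x_{t-1}.
    Returns the attention weights alpha^k_t, shape (n,).
    """
    # e^k_t = v^T tanh(W_e h^x_{t-1} + U_e x^k), evaluated for all k at once.
    scores = torch.tanh(W_e @ h_prev + xk_all @ U_e.T) @ v
    return torch.softmax(scores, dim=0)  # normalize over the n factors

T, n, m = 24, 8, 25
v, W_e, U_e = torch.randn(T), torch.randn(T, m), torch.randn(T, T)
alpha = factor_attention(torch.randn(n, T), torch.randn(m), v, W_e, U_e)
x_t = torch.randn(n)            # the n factor values at time t
x_tilde = alpha * x_t           # Eq. (4): reweight each factor
```

The reweighted vector x̃_t then replaces x_t in the LSTM step of Eq. (2), which is exactly the update of Eq. (5).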
Multi-modal fusion network (MFN)  To alleviate the information conflict between target and exogenous factors, we design a multi-modal fusion network, as shown in Fig. 1. In our view, the target factor information is the most important feature and cannot be ignored; the exogenous factors are auxiliary information for understanding the dynamics of the target factor. Hence, in our model, we obtain the hidden representation of the target factor y_t by fusing it with the high-level semantics. Because the hidden state h_t^x more or less contains noisy information, we use a multi-modal fusion gate to combine features from the different signals and thus better represent the information needed to solve a particular problem. Each element of the multi-modal fusion gate is a scalar in the range [0, 1]. The gate is 1 when the hidden state h_t^x helps to improve the representation of the target factor y_t; otherwise, the value of the gate is 0. The fusion process is calculated by Eq. (6):

    s_t = σ([W_y y_t : W_x h_t^x])
    u_t = s_t ⊙ tanh(U_x h_t^x)                           (6)
    z_t = W_s [W_y y_t : u_t]

where [:] denotes concatenation, and W_x ∈ ℝ^{m×m} and W_y ∈ ℝ^{m×1} are learnable parameters. The information matrix U_x ∈ ℝ^{2m×m} converts the vector h_t^x ∈ ℝ^m into a 2m-dimensional vector. The value of s_t (s_t ∈ ℝ^{2m}) is mapped to the interval [0, 1] by the logistic sigmoid function, and ⊙ is element-wise multiplication. The multi-modal fusion gate thus ignores the i-th dimension of U_x h_t^x when s_t^i = 0 (1 ≤ i ≤ 2m). The symbol W_s ∈ ℝ^{m×3m} is a parameter matrix. In the end, we fuse the hidden state h_t^x and y_t into the representation z_t ∈ ℝ^m.

To model the temporal dependency of the fused feature sequence {z_t}_{t=1}^T = (z_1, z_2, …, z_T), we utilize an LSTM via Eq. (7):

    i_t = σ(W_i z_t + U_i h_{t−1} + b_i)
    f_t = σ(W_f z_t + U_f h_{t−1} + b_f)
    o_t = σ(W_o z_t + U_o h_{t−1} + b_o)                  (7)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c z_t + U_c h_{t−1} + b_c)
    h_t = o_t ⊙ tanh(c_t)
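As a worked example of Eq. (6), the sketch below fuses the scalar target y_t with the FAN hidden state h_t^x. Since y_t is a scalar, we treat W_y as an m-dimensional embedding vector so that [W_y y_t : W_x h_t^x] is 2m-dimensional, matching s_t ∈ ℝ^{2m}; this dimensioning is our reading of the garbled original, not a statement of the authors' exact shapes.

```python
import torch

def mfn_fuse(y_t, h_x, W_y, W_x, U_x, W_s):
    """Multi-modal fusion gate of Eq. (6).

    y_t: scalar target factor; h_x: (m,) FAN-LSTM hidden state h^x_t.
    The gate s_t in [0, 1]^{2m} decides how much of h_x to let through.
    """
    y_emb = W_y * y_t                                   # (m,) embedding of the scalar target
    s_t = torch.sigmoid(torch.cat([y_emb, W_x @ h_x]))  # gate over concatenated features
    u_t = s_t * torch.tanh(U_x @ h_x)                   # (2m,) gated exogenous summary
    z_t = W_s @ torch.cat([y_emb, u_t])                 # (m,) fused representation
    return z_t

m = 25
W_y, W_x = torch.randn(m), torch.randn(m, m)
U_x, W_s = torch.randn(2 * m, m), torch.randn(m, 3 * m)
z_t = mfn_fuse(torch.tensor(0.7), torch.randn(m), W_y, W_x, U_x, W_s)
print(z_t.shape)  # torch.Size([25])
```

The resulting z_t is what the second LSTM of Eq. (7) consumes in place of the raw input.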
4.3 Decoding process

The decoding process consists of two phases. In the first phase, the decoder unit obtains its hidden state by combining information from the encoder, the previous prediction value, and the current decoder unit through temporal attention and LSTM. In the second phase, the decoder unit maps the hidden state d_i (1 ≤ i ≤ Δ) into the final prediction result ŷ_i.

Temporal attention  The core component of the temporal attention is the attention layer. The input of the attention layer consists of the encoder hidden state sequence {h_t}_{t=1}^T = (h_1, h_2, …, h_T) and the hidden state of the decoder. For convenience of description, we use the symbol d_i ∈ ℝ^m to represent the hidden state of decoder unit i. For each decoder unit i, the attention returns the corresponding context vector co_i. The context vector co_i is the weighted sum of all hidden states of the encoder and their corresponding attention weights, as illustrated in Fig. 2.

In particular, given the encoder hidden state sequence {h_t}_{t=1}^T = (h_1, h_2, …, h_T) and the previous decoder hidden state d_{i−1}, the attention assigns an importance score to each h_t (1 ≤ t ≤ T). The softmax function then transforms the importance score into an attention weight. The attention weight measures the similarity between h_t and d_{i−1}, i.e., the importance of h_t to d_{i−1}. The attention mechanism's output is the weighted sum of all hidden states in {h_t}_{t=1}^T, where the weights are the attention weights. The above process can be expressed by Eq. (8):

    e_{ti} = v^T tanh(W h_t + U d_{i−1})
    β_{ti} = exp(e_{ti}) / Σ_{k=1}^T exp(e_{ki})          (8)
    co_i = Σ_{t=1}^T β_{ti} h_t

where the symbol (∗)^T denotes the matrix transpose, d_{i−1} ∈ ℝ^m is the hidden state of decoder unit i − 1, e_{ti} is the importance score, β_{ti} is the attention weight, and co_i ∈ ℝ^m is the context vector.
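A compact sketch of Eq. (8) follows. The dimension of v is not stated in the paper, so we assume v ∈ ℝ^m, and the helper name temporal_attention is ours.

```python
import torch

def temporal_attention(H, d_prev, v, W, U):
    """Eq. (8): weight encoder states H by similarity to the decoder state.

    H: (T, m) encoder hidden states h_1..h_T; d_prev: (m,) decoder state d_{i-1}.
    Returns the context vector co_i, shape (m,).
    """
    scores = torch.tanh(H @ W.T + U @ d_prev) @ v   # e_{ti} for every t, shape (T,)
    beta = torch.softmax(scores, dim=0)             # attention weights over time steps
    return beta @ H                                 # weighted sum of encoder states

T, m = 24, 25
v, W, U = torch.randn(m), torch.randn(m, m), torch.randn(m, m)
co_i = temporal_attention(torch.randn(T, m), torch.randn(m), v, W, U)
```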
Subsequently, we use the context vector co_i to update the hidden state of the LSTM. Specifically, for decoder unit i, the LSTM obtains its hidden state d_i by combining the vector co_i, the previous hidden state d_{i−1}, and the predicted value ŷ_{i−1}. The above process can be expressed by Eq. (9):

    i_i = σ(W_i [ŷ_{i−1} : co_i] + U_i d_{i−1} + b_i)
    f_i = σ(W_f [ŷ_{i−1} : co_i] + U_f d_{i−1} + b_f)
    o_i = σ(W_o [ŷ_{i−1} : co_i] + U_o d_{i−1} + b_o)     (9)
    c̃_i = tanh(W_c [ŷ_{i−1} : co_i] + U_c d_{i−1} + b_c)
    c_i = f_i ⊙ c_{i−1} + i_i ⊙ c̃_i
    d_i = o_i ⊙ tanh(c_i)

where [:] is the concatenation operation and ŷ_{i−1} is the predicted value of decoder unit i − 1.

Task learning  In this paper, we use a multilayer perceptron as the task learning layer of the model, with the purpose of calculating the predicted result of decoder unit i. Concretely, the predicted value ŷ_i is a transformation of d_i, which can be calculated by the following equation:

    ŷ_i = W_p d_i                                          (10)

where W_p ∈ ℝ^m is a learnable parameter.
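Putting Eqs. (8)-(10) together, one decoding step can be sketched as follows. The weight shapes (m × (m+1) for the W_∗ matrices, since the input is the concatenation [ŷ_{i−1} : co_i]) and the simplified context computation inside the loop are our assumptions for illustration.

```python
import torch

m = 25
p = {f"{k}_{g}": torch.randn(*s) for g in "ifoc"
     for k, s in (("W", (m, m + 1)), ("U", (m, m)), ("b", (m,)))}
W_p = torch.randn(m)  # task-learning layer of Eq. (10)

def decoder_step(y_prev, co_i, d_prev, c_prev):
    """One decoder update (Eq. 9) plus the output projection (Eq. 10)."""
    x = torch.cat([y_prev.reshape(1), co_i])  # [y_{i-1} : co_i]
    i_i = torch.sigmoid(p["W_i"] @ x + p["U_i"] @ d_prev + p["b_i"])
    f_i = torch.sigmoid(p["W_f"] @ x + p["U_f"] @ d_prev + p["b_f"])
    o_i = torch.sigmoid(p["W_o"] @ x + p["U_o"] @ d_prev + p["b_o"])
    c_i = f_i * c_prev + i_i * torch.tanh(p["W_c"] @ x + p["U_c"] @ d_prev + p["b_c"])
    d_i = o_i * torch.tanh(c_i)
    return (W_p @ d_i), d_i, c_i  # prediction y_hat_i and new decoder state

# Roll the decoder forward for Delta steps, feeding each prediction back
# as the next input (the recursive multi-step strategy used by the model).
y_hat, d, c = torch.tensor(0.0), torch.zeros(m), torch.zeros(m)
H = torch.randn(24, m)  # stand-in for the real encoder hidden states
for _ in range(12):     # Delta = 12
    co = torch.softmax(H @ d, dim=0) @ H  # simplified context (see Eq. 8 sketch)
    y_hat, d, c = decoder_step(y_hat, co, d, c)
```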
STANet: A multivariate time series forecasting method
proposed by He et al. [12]. They employed the model for
multistep-ahead forecasting of chlorophyll. In this work,
5 Experimental results and analyses we implement it as a baseline method.
DA-RNN: DA-RNN is a one-step-ahead time-series fore-
5.1 Datasets casting method proposed by Qin et al. [20]. The model
introduced an attention mechanism for both encoder and
To compare the performance of different models on various decoder. DA-RNN assumed the inputs must be correlated
types of MTS long-term forecasting problems, we use two among the time. Since DA-RNN is a one-step ahead fore-
available actual data sets to evaluate our proposed model casting method, we implement it in the long-term predic-
and baseline method. The datasets used in our experiment tion based on direct strategy.
are described as follows: DSTP-RNN: The long-term forecasting model of multi-
variate time series proposed by Liu et al. [21]. DSTP-
Beijing PM 2.5 data The dataset contains PM2.5 data and the RNN took into account the spatial correlation between
corresponding meteorological data. We select 17,520 time factors and the time dependencies of the series.
series from January 1, 2013 to December 31, 2014. Each time HCA-GRU: A hierarchical attention-based network for
series has 8 factors: dew point, air temperature, standard at- sequential recommendation proposed by Cui et al. [34].
mospheric pressure, wind direction, wind speed, hours of The model combined the long term dependency and
snow, hours of rain, PM2.5 concentration. In this paper, user’s short-term interest. In this work, we set the time
PM2.5 concentration is considered as the target factor. The frame interval to 1, and modify the output layer by map-
dataset is split into training set (80%) and test set (20%) in ping the learned nonlinear combination into a scalar.
chronological order. MsANet: A multivariate time series forecasting model
proposed by Hu et al.[37]. MsNet employed influential
Chlorophyll data The data set is taken from Tongan Bay attention and temporal attention to extract local depen-
(118o12′N,24o43′E). The monitoring period is from January dency patterns among factors and discover long-term pat-
2009 to July 2017. There are 8733 time series. Each time terns of time series trends.
Table 1  Performance of different methods compared on the Chlorophyll dataset for predictive horizons τ = 1, 6, 12, and 24
Besides, we implement one degraded version of our proposed model for the ablation experiments:

HA-LSTM: In this version, we remove the MFN component. That is to say, we blend the information of all factors (including the target and exogenous factors) into a high-level semantic through the factor-aware attention.

5.3 Parameters and experimental settings

In this work, we set the learning rate of all methods to 0.0001. To maintain consistency, we use the same size for all LSTMs' hidden units. We conduct a grid search for the size of the LSTM hidden state over {15, 25, 35, 45}. We set the size of the time window to 24, i.e., T = 24. For each sample, the first 24 time steps are the input of the models. We compare our model with previous state-of-the-art methods on two datasets for long-term forecasting tasks. We test all methods with horizons τ ∈ {1, 6, 12, 24} to show their effectiveness.

5.4 Experimental results and analysis

To measure the effectiveness of the proposed model in the long-term prediction of MTS, we adopt the root mean square error (RMSE) and the mean absolute error (MAE) to assess the forecasting performance. They are calculated as follows:

    MAE = (1/N) Σ_{t=1}^N |y_t − ŷ_t|                      (11)

and

    RMSE = √( (1/N) Σ_{t=1}^N (y_t − ŷ_t)² )               (12)

where N is the number of samples, y_t is the real value, and ŷ_t is the corresponding predicted value. The closer they are to 0, the higher the accuracy of the model.

Performance analysis  In Tables 1 and 2, we show the performance of the different methods over different prediction horizons.
Table 2  Performance of different methods compared on the PM2.5 dataset for predictive horizons τ = 1, 6, 12, and 24
The best performance is highlighted in bold. According to Tables 1 and 2, the HANet model is the most suitable among the compared models for the long-term forecasting of multivariate time series (13 out of 16 cases). However, the performance of all methods decreases as the horizon increases; that is, the growing horizon affects different models to different degrees. DA-RNN is a one-step-ahead forecasting model, which needs to train a separate model for each future time step. In fact, a time series is a time-varying process, and its uncertainty grows as the time interval increases. Owing to the negative effect of this uncertainty, the performance of DA-RNN decreases as the horizon increases. On the contrary, the other models train only one model to implement multi-step-ahead forecasting. However, these models take the previous predicted value as the current input, so the error gradually accumulates as the horizon increases. The performance of HANet is better than the baseline approaches according to Tables 1 and 2; evidently, the increase of the horizon has less negative impact on HANet than on the baseline methods. The experimental results also show that the performance of GED is lower than the other attention-based models. This is because the other models analyze time series from different perspectives. Specifically, HA-LSTM and HCA-GRU attenuate the negative effects of irrelevant factors by distinguishing the contribution of each factor. STANet, DA-RNN, and MsANet consider the correlation among exogenous factors. DSTP-RNN not only considers the correlation among exogenous factors, but also focuses on the correlation between exogenous and target factors. For the same purpose, HANet adaptively filters out untrustworthy exogenous factors through the factor-aware attention. According to Tables 1 and 2, we find that the attention-based models outperform Seq2Seq across prediction horizons. Therefore, we conclude that the attention mechanism can effectively establish long-range temporal dependencies, thereby improving the model's long-term prediction performance.

Visualization analysis  To further illustrate the performance of HANet, we visualize the experimental results. For space reasons, we only show the situation of horizon = 24 in Fig. 3, which plots the MAE of HANet and the other state-of-the-art models at different time steps. Here, the x-axis indicates the time step, and the y-axis is the corresponding MAE value. Intuitively, compared with the other baseline approaches, the visualization results also support the superiority of HANet. Furthermore, we see that the performance of Seq2Seq degrades faster than the other attention-based methods as the predicted horizon increases. This result demonstrates the effectiveness of introducing an attention mechanism.
Ablation experiment analysis  To study the performance gain of HANet's components, we conduct an ablation study by implementing one degraded version of HANet, i.e., HA-LSTM. In HA-LSTM, we remove the MFN component; HA-LSTM can therefore be seen as GED with a FAN module added before its encoder. In Tables 3 and 4, we show the results of the ablation experiments, with the best performance highlighted in bold. According to Tables 3 and 4, the evaluation results on the two real datasets show that the prediction performance of HA-LSTM is better than that of GED, which proves the effectiveness of FAN. Besides, the performance of HANet is better than that of HA-LSTM, which shows the effectiveness of MFN.

Attention weight analysis  To further analyze HANet, we visualize the weight distribution of the factor-aware attention, as shown in Fig. 4. For reasons of space, we only show the {1, 6, 12, 24}-th time steps when the horizon is 24 on the chlorophyll dataset. In Fig. 4, the x-axis indicates the time step, and the y-axis is the corresponding attention weight of each exogenous factor. The experimental results show that each exogenous factor contributes differently at a given time step (i.e., the x-axis value). In other words, the HANet model not only distinguishes the importance of each factor, but also captures its dynamic changes. Moreover, the experimental results show that Sea surface temperature (Temp), Air temperature (Air_temp), and pH are the most important factors, which is consistent with the studies in [35]. Meanwhile, we also find that the weights the factor-aware attention assigns to irrelevant factors, such as Standard atmospheric pressure (Press), are approximately 0. Evidently, our model is credible and can provide interpretability for the research object.

Statistical analysis  Statistical tests are used to examine the difference in forecasting performance between HANet and the baseline methods. In Table 5, we show the paired two-tailed t-test results for all methods. In addition, we calculate and compare the average RMSE for the different prediction horizons, because the t-test results are easily affected by the sample size. The results prove that HANet is superior to the other state-of-the-art methods at the 5% statistical significance level on the PM2.5 data. The paired two-tailed t-tests show that certain models (i.e., DSTP-RNN and HA-LSTM) are as accurate as HANet on the chlorophyll dataset; however, the proposed HANet has a smaller average RMSE according to Table 5, and therefore better predictive performance. In summary, HANet provides better predictive performance than the other state-of-the-art forecasting methods.
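To make the testing procedure concrete, the snippet below runs a paired two-tailed t-test with SciPy on per-sample errors of two models; the error values are invented for illustration and are not taken from the paper's experiments.

```python
from scipy import stats
import numpy as np

# Per-sample absolute errors of HANet and one baseline on the same test set
# (hypothetical numbers; the paper reports tests on the real datasets).
err_hanet = np.array([0.21, 0.18, 0.25, 0.30, 0.22, 0.19, 0.27, 0.24])
err_base = np.array([0.26, 0.22, 0.31, 0.29, 0.28, 0.25, 0.33, 0.27])

t_stat, p_value = stats.ttest_rel(err_hanet, err_base)  # paired two-tailed t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# p < 0.05 would indicate a significant accuracy difference at the 5% level.
```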
5.5 Limitations of HANet

Although HANet can handle the multivariate time series long-term forecasting problem and has a certain interpretability, as demonstrated in the 'Performance analysis', 'Attention weight analysis', and 'Statistical analysis' parts, it still has some limitations:

1) To effectively improve multivariate time series long-term prediction performance, HANet consumes more computing resources than the baseline approaches.
2) Inspired by the human attention mechanism, including the dual-stage two-phase (DSTP) model and the impact mechanism of exogenous and target factors [36], we designed FAN as a component of HANet. However, DSTP assumes that the input of the model is time-dependent, and this assumption may not always hold in practical applications.

6 Conclusion

Multivariate time series long-term forecasting has long been a subject of research in many fields. In recent years, attention-based RNNs have received attention due to their ability to reduce error accumulation. Since RNNs blindly blend the information of the target and non-predictive variables into a hidden state for prediction, the existing attention-based RNNs cannot eliminate the negative influence of irrelevant factors; meanwhile, these models ignore the conflict between target and exogenous factors. In this work, we propose a Hierarchical Attention Network (HANet) for multivariate time series long-term forecasting. HANet is a hierarchically structured encoder-decoder architecture, which learns both the importance of factors and the long-distance temporal dependence. Specifically, we design a factor-aware attention network (FAN) as the first component of the encoder to eliminate the negative influence of irrelevant exogenous factors. To address the second challenge, we carefully develop a multi-modal fusion network (MFN) as the second component of the encoder. In the encoding stage, MFN trades off how much new information the network takes from the target and exogenous factors through a specially designed multi-modal fusion gate. Besides, we introduce a temporal attention between the encoder and decoder networks, which can adaptively select relevant encoder input items across time steps for accurate forecasting. Experimental results show that HANet is very effective and outperforms the state-of-the-art methods. We also visualize the weight distribution of the factor-aware attention; the visualization shows that our model has good interpretability.

Another challenge for multivariate time series long-term forecasting is how to maintain the trend consistency between the predicted series and the original real series. In future work, we will focus on maintaining this trend consistency at lower computational cost.