

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI

A Location-Velocity-Temporal Attention
LSTM Model for Pedestrian
Trajectory Prediction
HAO XUE1, DU Q. HUYNH1 (Senior Member, IEEE), AND MARK REYNOLDS1 (Member, IEEE)
1 Department of Computer Science and Software Engineering, The University of Western Australia, Perth, 6009, Australia
(e-mail: [email protected], {du.huynh, mark.reynolds}@uwa.edu.au)
Corresponding author: Hao Xue (e-mail: [email protected]).
Hao Xue is supported by an International Postgraduate Research Scholarship (IPRS) at UWA. We gratefully acknowledge the support from
NVIDIA Corporation of a Titan Xp GPU used in this research.

ABSTRACT Pedestrian trajectory prediction is fundamental to a wide range of scientific research work
and industrial applications. Most of the current advanced trajectory prediction methods incorporate context
information such as pedestrian neighbourhood, labelled static obstacles, and the background scene into
the trajectory prediction process. In contrast to these methods which require rich contexts, the method
in our paper focuses on predicting a pedestrian’s future trajectory using his/her observed part of the
trajectory only. Our method, which we refer to as LVTA, is a Location-Velocity-Temporal Attention
LSTM model where two temporal attention mechanisms are applied to the hidden state vectors from the
location and velocity LSTM layers. In addition, a location-velocity attention layer embedded inside a
tweak module is used to improve the predicted location and velocity coordinates before they are passed
to the next time step. Extensive experiments conducted on three large benchmark datasets and comparison
with eleven existing trajectory prediction methods demonstrate that LVTA achieves competitive prediction
performance. Specifically, LVTA attains 9.19 pixels Average Displacement Error (ADE) and 17.28 pixels
Final Displacement Error (FDE) for the Central Station dataset, and 0.46 metres ADE and 0.92 metres
FDE for the ETH&UCY datasets. Furthermore, evaluation of LVTA on generating trajectories of
different prediction lengths and on new scenes without the need for retraining confirms that it has good
generalizability.

INDEX TERMS Pedestrian trajectory prediction, long short-term memory (LSTM), attention models,
human movement analysis.

I. INTRODUCTION
Trajectory prediction is essential for a wide range of applications such as forecasting trajectories of vulnerable road users in traffic environments [1] and location based services [2, 3]. It is also an important component for Advanced Driver Assistance Systems (ADAS) and autonomous vehicles [4].

One way to predict pedestrians' trajectories in a scene is to model the physics of the human movement patterns. A classical paper is the social force model (SFM) [5], which uses two different types of forces to capture these patterns: the attractive forces, which pull people towards their destinations, and the repulsive forces, which keep people apart and away from obstacles in the scene. In the last few years, there has been a surge of interest in pedestrian trajectory prediction and various new methods have been reported. Methods that were proposed prior to 2015 follow the SFM approach by exploring the physical aspects of crowd movement; for instance, by minimizing collisions among pedestrians [6], by modelling the pedestrian dynamics in terms of interaction forces and potential energies of particles [7], by taking into account small obstacles in the scene, like vending machines and dumpsters, that influence the pedestrian trajectories [8], or by modelling human motion using ensemble Kalman filtering [9]. With the booming of data-driven deep learning networks, we see a vast growth recently in the literature of pedestrian trajectory prediction, focusing on the use of Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) networks [10-21].
However, the concept of modelling the human-human interaction and human-scene relationship remains in these new methods and has been implemented in the form of social pooling layers or social pooling modules of the network [11, 22, 23]. Some of these new methods incorporate additional information like the head poses of the pedestrians [17] or give a more explicit treatment to the scene contexts such as static obstacles [24], labelled entrance/exit regions [25], and even the whole background scenes [12, 18].

While the methods mentioned above produce promising prediction results, they require the neighbouring pedestrians to be captured alongside each person of interest (POI) both spatially and temporally. Not only does this requirement increase the computation time and storage space, a proper neighbourhood size around the POI must also be defined for each scene in order for these methods to produce their best performance. This further reduces the generalizability of these methods, i.e., the methods cannot be directly applied to scenes on which they were not trained. Obviously, if scene information is not available, methods that require scene information as input cannot be applied either.

Contrary to the methods outlined above, simpler yet similarly effective methods have also been studied. One example is the trajectory prediction method of Nikhil and Morris [14]. To achieve real-time performance, their method uses parallelizable convolutional layers and incorporates no social or scene information. Another example is the method of Schöller et al. [26], where the authors revisit and use the simple constant velocity model to predict the relative displacements between consecutive location points. In our previous work [27], apart from the velocity information computed directly from the input trajectories, our joint Location Velocity Attention LSTM based network (LVA) also requires no neighbourhood or scene information. For ADAS and driverless vehicle applications where computation resources are limited, these simpler methods are preferable.

In this paper, we extend our previous LVA method by incorporating two attention mechanisms, which we refer to as temporal attention, to appropriately weight the hidden state vectors output by the location-LSTM and velocity-LSTM layers of the network. We name our proposed trajectory prediction method LVTA, where 'T' stands for the added temporal attention mechanism. Our LVTA method has the advantage that the prediction process only depends on the trajectory of the POI. Neither scene information nor neighbouring trajectories is required in LVTA. Instead, a tweak module is used to fuse the location and velocity information captured in the observed part of the POI's trajectory. Because of that, the architectures of both methods do not have pooling layers or pooling modules as used in [11, 23, 22]. As shown in our previous work [27], LVA already has good prediction performance; our extensive experiments confirm that the inclusion of the temporal attention mechanisms significantly improves the performance of LVTA and its generalizability. Specifically, our proposed LVTA outperforms LVA when both methods are trained and tested on the same scene (dataset). In addition, LVTA has better generalizability, as demonstrated by its superior performance on forecasting predictions on a new, unseen scene. For different prediction lengths of trajectories, LVTA also consistently outperforms LVA. Furthermore, compared to several recent trajectory prediction methods, LVTA achieves state-of-the-art prediction performance on two large benchmark datasets.

In summary, our research contributions are:
• Our proposed architecture has two LSTM layers to capture the embeddings of location and velocity coordinates of trajectories. It does not rely on scene information and has good generalizability.
• Our architecture has a module, which includes a location-velocity attention layer, to tweak the outputs from the LSTM layers. As demonstrated in our ablation study, the tweak module helps to give significant improvement to the prediction results.
• The temporal attention mechanism incorporated in our LVTA method is inspired by the work in machine translation. It captures the relationship of the hidden state vectors between the observed and predicted parts of trajectories. Our experiments show that temporal attention helps further improve the prediction performance.

The rest of the paper is organized as follows. Section II gives an overview of the related work on trajectory prediction. Section III details our LVTA architecture and the two main attention mechanisms. Section IV begins with an outline of the datasets and the metrics used in the experiments. Detailed implementation, including hyperparameter tuning, ablation study, generalizability study, and comparison with state-of-the-art methods, takes up a large part of this section. Also included in the section are the computation times of LVA and LVTA. Finally, the paper is concluded in Section V.

II. RELATED WORK
In this section, we give a brief review of the literature on pedestrian trajectory prediction, focusing especially on techniques that we compare with our proposed method. A large number of existing methods employ context information such as human-human interaction (e.g., [28, 11, 16, 22, 20]) and/or human-space interaction (e.g., [24, 29, 12, 18, 21, 30]). Similar to our work, there are also methods that incorporate attention mechanisms. We therefore group existing methods into two broad categories, namely methods with context and methods with attention, in the two subsections below.

A. TRAJECTORY PREDICTION WITH CONTEXT
The Social LSTM model of Alahi et al. [11] is one of the early trajectory prediction methods based on the sequence generation model from Graves et al. [31]. Their model combines the behaviour of other people within a local neighbourhood of the POI. Another piece of early work is the behaviour-CNN method of Yi et al. [32], where the pedestrians' walking paths are encoded and predicted in the form of 3D displacement volumes using a number of convolutional layers.
As each displacement volume contains all pedestrians at the same time period, the behaviour-CNN is able to capture their influence on each other. Following the work of Alahi et al. [11], and with the success of the Generative Adversarial Network (GAN) [33] in other applications [34, 35], Gupta et al. [22] propose the Social GAN model (abbreviated as SGAN) to generate multiple trajectory predictions for each input observed trajectory. Their network includes a pooling module to expand the neighbourhood around each POI to cover the whole scene so all the pedestrians can be considered in the training and prediction processes. This effectively expands the local neighbourhood context to a global level. The Social-Aware Generative Adversarial Imitation Learning (SA-GAIL) method of Zou et al. [25] is another example based on GAN. The authors combined a collision avoidance regularization and the Social LSTM into the trajectory prediction process. Also using the LSTM architecture, the SR-LSTM method [20] handles pedestrians' interaction as a message passing process among them.

Incorporating scene context as well in the trajectory prediction process often requires the coding of the scene features as part of the network architecture; for example, apart from encoding the neighbourhood information, Xue et al. [18] use a deep CNN to encode scene context in their hierarchical LSTM network; Liang et al. [21] use a pretrained scene segmentation model to extract a number of semantic scene classes and compute their scene CNN features using two convolutional layers in their network. Similar to Liang et al.'s work, SoPhie [30] also extracts semantic scene features but it uses the raw features from the VGGnet-19 network and projects them to a lower dimension. Like the SGAN model [22], SoPhie uses GAN to generate multiple future paths for each input trajectory.

B. TRAJECTORY PREDICTION INCLUDING ATTENTION
The effectiveness of the attention mechanism was first demonstrated by Bahdanau et al. [36] in the neural machine translation task. Luong et al. [37] later improved the machine translation performance by designing different score functions (e.g., the dot, general, and concat score functions) for calculating the attention scores. Since then, attention based deep learning models have been widely used in other tasks, such as image captioning [38], video captioning [39], action recognition [40, 41], person re-identification [42, 43], and time series data classification [44, 45].

In the area of trajectory prediction, attention mechanisms have been used for capturing the relative importances of neighbours around a person [46] and for learning the embedding information of a person's own trajectory plus the context information of surrounding neighbours [47]. Different from the above two papers, Sadeghian et al. [48] focus on a problem of a wider scope: one that involves both pedestrians and vehicles. The authors use the term agent to denote a human or a vehicle in a scene. An attention mechanism was used in their network for input scene images to highlight the important regions for each agent's future path. For the SoPhie method [30] mentioned above, two attention modules are used to deal with scene context and social interactions.

Our LVTA differs from the methods reviewed above in that it does not require neighbouring and scene information when the method predicts the future trajectory of each POI. It uses both temporal attention and location-velocity attention mechanisms on the trajectory embeddings. This is different from the attention mechanisms used in [46, 47, 30].

III. METHODOLOGY
A. PROBLEM FORMULATION
We represent the trajectory of the ith pedestrian as a time sequence of two dimensional coordinates (x_t^i, y_t^i), obtained via a tracking method or manual labelling. The coordinates can be in metres on the ground or in image pixel units. The aim of trajectory prediction is to use the observed locations of the ith pedestrian, for all i, from time t = 1 to t = Tobs to predict locations from t = Tobs + 1 to t = Tobs + Tpred, where Tobs and Tpred denote, respectively, the lengths of the observed and predicted trajectories. We represent these two segments of each trajectory as

  X_obs^i = ((x_1^i, y_1^i), ..., (x_Tobs^i, y_Tobs^i))  and
  X_pred^i = ((x_{Tobs+1}^i, y_{Tobs+1}^i), ..., (x_{Tobs+Tpred}^i, y_{Tobs+Tpred}^i)).

From the observed trajectory X_obs^i, the velocity information U_obs^i = ((u_1^i, v_1^i), ..., (u_Tobs^i, v_Tobs^i)) is obtained from the finite differences of X_obs^i over the time steps. We duplicate the first velocity term (u_1^i, v_1^i) so that U_obs^i has the same number of time steps as X_obs^i. The velocity information computed above is equivalent to the relative coordinates used in some implementations [49] in the literature. In this paper, our interest is to generate the predicted trajectory X_pred^i based on the combined location and velocity information (X_obs^i, U_obs^i) of the observed trajectory. To simplify the explanation in the later subsections, the superscript i is dropped from hereon.
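As an illustration of the velocity computation just described, the sketch below derives U_obs from X_obs by finite differences and duplicates the first velocity term. This is our own illustrative code, not part of the authors' released implementation; the function name and tensor layout are assumptions.

```python
import torch

def velocities_from_locations(X_obs: torch.Tensor) -> torch.Tensor:
    """Compute U_obs from X_obs by finite differences over the time steps.

    X_obs: tensor of shape (T_obs, 2) holding (x_t, y_t) for t = 1, ..., T_obs.
    Returns a tensor of the same shape holding (u_t, v_t); the first velocity
    term is duplicated so that U_obs has T_obs time steps as well.
    """
    diffs = X_obs[1:] - X_obs[:-1]            # (T_obs - 1, 2) finite differences
    first = diffs[:1]                         # duplicated (u_1, v_1)
    return torch.cat([first, diffs], dim=0)   # (T_obs, 2)

# Example: a straight-line observed trajectory of 9 time steps
X_obs = torch.stack([torch.arange(9.0), 2.0 * torch.arange(9.0)], dim=1)
U_obs = velocities_from_locations(X_obs)      # every row equals (1.0, 2.0)
```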
B. THE ARCHITECTURE OF LVTA
As shown in Figure 1, two separate LSTM layers are used in the proposed LVTA architecture: a location LSTM layer and a velocity LSTM layer. These two layers, each of which is of hidden dimension Nh, process the embedding vectors e_t^l and e_t^v of the location and velocity coordinates of the observed trajectories in parallel. To simplify the visualization, the embedding layers of embedding size Ne are omitted in the figure. Thus, starting with the location and velocity coordinates (x_t, y_t) and (u_t, v_t) at time t, we have the series of equations listed below:

  e_t^l = φ_l(x_t, y_t; W_el),                  (1)
  h_{t+1}^l = LSTM_l(h_t^l, e_t^l; W^l),        (2)
  (x̂_t, ŷ_t)^⊤ = W_o^l h_t^l + b_o^l,          (3)
  e_t^v = φ_v(u_t, v_t; W_ev),                  (4)
  h_{t+1}^v = LSTM_v(h_t^v, e_t^v; W^v),        (5)
  (û_t, v̂_t)^⊤ = W_o^v h_t^v + b_o^v.          (6)

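As an illustration, the observation-phase recursion of Eqs. (1)-(6) could be implemented along the following lines. This is our own sketch, not the authors' released code: the class and variable names are ours, and the ReLU nonlinearity on the embeddings is an assumption, as the text does not specify one.

```python
import torch
import torch.nn as nn

class LVEncoder(nn.Module):
    """Observation-phase recursion of Eqs. (1)-(6): two parallel LSTM streams."""

    def __init__(self, embed_size: int = 128, hidden_size: int = 128):
        super().__init__()
        self.embed_loc = nn.Linear(2, embed_size)              # phi_l with weights W_el
        self.embed_vel = nn.Linear(2, embed_size)              # phi_v with weights W_ev
        self.lstm_loc = nn.LSTMCell(embed_size, hidden_size)   # LSTM_l
        self.lstm_vel = nn.LSTMCell(embed_size, hidden_size)   # LSTM_v
        self.out_loc = nn.Linear(hidden_size, 2)               # W_o^l, b_o^l in Eq. (3)
        self.out_vel = nn.Linear(hidden_size, 2)               # W_o^v, b_o^v in Eq. (6)

    def forward(self, X_obs: torch.Tensor, U_obs: torch.Tensor):
        """X_obs, U_obs: (batch, T_obs, 2).

        Returns the per-step location and velocity hidden states (later used by
        the temporal attention) and the first predicted location and velocity."""
        batch, T_obs, _ = X_obs.shape
        h_l = X_obs.new_zeros(batch, self.out_loc.in_features)
        c_l = torch.zeros_like(h_l)
        h_v = torch.zeros_like(h_l)
        c_v = torch.zeros_like(h_l)
        loc_hiddens, vel_hiddens = [], []
        for t in range(T_obs):
            e_l = torch.relu(self.embed_loc(X_obs[:, t]))   # Eq. (1), ReLU assumed
            e_v = torch.relu(self.embed_vel(U_obs[:, t]))   # Eq. (4), ReLU assumed
            h_l, c_l = self.lstm_loc(e_l, (h_l, c_l))       # Eq. (2)
            h_v, c_v = self.lstm_vel(e_v, (h_v, c_v))       # Eq. (5)
            loc_hiddens.append(h_l)
            vel_hiddens.append(h_v)
        xy_hat = self.out_loc(h_l)   # Eq. (3) applied at the last observed step
        uv_hat = self.out_vel(h_v)   # Eq. (6) applied at the last observed step
        return torch.stack(loc_hiddens, 1), torch.stack(vel_hiddens, 1), xy_hat, uv_hat
```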

FIGURE 1. Our proposed LVTA network. Two LSTM layers are used for the location and the velocity embeddings separately. For each LSTM layer, a temporal attention mechanism is used to generate context vectors. In the prediction phase, the outputs from the location and velocity LSTM layers are modified by a tweak module before they are passed to the next time step. To simplify the visualization, the embedding vectors e_t^l and e_t^v, for all t, are not shown. (In the figure, x_t is the condensed notation for (x_t, y_t); u_t is the condensed notation for (u_t, v_t).)

In Eqs. (1)-(6), LSTM_l(·) and LSTM_v(·) represent the location LSTM layer and velocity LSTM layer (the white and blue boxes in Figure 1). Functions φ_l(·) and φ_v(·) are embedding functions for the location and the velocity coordinates. The W terms are the different weight matrices, e.g., W_el and W_ev denote the weight matrices of the two embedding layers; W^l and W^v denote the weight matrices of the two LSTM layers. Similarly, all the b terms denote the bias vectors and the h terms denote the hidden states of the LSTM layers.

The embedding layer and the two LSTM layers in the network are designed to capture the latent representation of trajectories and the complex pedestrian movement patterns in the scene. The network also includes two attention mechanisms (red boxes in Figure 1) for the location and velocity LSTM layers. The job of these attention mechanisms is to combine the latent information captured in the observation phase of each trajectory to yield better predicted trajectories in the prediction phase.

In the observation phase (t = 1 to Tobs), (x_t, y_t) and (u_t, v_t) are directly acquired from each observed trajectory. In the prediction phase (t = Tobs + 1 to Tobs + Tpred), the inputs at time step t come from the tweak module, which operates on the inputs and predicted outputs at time step t−1. The subsections below detail the attention mechanisms of the LVTA architecture.

C. TEMPORAL ATTENTION
The temporal attention mechanism captures the relationships between a time step of the prediction phase and different time steps of the observed part of the input trajectory. In a nutshell, it outputs a different weight for each of these relationships based on the input that is passed to it. As a result, this extra information in temporal attention is able to help produce more accurate prediction and improve the robustness in predicting trajectories of different lengths.

To simplify the description, the remaining part of this subsection focuses only on the temporal attention mechanism for the location LSTM layer. The mechanism for the velocity LSTM layer is similar.

Following the use of context vectors in machine translation [36] and in trajectory prediction [47] to capture different attentions at different time steps of the input sequence, we introduce the context vector c_t^l to the LVTA architecture. For the trajectory observation phase, the hidden states h^l = (h_1^l, ..., h_Tobs^l) are computed from Eq. (2). During the trajectory prediction phase (Tobs + 1 ≤ t ≤ Tpred), Eq. (2) becomes:

  h_{t+1}^l = LSTM_l(h_t^l, c_t^l, e_t^l; W^l),    (7)

where the context vector c_t^l is defined in terms of all the hidden states in the observation phase, i.e.,

  c_t^l = Σ_{s=1}^{Tobs} β_{s,t}^l h_s^l.          (8)

All the β_{s,t}^l terms in the equation above are the attention weights between time step s ≤ Tobs and time step t > Tobs that need to be computed to yield a good estimate of each c_t^l. They are defined in terms of the hidden state vectors. In the paper of Luong et al. [37], the authors describe three different score functions to model the relationship between h_s^l and h_t^l. To get the value of β_{s,t}^l, we adopt their general score function given by f_T below:

  f_T(h_s^l, h_t^l) = (h_s^l)^⊤ W_T^l h_t^l        (9)

as we found it suitable in capturing the correlation between different components of h_t. Here, W_T^l ∈ R^{Nh×Nh} is a weight matrix that needs to be trained. The softmax function is then used to normalize all the β_{s,t}^l terms so that Σ_{s=1}^{Tobs} β_{s,t}^l = 1, i.e.,

  β_{s,t}^l = exp(f_T(h_s^l, h_{t−1}^l)) / Σ_{k=1}^{Tobs} exp(f_T(h_{t−1}^l, h_k^l)).    (10)

The LVTA architecture thus has two extra weight matrices, W_T^l and W_T^v, that require training in the training phase for the location and velocity temporal attention mechanisms.

D. THE TWEAK MODULE
Our LVTA architecture differs from the conventional LSTM sequence generation models such as the widely used Seq2Seq model [50] and the Encoder-Decoder model [51]. In conventional LSTM models, the output of the LSTM network at the last time step is directly used as input for the next time step in the decoder phase. In our proposed LVTA architecture, each time step produces two outputs: location coordinates and velocity coordinates. We can therefore use a tweak module (yellow boxes in Figures 1 and 2) to refine them at time step t before passing them on to time step t + 1.

FIGURE 2. Details of the tweak module shown in Figure 1. It has three layers: a location-velocity attention layer, a softmax layer, and a tweak layer.

Let (x̂_t, ŷ_t) and (û_t, v̂_t) be the output location and velocity coordinates predicted by the LSTM_l and LSTM_v layers at time step t. Rather than using them as inputs for time t + 1, they are passed to the tweak module. The role of the tweak module is to feed (x̂_t, ŷ_t) and (û_t, v̂_t) through a few layers to yield better location and velocity coordinates (x_{t+1}, y_{t+1}) and (u_{t+1}, v_{t+1}) for the next time step. The tweak module consists of three layers (Figure 2):
• The location-velocity (LV) attention layer, denoted by f_LV, is implemented as a linear fully connected layer with 4 input neurons and 2 output neurons. If a more complex model is desired, the attention layer can be easily replaced by, for instance, a multilayer perceptron.
• The output from the LV attention layer is then passed through a softmax activation layer to yield two weight parameters, α_t^l and α_t^v, as follows:

  (α_t^l, α_t^v) = softmax(f_LV(x̂_t, ŷ_t, û_t, v̂_t)).    (11)

  The softmax function is used so that α_t^l and α_t^v can be treated as probability values, i.e., 0 ≤ α_t^l, α_t^v ≤ 1, and α_t^l + α_t^v = 1.
• With α_t^l and α_t^v from Eq. (11) above and input location coordinates (x_t, y_t), the job of the final tweak layer is to compute the input location and velocity coordinates for time step t + 1 using the following equations:

  x_{t+1} = α_t^l x̂_t + α_t^v (x_t + û_t),    (12)
  y_{t+1} = α_t^l ŷ_t + α_t^v (y_t + v̂_t),    (13)
  u_{t+1} = x_{t+1} − x_t,                     (14)
  v_{t+1} = y_{t+1} − y_t.                     (15)

The location-velocity attention mechanism implemented in the tweak module described above allows the model to learn α_t^l and α_t^v through the model training process, update (x_t, y_t) and (u_t, v_t) at each time step, and keep track of the relationship between the location and velocity information along the way. Its realisation introduces only a small 4 × 2 weight matrix that requires training.

The results from Eqs. (12)-(15) are fed into the network at time step t + 1. The procedure described above is repeated until the time step t = Tpred is reached.

It should be noted that the two types of attention mechanisms described so far differ in two respects:
1) The two temporal attention mechanisms (see the previous subsection) operate on the location hidden state vectors and velocity hidden state vectors separately, whereas the location-velocity attention layer inside the tweak module (Figure 2) operates on the location and velocity coordinates together.
2) The temporal attention mechanisms capture the relationship of the hidden state vectors between the observation phase and the prediction phase, whereas the location-velocity attention layer captures the relationship between the location and velocity coordinates at the prediction phase only.
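To make the two mechanisms of this section concrete, the sketch below gives a minimal PyTorch rendering of the temporal attention of Eqs. (8)-(10) and of the tweak module of Eqs. (11)-(15), followed by a schematic outline of one prediction step. This is our own illustrative re-implementation rather than the authors' released code; the class names are ours and the batched computation of the scores is an assumption.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """General score of Eq. (9), softmax weights of Eq. (10), context of Eq. (8).

    One instance (with its own weight matrix W_T) is used for the location
    stream and another for the velocity stream."""

    def __init__(self, hidden_size: int = 128):
        super().__init__()
        self.W_T = nn.Linear(hidden_size, hidden_size, bias=False)  # W_T in Eq. (9)

    def forward(self, obs_hiddens: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        """obs_hiddens: (batch, T_obs, N_h), hidden states h_1, ..., h_{T_obs}.
        h_prev: (batch, N_h), hidden state from the previous prediction step.
        Returns the context vector c_t of shape (batch, N_h)."""
        # Eq. (9): score_s = h_s^T W_T h_{t-1} for every observed time step s
        scores = torch.bmm(obs_hiddens, self.W_T(h_prev).unsqueeze(2)).squeeze(2)
        beta = torch.softmax(scores, dim=1)               # Eq. (10): (batch, T_obs)
        # Eq. (8): context vector as the weighted sum of observed hidden states
        return torch.bmm(beta.unsqueeze(1), obs_hiddens).squeeze(1)

class TweakModule(nn.Module):
    """LV attention layer, softmax layer, and tweak layer of Eqs. (11)-(15)."""

    def __init__(self):
        super().__init__()
        self.f_LV = nn.Linear(4, 2)   # LV attention layer: 4 inputs -> 2 outputs

    def forward(self, xy, xy_hat, uv_hat):
        """xy: (batch, 2) input location (x_t, y_t) at the current time step.
        xy_hat, uv_hat: (batch, 2) location and velocity predicted at time t.
        Returns the tweaked inputs for time step t + 1."""
        alpha = torch.softmax(self.f_LV(torch.cat([xy_hat, uv_hat], dim=1)), dim=1)
        a_l, a_v = alpha[:, :1], alpha[:, 1:]            # alpha_t^l, alpha_t^v, Eq. (11)
        xy_next = a_l * xy_hat + a_v * (xy + uv_hat)     # Eqs. (12)-(13)
        uv_next = xy_next - xy                           # Eqs. (14)-(15)
        return xy_next, uv_next

# One prediction step (schematic outline only):
#   xy_hat, uv_hat = output layers applied to h_l and h_v      (Eqs. (3) and (6))
#   xy, uv         = tweak(xy, xy_hat, uv_hat)                 (Eqs. (11)-(15))
#   c_l, c_v       = temporal attention over the observed hidden states (Eq. (8))
#   h_l, h_v       = LSTM updates that also take the context vectors     (Eq. (7))
```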
IV. EXPERIMENTS
A. DATASETS AND METRICS

FIGURE 3. ADE (top row) and FDE (bottom row) on the validation set of the Central Station dataset under different hyperparameter combinations. From left to right, the dropout rate is 0.1, 0.2, and 0.5 for each column.

Central Station Dataset. This large publicly available dataset [52] contains over 10,000 trajectories extracted from a 33-minute long surveillance video. The scene resolution is 720 (width) × 480 (height) pixels. We preprocessed the dataset by normalizing all the trajectory coordinates so that they are in the [0, 1] range. We use 10-fold cross-validation to evaluate the performance of our method. Each fold is in turn used as the test set while the remaining folds are used as the training set. The average performance from the 10 folds is reported for this dataset.

ETH/UCY Dataset. The combined ETH [6] and UCY [53] dataset is another widely used public dataset for evaluating trajectory prediction methods. It contains a total of 5 scenes, which have over 1,000 trajectories altogether, known as ETH, HOTEL (both are from ETH), UNIV, ZARA1, and ZARA2 (the last three are from UCY). We followed the common leave-one-out evaluation policy that has been used in [11, 22, 30, 20], i.e., we trained on four scenes and tested on the remaining scene. The labelled coordinates of each pedestrian are given in metres. The prediction results of our method and its variants reported later in the paper are the performance on the test set.

Edinburgh Dataset. This dataset consists of trajectories of people walking around the Informatics Forum at the University of Edinburgh [54]. It covers several months of observations, resulting in over 92,000 trajectories in total. Same as [47], 20,000 trajectories and 5,000 trajectories were randomly sampled to form the training set and the test set. The scene images are 640 × 480 pixels, where each pixel covers a 24.7mm × 24.7mm region on the ground. This dataset was used specifically for evaluating the generalizability of the proposed architecture, i.e., we trained LVTA using the Central Station dataset and tested its prediction performance on the test set of this dataset.

Evaluation Metrics. Similar to the previous work [11, 25, 10, 29, 22, 17, 18, 27], we use the average displacement error (ADE) [6] and the final displacement error (FDE) [11] metrics to quantitatively evaluate the trajectory prediction performance of each method. The former is the mean Euclidean distance between all the points in the predicted and ground truth trajectories averaged over all the trajectories. The latter is the average Euclidean distance between their final points (or the destination points). Where appropriate, we also use the normADE, which is the ADE normalized with respect to the image size.

B. IMPLEMENTATION DETAILS
Our LVTA method and its variants were trained by the Adam optimizer [55] with 0.001 learning rate for 500 epochs with a mini batch size m = 128. The loss function L is given by:

  L(X_gt, X_pred) = (1/m) Σ_{i=1}^{m} ‖X_gt^i − X_pred^i‖_2,    (16)

where X_gt^i and X_pred^i denote the ground truth trajectory and predicted trajectory of the ith pedestrian in the mini batch. As the velocity term for each time step is not used for the ADE and FDE computation, it is not included in the loss function.

Our proposed LVTA and its variants were implemented(1) using the Pytorch framework in Python and trained with an NVIDIA GeForce GTX-1080 GPU. For some of the experiments on the ETH/UCY dataset, we used an NVIDIA Titan XP GPU.

(1) Codes can be found: https://github.com/xuehaouwa/LVTA.

1) Hyperparameter Tuning
We focus on three hyperparameters that are crucial to the performance of the proposed trajectory prediction method: the dropout rate, the hidden dimension Nh of the LSTM layers, and the embedding dimension Ne of the embedding layers.

To investigate how these three hyperparameters influence the performance of LVTA, all the combinations of the following hyperparameter values were evaluated:
• Nh: [ 32, 64, 128, 256 ]
• Ne: [ 64, 128, 256 ]
• dropout: [ 0.1, 0.2, 0.5 ]
resulting in 36 experiments in total.
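This 36-run sweep can be written as a simple grid search; the sketch below is illustrative only, and train_and_validate is a hypothetical helper standing in for training LVTA with the given hyperparameters and returning its validation ADE and FDE.

```python
from itertools import product

HIDDEN_SIZES = [32, 64, 128, 256]
EMBED_SIZES = [64, 128, 256]
DROPOUTS = [0.1, 0.2, 0.5]

def grid_search(train_and_validate):
    """Evaluate all 4 x 3 x 3 = 36 hyperparameter combinations.

    train_and_validate(n_h, n_e, dropout) is a hypothetical helper that trains
    LVTA with the given settings and returns (ADE, FDE) on the validation set."""
    results = {}
    for n_h, n_e, dropout in product(HIDDEN_SIZES, EMBED_SIZES, DROPOUTS):
        results[(n_h, n_e, dropout)] = train_and_validate(n_h, n_e, dropout)
    # pick the combination with the lowest validation ADE
    best = min(results, key=lambda k: results[k][0])
    return best, results
```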

The above sets of values for the hidden dimension and embedding dimension are chosen based on the values used in other trajectory prediction methods in the literature, e.g., {Nh = 128, Ne = 64} is used in [11] and {Nh = 256, Ne = 128} is used in [21].

For this hyperparameter tuning process, we randomly selected from the Central Station dataset 80% and 10% of the trajectories to form the training and validation sets. The ADE and FDE for different combinations of hyperparameter values on the validation set are shown in Figure 3. From left to right, the dropout rate is 0.1, 0.2, and 0.5 for each column. In each subfigure, a lighter background colour means a lower error (i.e., better prediction performance). It can be seen in Figure 3 that the prediction performance of LVTA becomes worse when the hidden size Nh (same for the embedding size Ne) is either too small or too large. These results are expected as Nh and Ne determine the number of parameters (weights of each layer in the network) that require training. Large Nh and Ne values result in more parameters and may lead to overfitting problems, whereas small Nh and Ne values do not allow the algorithm to model the complex trajectory patterns. For all the experiments reported in the rest of this section, both the embedding size Ne and the hidden size Nh were therefore set to 128 and the dropout rate was set to 0.5, as this hyperparameter value combination (see Figure 3) has the lowest ADE (9.17) and FDE (17.04) on the validation set.

FIGURE 4. A loss plot showing the decrease of Mean Squared Error (MSE) over the number of training epochs for the training and validation sets. The optimal hyperparameter values used in the plot are: Ne = Nh = 128; dropout rate = 0.5.

Figure 4 illustrates the loss plot of the training set and validation set for this optimal hyperparameter setting. The figure shows how the mean squared error (MSE) varies as the number of epochs increases. In the region of 300-500 epochs, the training curve (blue) continues to drop but the decrease in MSE is very small, whereas the validation curve (red) fluctuates slightly and slowly comes to a plateau. The two curves are very close to each other, indicating that the network is well behaved and that there is no overfitting issue. The MSE that is minimized in the loss function (Eq. 16) is analogous to the average displacement error (ADE) in trajectory prediction.

2) Ablation Study
We evaluate the prediction performance of four variants of our LVTA method:

The vanilla LV model: This variant is the LVTA architecture with both the tweak module and the temporal attention mechanism removed. It can be considered as a modified vanilla LSTM model. The difference between the traditional vanilla LSTM model and the vanilla LV model is the input and output of the network. For the vanilla LSTM, the input and output at each time step are 2D location vectors of the trajectory in the form (x_t, y_t); for the vanilla LV model, these are vectors of the form (x_t, y_t, u_t, v_t), i.e., X_pred ∈ R^{Tpred×4} and the predicted trajectory is obtained by extracting the (x_t, y_t) coordinates from the 4D vectors.

The Constant Location-Velocity Attention model (CLVA): This variant of LVTA has the tweak module (Figure 2) modified by removing both the location-velocity attention layer (f_LV) and the softmax layer. Rather than training the network to learn f_LV for the optimal values of α_t^l and α_t^v, the CLVA model has both α_t^l and α_t^v set to 0.5 for all the time steps in the prediction phase, i.e., α_t^l = α_t^v = 0.5, for Tobs + 1 ≤ t ≤ Tobs + Tpred.

The Temporal Attention model (LVT): This variant is the LVTA architecture with only the tweak module removed. Thus, the location-velocity attention mechanism (see Figure 2) inside the tweak module is not included either. LVT contains the two temporal attention mechanisms shown in Figure 1.

The Location-Velocity Attention model (LVA): This variant does not have the two temporal attention mechanisms. It is the same method originally proposed in our previous work [27]. LVA contains the full tweak module, i.e., including the location-velocity attention mechanism.

3) Methods Being Compared
We compare the performance of our LVTA method as well as its four variants (vanilla LV, CLVA, LVT, and LVA) against 12 methods listed below: Linear, Social Force Model (SFM) [5], Linear Trajectory Avoidance (LTA) [6], Behaviour CNN [32], Vanilla LSTM, SA-GAIL [25], Social-LSTM [11], Attention-LSTM [47], SGAN [22], Nikhil and Morris [14], Liang et al. [21], and SR-LSTM [20]. Depending on the dataset and availability of results, not all methods were compared on all datasets.

C. RESULTS ON THE CENTRAL STATION DATASET
Table 1 shows the trajectory prediction results on the test set of the Central Station dataset. We followed the setting used in [25] and fixed Tobs = 9 and Tpred = 8. The results show that the constant velocity method is only able to learn relatively straight path trajectories and performs very poorly. With the attractive forces (moving towards destinations) and the repulsive forces (avoiding collision with other people or obstacles) incorporated into the models, both the Social Force method and the LTA method give superior performance over the constant velocity method. It is evident from Table 1 that the more recent deep learning based prediction methods (the last 8 rows of the table) are way ahead of the former 3 prediction methods by a large margin.

TABLE 1. Prediction errors of different prediction methods on the Central Station dataset.

  method                 normADE   ADE     FDE
  Linear*                5.86%     –       –
  SFM* [5]               4.45%     –       –
  LTA* [6]               4.35%     –       –
  Vanilla LSTM*          2.39%     14.57   27.78
  Behaviour CNN* [32]    2.52%     –       –
  SA-GAIL* [25]          1.98%     11.98   23.05
  Our vanilla LV         2.23%     13.78   23.97
  Our CLVA               2.09%     12.29   21.74
  Our LVT                1.88%     10.96   20.56
  Our LVA                1.65%     10.05   19.43
  Our LVTA               1.55%     9.38    17.22
  * results extracted from [25]

Comparing with the Behaviour CNN method, even the vanilla LSTM gives a smaller prediction error. The results on this dataset show that, for time sequence data, LSTM based methods are more suitable than CNN based methods.

The prediction results of vanilla LV, CLVA, LVT, LVA, and LVTA are given in the last five rows of Table 1. Our LVTA outperforms the other baselines and the state-of-the-art methods on all three metrics. LVA takes the second place after LVTA. Although the prediction result of CLVA is worse than SA-GAIL, LVT, LVA, and LVTA, it performs better than the other baseline methods. The extra velocity information being passed to vanilla LV proves to be useful, as the prediction result from vanilla LV is slightly better than that from vanilla LSTM. Compared to CLVA (having 2.09% normalized ADE), the location-velocity attention layer in the tweak module helps LVTA gain a significant improvement (at 1.55% normalized ADE) on the prediction results. If the tweak module is completely removed from the network architecture as in the variant LVT, the performance is worse than LVTA. This further confirms that the two temporal attention mechanisms and the tweak module in the LVTA architecture work well together and help to reduce the prediction errors.

Figure 5 shows a qualitative comparison of some prediction results from the vanilla LSTM, LVA, and LVTA models. For simple cases (Figure 5(a) and (b)) where the trajectories are almost linear, all three methods generate trajectories very close to the ground truth trajectories. With the inclusion of the velocity LSTM layer, both LVA and LVTA are able to give more precise predicted trajectories in the examples of turning slightly (Figure 5(c)) and turning abruptly (Figures 5(d)-(f)) captured in the observation phase. Although LVA and vanilla LSTM both give reasonably good trajectory directions, LVTA generates more accurate predicted trajectories that almost overlap with the ground truth trajectories.

Extremely challenging cases are shown in Figure 5(g) and (h). Unlike the turning cases in Figures 5(d)-(f), the turning occurs very late in the observed part (blue colour) of each trajectory. Although not completely coinciding with the ground truth trajectories, LVTA still manages to predict the turning trend and generate more plausible trajectories than the vanilla LSTM and LVA methods. These challenging examples demonstrate the effectiveness in the prediction phase of the temporal attention mechanism that is present in LVTA but absent in LVA.

Two failure cases are shown in Figure 5(i) and (j). There are two reasons for the failed predictions in these two examples. First, the turning movements occur after the observation part so, using the observed trajectories alone, all three methods fail to predict these late changes of walking direction. Second, both ground truth trajectories in the prediction phase are very sharp U-turns (almost 180°) that are very different from the gentle turns shown in Figures 5(d)-(f) earlier, where LVTA is still able to give plausible predictions. To improve the prediction results for these cases, more training data having late and sudden changes of walking direction would be required. It may also help if higher order terms such as acceleration are incorporated into the network architecture.

D. RESULTS ON THE ETH/UCY DATASET
For each test scene of this dataset, we first combined the trajectories of all the training scenes. We then carried out a normalization step by setting the origin at the centroid of the trajectories and scaled them so that all the trajectory coordinates are in the range [−1, 1]. The same normalization parameters were applied to the test set. After performing trajectory prediction on the test set, the inverse normalization was applied to yield the ADE and FDE in metres.

We followed the setting used in [11, 22, 30, 20] and fixed Tobs = 8 and Tpred = 12. Compared to the other two datasets, the number of trajectories in this dataset is relatively small. As in [20], we therefore augmented the training set by
• splitting long trajectories to form trajectories of length Tobs + Tpred = 20 using a sliding time window of stride size 1; and
• performing random rotation.
In addition, we also performed trajectory reversal and swapped the x and y-coordinates (equivalent to 90° rotation). Trajectory rotation is in general not used for data augmentation as the augmented trajectories might not be valid (e.g., the trajectories might represent pedestrians walking into obstacles). However, since this dataset is about using training trajectories captured from up to four different scenes to predict trajectories in the fifth scene, trajectories are not bound to any scene context. It should be noted also that, for our LVTA method and its variants, trajectory reversal does not provide extra training information to the velocity LSTM layer, as the velocity coordinates of a reversed trajectory are simply the negated version of the original trajectory; however, reversed trajectories do provide extra information to the training of the location LSTM layers of these models.
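The augmentation steps just described (sliding-window splitting with stride 1, random rotation, trajectory reversal, and swapping of the x and y coordinates) can be sketched as follows. This is our own illustrative code; in particular, drawing the rotation angle uniformly at random is an assumption, as the paper does not state how the rotations are sampled.

```python
import numpy as np

def sliding_windows(traj: np.ndarray, length: int = 20, stride: int = 1):
    """Split a long trajectory (T, 2) into windows of T_obs + T_pred = 20 steps."""
    return [traj[s:s + length] for s in range(0, len(traj) - length + 1, stride)]

def augment(traj: np.ndarray):
    """Return rotated, reversed, and axis-swapped copies of a (20, 2) trajectory."""
    theta = np.random.uniform(0.0, 2.0 * np.pi)          # rotation angle (assumed uniform)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    rotated = traj @ R.T                                  # random rotation
    reversed_traj = traj[::-1].copy()                     # trajectory reversal
    swapped = traj[:, ::-1].copy()                        # swap the x and y coordinates
    return [rotated, reversed_traj, swapped]
```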

FIGURE 5. Qualitative comparison of prediction results from vanilla LSTM, LVA, and LVTA. Colour codes: blue: input observed trajectories; green: ground truth trajectories; yellow: prediction by vanilla LSTM; pink: prediction by LVA; red: prediction by LVTA. The background scene used in each subfigure is the first frame of the prediction phase. Two failure cases are given in the last two subfigures ((i) & (j)).

TABLE 2. Prediction errors (in metres) of different prediction methods on the ETH/UCY dataset. The results marked with * are taken from [22]. Top-1, top-2, top-3 results are shown in blue, red, and green. The results from SR-LSTM [20] come from the best 20m × 20m neighbourhood region reported by the authors.

  Performance (ADE / FDE in metres)
  Datasets   Linear*       Vanilla LSTM*   Social-LSTM* [11]   SGAN 1V-1* [22]   Nikhil & Morris [14]   SR-LSTM [20]   Liang et al. [21]   LVTA (ours)
  ETH        1.33 / 2.94   1.09 / 2.41     1.09 / 2.35         1.13 / 2.21       1.04 / 2.07            0.63 / 1.25    0.88 / 1.98         0.57 / 1.10
  HOTEL      0.39 / 0.72   0.86 / 1.91     0.79 / 1.76         1.01 / 2.18       0.59 / 1.17            0.37 / 0.74    0.36 / 0.74         0.42 / 0.69
  UNIV       0.82 / 1.59   0.61 / 1.31     0.67 / 1.40         0.60 / 1.28       0.57 / 1.21            0.51 / 1.10    0.62 / 1.32         0.55 / 1.19
  ZARA1      0.62 / 1.21   0.41 / 0.88     0.47 / 1.00         0.42 / 0.91       0.43 / 0.90            0.41 / 0.90    0.42 / 0.90         0.42 / 0.92
  ZARA2      0.77 / 1.48   0.52 / 1.11     0.56 / 1.17         0.52 / 1.11       0.34 / 0.75            0.32 / 0.70    0.34 / 0.75         0.35 / 0.75
  Average    0.79 / 1.59   0.70 / 1.52     0.72 / 1.54         0.74 / 1.54       0.59 / 1.22            0.45 / 0.94    0.52 / 1.14         0.46 / 0.92

Table 2 shows the prediction performance of our LVTA versus two baseline methods (Linear and Vanilla LSTM) and five state-of-the-art methods on the five scenes. On each row, the top three performing methods are highlighted in blue, red, and green. The last row of the table shows the average performance over the five scenes. Although both the SGAN method [22] and Liang et al.'s method [21] can generate multiple trajectory predictions through their GAN based architectures and have better prediction results, for fair comparison, we only include their single prediction performances in the table. On the other hand, SoPhie [30] is not included in the comparison in Table 2 because its single prediction results are not reported by the authors. For the SR-LSTM method [20], multiple results with different configurations have been reported in [20] and the authors' best prediction results are shown in the table.

Unlike the traditional CNN architecture which focuses on the spatial information only, the CNN based prediction method of Nikhil and Morris [14] uses parallelizable convolutional layers to handle temporal dependencies. Their method appears to give prediction results that are comparable with other LSTM based methods.

For the ETH scene, our LVTA method outperforms all other methods with a 0.57m ADE and 1.10m FDE, leading by a comfortable margin over the runner-up SR-LSTM. For the other four scenes, our LVTA's ADEs and FDEs are among the top three methods. On average, LVTA takes the first spot on FDE at 0.92m and the second spot on ADE at 0.46m, which is only 0.01m behind the winner SR-LSTM.

FIGURE 6. Illustration of predicted trajectories on the ETH and HOTEL scenes of the ETH/UCY dataset. Colour codes: blue: input observed trajectories; green: ground truth trajectories; pink: prediction by LVTA.

It should be noted that the training and test trajectories of all the five scenes in the ETH/UCY dataset cover roughly a 24m × 24m region. The SR-LSTM model incorporates a 20m × 20m neighbourhood region and a pedestrian-wise attention layer to model the influence from other pedestrians. This large neighbourhood region used in SR-LSTM is almost the entire scene. This means that much computational work is needed in SR-LSTM to store the hidden states of other pedestrians. Compared to SR-LSTM, LVTA is computationally more efficient as only the POI information is required to predict its trajectory. Furthermore, when the neighbourhood region shrinks to 4m × 4m, the average prediction errors of SR-LSTM increase to 0.49m for ADE and 1.06m for FDE (these results are reported in Table 2 of [20]).

FIGURE 7. The ADE and FDE of prediction results from our LVA and LVTA methods for different prediction horizons (Tph). Dashed lines represent models that are trained with Tpred = 16 and solid lines are models trained with Tpred = 8. Results of using LVA and LVTA are shown in blue and red respectively.

TABLE 3. Error performance (in metres) of different prediction methods on the Edinburgh dataset.

  method                 transfer   ADE     FDE
  SFM* [5]                          3.124   3.909
  Social-LSTM* [11]                 1.524   2.510
  Attention-LSTM* [47]              0.986   1.311
  LVA-8                  ✓          1.194   2.176
  LVA-16                 ✓          1.540   2.417
  LVTA-8                 ✓          0.890   1.706
  LVTA-16                ✓          1.182   1.914
  * results extracted from [47]

If we use these errors in our Table 2 instead, our average ADE will move to the first place, ahead of SR-LSTM.

Some prediction results from the ETH/UCY datasets are illustrated in Figure 6. The image coordinates of the overlaid trajectories are converted using the homography matrix provided for each scene in the dataset. The figure shows that LVTA can generate plausible trajectories for different cases such as stopping and slowing down.

E. GENERALIZABILITY STUDY
1) Predicting Trajectories of Different Prediction Horizons
To distinguish prediction lengths of trajectories in the training and testing stages, we adopt the term prediction horizon [23], denoted by Tph from hereon, to mean the prediction length in the testing stage. The objective of the experiments in this section is to evaluate how well the LVTA architecture can be generalized to produce trajectories of different prediction horizons.

For the LVA and LVTA models that have been trained to predict n time steps, i.e., Tpred = n, we represent them as LVA-n and LVTA-n. Once a network is trained, it can be used to predict trajectories of any Tph value. With two settings {Tobs = 9, Tpred = 8} and {Tobs = 9, Tpred = 16} on the training set of the Central Station dataset, we trained four separate models: LVA-8, LVTA-8, LVA-16, and LVTA-16. Their performance on predicting trajectories of various prediction horizon values was then compared on the test set.

Prior to the training stage, trajectories of a suitable length need to be extracted from the training set. For LVA-16, trajectories must be at least Tobs + 16 = 9 + 16 = 25 time steps long; for LVA-8, they only need to have at least Tobs + 8 = 9 + 8 = 17 time steps. In total, only 9,328 trajectories could be used to train LVA-16; however, 14,739 trajectories were available to train LVA-8. Thus, while one might expect that a network that is trained to predict trajectories of a longer prediction length should perform better than one that is trained to predict trajectories of a shorter prediction length, it is not always the case, as the former network is exposed to fewer, and therefore less diverse, trajectories.

Figure 7 shows the ADEs and FDEs of the predicted trajectories from the four models mentioned above on the test set for 5 different Tph values: 8, 10, 12, 14, and 16. As expected, both the ADE and FDE increase with increasing Tph. Comparing LVA-8 with LVA-16, it shows that it is not always an advantage to train a network with a large Tpred when Tph is small, e.g., LVA-16 performs worse than LVA-8 when Tph = 8. Only when Tph increases to 16 does LVA-16 slightly outperform LVA-8. However, when comparing LVTA-8 against LVTA-16, we do not see the same pattern. For small Tph values (e.g., when Tph = 8), LVTA-8 outperforms LVTA-16 as expected. Furthermore, LVTA-8 is able to maintain its superior performance even when Tph > 12. This demonstrates the effectiveness of the extra temporal attention mechanism in the architecture.

Some example trajectories generated by LVTA-8 and LVTA-16 for the Central Station test set are shown as red and pink trajectories in Figure 8. The ground truth trajectories for the prediction phase and the input observed trajectories are given in green and blue. The first row has prediction horizon Tph = 8 and the second row has Tph = 16. It can be seen from the figure that LVTA-16 performs better than LVTA-8 in some occasional turning cases when Tph = 16. However, at the beginning of the prediction phase, the location coordinates predicted by LVTA-8 are more accurate. On average, LVTA-8 has smaller ADE and FDE than LVTA-16 for all the Tph values in our experiments (Figure 7).

2) Transferring to Other Scenes
Recall that the Edinburgh dataset has more training trajectories than the Central Station dataset (see Section IV-A). So it would be more advantageous to train a prediction model on the Edinburgh dataset and test it on the same dataset. The aim of the experiments conducted in this subsection is to show the generalizability of the LVTA architecture on transferring what is learned from one scene to another scene. It should be noted that, except for ZARA1 and ZARA2, the experiments on the ETH/UCY dataset described in the previous subsection are also tests on generalizability, as the training scenes and the test scene are different. Since existing methods report their performance on the Edinburgh dataset in metres, to be consistent with them, we use the pixel to metre relationship described in Section IV-A to convert the ADE and FDE values from pixels to metres.

FIGURE 8. Prediction results of different prediction horizons: Tph = 8 frames (first row) and Tph = 16 frames (second row) on the Central Station test set. Colour
codes: blue: input observed trajectories; green: ground truth trajectories; red: LVTA-8; pink: LVTA-16. The background scene in each subfigure is the first frame of
the prediction phase.

FIGURE 9. ADEs (top row) and FDEs (bottom row) of LVA-8, LVA-16, LVTA-8, and LVTA-16 under different observed lengths (Tobs) and different prediction horizons
(Tph) on the Edinburgh test set. All the models were trained on the Central Station training set. The lighter the colour of a cell, the better the performance.

Without any retraining, the four models, namely LVA-8, LVA-16, LVTA-8, and LVTA-16, that have been trained on the Central Station dataset (in pixels) described in Section IV-C are directly used to predict trajectories in the test set of the Edinburgh dataset. To be consistent with the test results from other methods reported in [47], Tobs and Tph were both set to 20. In Table 3, a method having a tick under the transfer column denotes that it has been trained on the Central Station training set instead of the Edinburgh training set. The prediction results show that LVTA-8 and LVTA-16 have similar performance to Attention-LSTM, while both LVTA-8 and LVTA-16 outperform their LVA counterparts. Even without being trained on trajectories from the same scene, LVTA-8 outperforms all the other techniques on the ADE.
To further investigate the generalizability of LVA and LVTA, we test the two pretrained models above on the Edinburgh test set with the Tobs value ranging from 7 to 21 and Tph ranging from 6 to 18. It is clear in Figure 9 that the ADEs and FDEs of both LVTA-8 and LVTA-16 are lower than those of LVA-8 and LVA-16. There are some other key results that can be observed from the figure as well. Firstly, along each row (i.e., when Tph is fixed), the cell for Tobs = 9 has the lowest ADE and FDE. This is not unexpected, as these models were originally trained with Tobs = 9. Secondly, as one moves away from the Tobs = 9 column, the ADEs and FDEs of the two LVTA models increase at a slower rate than those of the LVA counterparts. This can be observed from the more drastic change of colour along the columns of the two LVA tables. Thirdly, if one compares the bottom-right regions (where both Tobs and Tph are large) of the ADE and FDE tables for LVA-8 and LVA-16, one should notice that, for the cells corresponding to the same Tobs and Tph values, LVA-8 has smaller ADEs than LVA-16 but larger FDEs than LVA-16. These results indicate that, toward the end of the trajectories, the predicted locations from LVA-8 deviate more from the ground truth than those predicted by LVA-16. However, with the help of the extra temporal attention in LVTA, no similar behaviour is observed between the large (Tobs, Tph) regions of LVTA-8 and LVTA-16.
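Since the predictor is applied step by step, a pretrained model can be rolled out for an arbitrary prediction horizon, which is what allows a model trained with one Tph to be scored over the whole (Tobs, Tph) grid without retraining. The sketch below illustrates such a variable-horizon rollout; step() is a hypothetical stand-in for one forward pass of a pretrained location-velocity predictor, not the actual LVTA interface.

import numpy as np

def rollout(step, observed, t_ph):
    """Autoregressively extend an observed trajectory by t_ph steps.

    step(loc, vel, state) -> (next_loc, next_vel, next_state) is a
    stand-in for one forward pass of a pretrained recurrent predictor.
    observed: array of shape (t_obs, 2) with the observed (x, y) positions.
    """
    observed = np.asarray(observed, dtype=float)
    loc = observed[-1]                 # last observed location
    vel = observed[-1] - observed[-2]  # last observed velocity
    state = None                       # recurrent state, assumed warmed up on the observed steps
    predicted = []
    for _ in range(t_ph):              # the horizon is a runtime argument
        loc, vel, state = step(loc, vel, state)
        predicted.append(loc)
    return np.stack(predicted)         # shape (t_ph, 2)

Because the horizon is a runtime argument rather than part of the trained weights, LVTA-8 and LVTA-16 can both be evaluated at every (Tobs, Tph) cell of Figure 9.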
FIGURE 10. Prediction results on the Edinburgh dataset. Colour codes: blue: input observed trajectories; green: ground truth trajectories; pink: LVTA-8; red:
LVTA-16. Top row: Tobs = 9, Tph = 8; bottom row: Tobs = 20, Tph = 20.

The above experiments confirm that the improvement in prediction of LVTA over LVA is significant. They also demonstrate the advantage of incorporating the temporal attention mechanism into the network architecture.
Figure 10 shows some examples of predicted trajectories from LVTA-8 and LVTA-16 on the Edinburgh dataset. On the top row, Tobs = 9 and Tph = 8; on the bottom row, Tobs = Tph = 20. As expected, when the two LVTA models generate trajectories with a smaller prediction horizon (top row), they perform better than in cases where the prediction horizon is quite large (bottom row). Although the trajectories generated by the two models still follow the directions of the ground truth trajectories closely, deviations between the end points of the predicted and ground truth trajectories are noticeable in the last three cases in the bottom row of the figure.
F. COMPUTATION TIME
To analyze the computation time of LVA and LVTA, the vanilla LSTM prediction method is used as the baseline for comparison, as it is the basic LSTM trajectory prediction method. All the methods were implemented under the same coding environment (PyTorch version 0.3.1 and Python version 3.6) and trained on the Central Station dataset on the same desktop, which has a GTX-1080 GPU. The training times of these three methods are summarised in Table 4. The training time of LVA is more than twice that of the vanilla LSTM. This is due to the two LSTM layers (a velocity layer and a location layer) used in LVA, compared to only one LSTM layer in the vanilla LSTM method. The location-velocity attention mechanism is also responsible for the extra training time in LVA. Similar to LVA, LVTA also requires more than twice the training time of the vanilla LSTM. Compared to LVA, the additional training time required by LVTA mainly comes from the extra temporal attention mechanism.

TABLE 4. Computation time and performance comparison

Method         Training time   ADE reduction   FDE reduction
Vanilla LSTM   1×              –               –
LVA            2.34×           31.78%          30.49%
LVTA           2.89×           36.93%          37.80%

The last two columns of Table 4 denote the reduction in ADE and FDE of LVA and LVTA compared to the baseline vanilla LSTM. The percentage values in these columns are computed as follows. Let ε_van be the ADE (similarly for the FDE) of the vanilla LSTM method and ε be the ADE of LVA (similarly for LVTA). The error reduction is then defined as (ε_van − ε)/ε_van × 100%. The table shows a clear positive correlation between the training time and the improvement in trajectory prediction across the three methods, with LVTA achieving the smallest ADE and FDE.
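The same computation can be written as a one-line function; the numbers in the example are illustrative only (the baseline error is back-calculated for the example rather than taken from our measurements).

def error_reduction(err_vanilla, err_model):
    """Percentage reduction of a model's error relative to the vanilla LSTM,
    i.e. (eps_van - eps) / eps_van * 100%."""
    return (err_vanilla - err_model) / err_vanilla * 100.0

# Illustrative example: a 9.19-pixel ADE measured against a hypothetical
# 14.57-pixel vanilla baseline corresponds to a reduction of about 36.9%,
# which is the scale of the LVTA entry in Table 4.
print(round(error_reduction(14.57, 9.19), 2))   # 36.93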
V. CONCLUSION
We have presented a pedestrian trajectory prediction method that comprises a location LSTM layer, a velocity LSTM layer, and a tweak module which incorporates a joint location-velocity attention layer. Temporal attention mechanisms are used in the two LSTM layers. Experimental results demonstrate the effectiveness of the temporal attention mechanism in our LVTA method, which gives a significant improvement in prediction results over our previous LVA method. Compared to existing pedestrian trajectory prediction methods, our LVTA method outperforms several methods on the Central Station and Edinburgh datasets and performs competitively against recent state-of-the-art methods on the complex ETH/UCY dataset. Furthermore, our method is simpler in that it does not require scene context information or trajectories of neighbouring pedestrians in the scene.
We have also thoroughly evaluated the generalizability of LVTA in terms of using different observed lengths and prediction lengths, as well as applying a pretrained LVTA model to predict trajectories from a different scene. The results of these evaluations show that LVTA can yield good prediction results in such circumstances, even when the prediction length is different from that in the pretrained model. Compared to our previous LVA method, LVTA demonstrates even better generalizability.
In our proposed LVTA method, the two temporal attention mechanisms and the location-velocity (LV) attention mechanism are handled separately: the temporal mechanisms manipulate the hidden states from the LSTMs to create context vectors, whereas the LV attention mechanism is somewhat obscured, being hidden inside the tweak module. One possible future extension is to consider these two types of attention jointly by making some changes to the network structure of the prediction model. This may help improve the prediction performance further. Our other future research directions include incorporating LVTA into a trajectory analysis system that analyzes pedestrian trajectories for abnormal movement pattern detection, trajectory clustering, and trajectory counting for surveillance applications.

REFERENCES
[1] K. Saleh, M. Hossny, and S. Nahavandi, “Contextual recurrent predictive model for long-term intent prediction of vulnerable road users,” IEEE Transactions on Intelligent Transportation Systems, pp. 1–11, 2019.
[2] W. Zhang, L. Sun, X. Wang, Z. Huang, and B. Li, “Seabig: A deep learning-based method for location prediction in pedestrian semantic trajectories,” IEEE Access, vol. 7, pp. 109054–109062, 2019.
[3] C. Wang, L. Ma, R. Li, T. S. Durrani, and H. Zhang, “Exploring trajectory prediction through machine learning methods,” IEEE Access, vol. 7, pp. 101441–101452, 2019.
[4] B. Völz, H. Mielenz, I. Gilitschenski, R. Siegwart, and J. Nieto, “Inferring pedestrian motions at urban crosswalks,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 2, pp. 544–555, Feb 2019.
[5] D. Helbing and P. Molnar, “Social force model for pedestrian dynamics,” Physical Review E, vol. 51, no. 5, p. 4282, 1995.
[6] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool, “You’ll never walk alone: Modeling social behavior for multi-target tracking,” in ICCV. IEEE, 2009, pp. 261–268.
[7] I. Karamouzas, B. Skinner, and S. J. Guy, “Universal power law governing pedestrian interactions,” Physical Review Letters, vol. 113, no. 23, p. 238701, 2014.
[8] D. Xie, S. Todorovic, and S. C. Zhu, “Learning and inferring ‘dark matter’ and predicting human intents and trajectories in videos,” in ICCV, Dec 2013, pp. 2224–2231.
[9] S. Kim, S. J. Guy, W. Liu, D. Wilkie, R. W. Lau, M. C. Lin, and D. Manocha, “BRVO: Predicting pedestrian trajectories using velocity-space reasoning,” The International Journal of Robotics Research, vol. 34, no. 2, pp. 201–217, 2015.
[10] Y. Li, “Pedestrian path forecasting in crowd: A deep spatio-temporal perspective,” in ACMMM. ACM, 2017, pp. 235–243.
[11] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, F.-F. Li, and S. Savarese, “Social LSTM: Human trajectory prediction in crowded spaces,” in CVPR, June 2016, pp. 961–971.
[12] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. S. Torr, and M. Chandraker, “DESIRE: Distant future prediction in dynamic scenes with interacting agents,” in CVPR, 2017.
[13] L. Sun, Z. Yan, S. M. Mellado, M. Hanheide, and T. Duckett, “3DOF pedestrian trajectory prediction learned from long-term autonomous mobile robot deployment data,” in ICRA. IEEE, 2018, pp. 1–7.
[14] N. Nikhil and B. T. Morris, “Convolutional neural network for trajectory prediction,” in ECCV Workshop, September 2018.
[15] N. Deo and M. M. Trivedi, “Convolutional social pooling for vehicle trajectory prediction,” in CVPR Workshop, June 2018.
[16] Y. Xu, Z. Piao, and S. Gao, “Encoding crowd interaction with deep neural network for pedestrian trajectory prediction,” in CVPR, June 2018.
[17] I. Hasan, F. Setti, T. Tsesmelis, A. Del Bue, F. Galasso, and M. Cristani, “MX-LSTM: Mixing tracklets and vislets to jointly forecast trajectories and head poses,” in CVPR, June 2018.
[18] H. Xue, D. Q. Huynh, and M. Reynolds, “SS-LSTM: A hierarchical LSTM model for pedestrian trajectory prediction,” in WACV. IEEE, 2018, pp. 1186–1194.
[19] S. Haddad, M. Wu, H. Wei, and S. K. Lam, “Situation-aware pedestrian trajectory prediction with spatio-temporal attention model,” in 24th Computer Vision Winter Workshop, 2019, pp. 4–13.
[20] P. Zhang, W. Ouyang, P. Zhang, J. Xue, and N. Zheng, “SR-LSTM: State refinement for LSTM towards pedestrian trajectory prediction,” in CVPR, June 2019.
[21] J. Liang, L. Jiang, J. C. Niebles, A. G. Hauptmann, and L. Fei-Fei, “Peeking into the future: Predicting future person activities and locations in videos,” in CVPR, June 2019.
[22] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social GAN: Socially acceptable trajectories with generative adversarial networks,” in CVPR, June 2018.
[23] A. Vemula, K. Muelling, and J. Oh, “Modeling cooperative navigation in dense human crowds,” in ICRA. IEEE, 2017, pp. 1685–1692.
[24] F. Bartoli, G. Lisanti, L. Ballan, and A. Del Bimbo, “Context-aware trajectory prediction,” arXiv preprint arXiv:1705.02503, 2017.
[25] H. Zou, H. Su, S. Song, and J. Zhu, “Understanding human behaviors in crowds by imitating the decision-making process,” in AAAI, 2018.
[26] C. Schöller, V. Aravantinos, F. Lay, and A. Knoll, “What the constant velocity model can teach us about pedestrian motion prediction,” arXiv preprint arXiv:1903.07933, 2019.
[27] H. Xue, D. Huynh, and M. Reynolds, “Location-velocity attention for pedestrian trajectory prediction,” in WACV, Jan 2019, pp. 2038–2047.
[28] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg, “Who are you with and where are you going?” in CVPR. IEEE, 2011, pp. 1345–1352.
[29] H. Xue, D. Huynh, and M. Reynolds, “Bi-Prediction: Pedestrian trajectory prediction based on bidirectional LSTM classification,” in International Conference on Digital Image Computing: Techniques and Applications, 2017, pp. 307–314.
[30] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese, “SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints,” in CVPR, June 2019.
[31] A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, 2013.
[32] S. Yi, H. Li, and X. Wang, “Pedestrian behavior understanding and prediction with deep neural networks,” in ECCV. Springer, 2016, pp. 263–279.
[33] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” in NIPS, 2014, pp. 2672–2680.
[34] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in CVPR, 2018, pp. 8789–8797.
[35] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, “DeblurGAN: Blind motion deblurring using conditional adversarial networks,” in CVPR, 2018, pp. 8183–8192.
[36] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
[37] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in EMNLP, 2015, pp. 1412–1421.
[38] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015, pp. 2048–2057.
[39] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in ICCV, 2015, pp. 4507–4515.
[40] J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot, “Global context-aware attention LSTM networks for 3D action recognition,” in CVPR, 2017, pp. 1647–1656.
[41] J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot, “Skeleton-based human action recognition with global context-aware attention LSTM networks,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1586–1599, 2018.
[42] S. Li, S. Bak, P. Carr, and X. Wang, “Diversity regularized spatiotemporal attention for video-based person re-identification,” in CVPR, 2018, pp. 369–378.
[43] L. Wu, Y. Wang, X. Li, and J. Gao, “Deep attention-based spatially recursive networks for fine-grained visual recognition,” IEEE Transactions on Cybernetics, vol. 49, no. 5, pp. 1791–1802, May 2019.
[44] M. Yang, W. Tu, J. Wang, F. Xu, and X. Chen, “Attention based LSTM for target dependent sentiment classification,” in AAAI, 2017.
[45] F. Karim, S. Majumdar, H. Darabi, and S. Chen, “LSTM fully convolutional networks for time series classification,” IEEE Access, vol. 6, pp. 1662–1669, 2018.
[46] A. Vemula, K. Muelling, and J. Oh, “Social attention: Modeling attention in human crowds,” in ICRA, May 2018, pp. 1–7.
[47] T. Fernando, S. Denman, S. Sridharan, and C. Fookes, “Soft+hardwired attention: An LSTM framework for human trajectory prediction and abnormal event detection,” Neural Networks, vol. 108, pp. 466–478, 2018.
[48] A. Sadeghian, F. Legros, M. Voisin, R. Vesel, A. Alahi, and S. Savarese, “CAR-Net: Clairvoyant attentive recurrent network,” in ECCV, September 2018.
[49] agrimgupta92, “agrimgupta92/sgan,” Jun 2018. [Online]. Available: https://github.com/agrimgupta92/sgan
[50] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in NIPS, 2014, pp. 3104–3112.
[51] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in EMNLP, 2014, pp. 1724–1734.
[52] B. Zhou, X. Wang, and X. Tang, “Understanding collective crowd behaviors: Learning a mixture model of dynamic pedestrian-agents,” in CVPR. IEEE, 2012, pp. 2871–2878.
[53] A. Lerner, Y. Chrysanthou, and D. Lischinski, “Crowds by example,” in Computer Graphics Forum, vol. 26, no. 3. Wiley Online Library, 2007, pp. 655–664.
[54] B. Majecka, “Statistical models of pedestrian behaviour in the Forum,” Master’s thesis, School of Informatics, University of Edinburgh, 2009.
[55] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

HAO XUE received the B.S. and M.S. degrees from Harbin Institute of Technology, Harbin, China in 2014 and 2016 respectively. He is currently working toward the Ph.D. degree in the Department of Computer Science and Software Engineering, The University of Western Australia. His research is mainly on the aspect of pedestrian trajectory prediction.

DU Q. HUYNH (SM’15) received the Ph.D. degree in computer vision from The University of Western Australia (UWA), Perth, WA, Australia, in 1994. Since 1994, she has been with the Australian Cooperative Research Centre for Sensor Signal and Information Processing and Murdoch University, Perth. She is currently an Associate Professor with the Department of Computer Science and Software Engineering, UWA. She has previously researched on shape from motion, multiple view geometry, and 3-D reconstruction. Her current research interests include visual target tracking, video image processing, machine learning, and pattern recognition.

MARK REYNOLDS (M’09) received the Bachelor of Science degree (First-Class Hons.) in pure mathematics and statistics from the University of Western Australia (UWA), Perth, WA, Australia, in 1984, the Ph.D. degree in computing from the Imperial College of Science and Technology, University of London, London, U.K., 1989, and the Diploma of Education degree from UWA in 1989. He is a Professor and the Head of the School of Physics, Mathematics and Computing, UWA. His current research interests include artificial intelligence, optimization of schedules and real-time systems, optimization of electrical power distribution networks, machine learning, and data analytics.
