Applied Data Science Track Paper KDD ’19, August 4–8, 2019, Anchorage, AK, USA

Buying or Browsing? : Predicting Real-time Purchasing Intent using Attention-based Deep Network with Multiple Behavior
Long Guo, Lifeng Hua, Rongfei Jia, Binqiang Zhao, Xiaobo Wang, Bin Cui †
Alibaba Group, Beijing & Hangzhou, China
† School of EECS Key Laboratory of High Confidence Software Technologies (MOE), Peking University
{,issac.hlf,rongfei.jrf,binqiang.zhao,yongshu.wxb},[email protected]

ABSTRACT

E-commerce platforms are becoming a primary place for people to find, compare and ultimately purchase products. One of the fundamental questions that arises in e-commerce is to predict user purchasing intent, which is an important part of user understanding and allows for providing better services for both sellers and customers. However, previous work cannot predict real-time user purchasing intent with a high accuracy, limited by the representation capability of the traditional browse-interactive behavior adopted. In this paper, we propose a novel end-to-end deep network, named Deep Intent Prediction Network (DIPN), to predict real-time user purchasing intent. In particular, besides the traditional browse-interactive behavior, we collect a new type of user interactive behavior, called touch-interactive behavior, which can capture more fine-grained real-time user features. To combine these behaviors effectively, we propose a hierarchical attention mechanism, where the bottom attention layer focuses on the inner parts of each behavior sequence while the top attention layer learns the inter-view relations between different behavior sequences. In addition, we propose to train DIPN with multi-task learning to better distinguish user behavior patterns. In the experiments conducted on a large-scale industrial dataset, DIPN significantly outperforms the baseline solutions. Notably, DIPN achieves about an 18.96% improvement in AUC over the state-of-the-art solution that uses only traditional browse-interactive behavior sequences. Moreover, DIPN has been deployed in the operational system of Taobao. Online A/B testing results with more than 12.9 million users reveal the potential of knowing users’ real-time purchasing intent.

CCS CONCEPTS
• Information systems → Recommender systems; • Applied computing → Online shopping.

KEYWORDS
e-commerce, recommendation system, purchasing intent prediction, hierarchical attention mechanism, multiple behavior

ACM Reference Format:
Long Guo, Lifeng Hua, Rongfei Jia, Binqiang Zhao, Xiaobo Wang, Bin Cui †. 2019. Buying or Browsing? : Predicting Real-time Purchasing Intent using Attention-based Deep Network with Multiple Behavior. In The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’19), August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 9 pages.

1 INTRODUCTION
In the internet era, large e-commerce platforms such as Taobao and Amazon are becoming a primary place for people to find, compare and ultimately purchase products. As an important part of user understanding, it is crucial to know whether a customer is buying or just browsing on an e-commerce application, as it allows for providing better services for both sellers and customers. From the perspective of the sellers, knowing users’ current purchasing intent can increase their sales volume and profit margin. When the e-commerce platform has increased confidence that a subset of users are more likely to purchase, it can perform some proactive actions to maximize conversion based on this information. The platform may offer time-limited coupons or create bundles of complementary products to push the users to complete their purchases. From the perspective of the customers, recognizing users’ current buying or browsing intent is vital for the e-commerce platform to set appropriate strategies for the recommendation system and search engine to improve user experience.

Previous studies focus on leveraging traditional user behavior, which we call browse-interactive behavior, to predict users’ purchasing intent [15, 18, 19, 23]. However, limited by the representation capability and frequency of occurrence of the browse-interactive behavior, e.g., browse, search or collect a product, it is hard to predict users’ real-time purchasing intent depending solely on these actions. In other words, these actions contain insufficient information about user behavior patterns that would lead to a purchase in a short time. The purchase intent of a customer may slowly build over time and may not instantaneously lead to a purchase. As a result, it is challenging to identify the moment when the customer finally places the order. To this end, we need some more fine-grained behavior data to model user purchasing behavior patterns.

software on mobile devices, we take advantage of the sensors and accelerometers of the mobile devices to automatically glean the real-time context of user interactions, such as the swipe and tap actions. Compared with the browse-interactive actions, the touch-interactive actions occur more frequently. As shown in Table 1, the number of swipe actions and tap actions generated per user per day are 37.7 times and 9.3 times more than that of the browse-interactive actions, respectively. As a result, the touch-interactive behavior contains richer information about user behavior patterns. For example, we find that some customers would browse the product comments for a long time before they place the order. Such typical patterns can be easily captured by using the touch-interactive behavior. By combining the traditional browse-interactive behavior and the new touch-interactive behavior, we are able to model the user behavior patterns more comprehensively.

However, there exist several challenges in predicting users’ real-time purchasing intent. First, the touch-interactive behavior contains less semantic information than the browse-interactive behavior. Therefore, it is challenging to extract useful features from these data to improve the prediction performance. Second, it is necessary to figure out an effective fusion mechanism to combine the browse-interactive behavior and the touch-interactive behavior in order to bring their advantages into full play. Third, due to the complexity of the browsing behavior where the customers with different purchasing intent can appear to be very similar, it is essential to capture common features that can well depict the customers and unique features that would lead to different purchasing behavior.

In this paper, we propose a novel end-to-end deep network, named Deep Intent Prediction Network (DIPN), for the real-time purchasing intent prediction. In DIPN, the user behavior features are automatically learned from the raw data without the need for extensive feature engineering. In particular, we propose a hierarchical attention mechanism to fuse the views extracted from different interactive behavior sources. In the bottom attention layer, we design an intra-view attention mechanism which focuses on the inner parts of the behavior sequence. In the top attention layer, we propose an inter-view attention mechanism that learns the inter-view relations between different behavior sequences. In addition, we propose to train the real-time and long-term purchasing intent simultaneously with the same model. With the multi-task learning, DIPN can capture common features that well depict the customers and unique features that would lead to different purchasing behavior.

The contributions of the paper can be summarized as follows:

• We collect a new type of user behavior, the touch-interactive behavior, which contains rich information about user behavior patterns. Together with the traditional browse-interactive behavior, we are able to depict a user from different views for better performance of purchasing intent prediction.
• We propose a deep network DIPN for real-time purchasing intent prediction. A novel hierarchical attention mechanism is proposed to fuse multiple views extracted from different interactive behavior sources. In addition, multi-task learning is introduced to better distinguish user behavior patterns.
• We conduct extensive experiments to evaluate the performance of DIPN in both offline and online settings. Experimental results on a large-scale industrial dataset show the superiority of DIPN in predicting purchasing intent. In particular, DIPN has been deployed in the operational system of Taobao and adopted in the coupon allocation task at a shopping festival. Online A/B testing shows the benefits of knowing users’ real-time purchasing intent.

The rest of the paper is organized as follows. We discuss related work in Section 2, followed by the data description in Section 3. We describe the design of the DIPN model in Section 4 and give an overview of the deployment of DIPN in Section 5. We present experiments in Section 6 and conclude the paper in Section 7.

2 RELATED WORK
2.1 Purchasing Intent Prediction
The problem of purchasing intent prediction has been heavily studied, with a variety of classic machine learning and deep learning modelling techniques employed. The earliest work comes from the RecSys 2015 challenge [2], which provides a public dataset consisting of 9.2 million user-item click sessions. Given a session, the goal of the challenge is to predict whether the user is going to buy something or not within this session. Romov et al. [15] won the competition using GBM with extensive feature engineering on session summarizing. The other feature-based work includes the ensemble model with neural net and GBM used by [23] and the deep belief networks and stacked denoising auto-encoders by [22]. To reduce the feature engineering work, several works [19, 20, 25] adopt the recurrent neural network (RNN) to model the sequence nature of sessions, where a bi-directional LSTM is used in [19, 25] and a mixture of LSTM is used in [20].

Our work is distinguished from previous work in the following aspects. First, given a history session, our goal is to predict a user’s subsequent purchasing behavior within a given time, while the goal of previous work is to predict the purchasing behavior within the session. Our setting is more realistic because in reality we should predict the future behavior based on the current incomplete session. Second, the key difference of our work is that we collect touch-interactive actions to capture the real-time user behavior patterns. As a result, we need to handle several data sources in our model while previous work only deals with a single source.

2.2 Sequence Classification
The task of purchasing intent prediction is closely related to sequence classification. A brief survey by [26] categorizes the sequence classification methods into three groups: feature based methods [1, 13, 29], sequence distance based methods [10, 17, 24], and model based methods [4, 27, 31]. Our work is related to the model based approach, where we use an end-to-end deep network to model the sequences and save extensive feature engineering work. Our work is also related to sentence classification in natural language processing [9, 11, 30]. Text sentences and time series data are similar to each other in that they are both ordered sequences in nature. However, the semantic information contained in these two kinds of sequences is definitely different. Our work differs from the previous work in that we need to handle several different data sources with different formats while in traditional sequence classification, the data usually comes from a single source.


Type | Timestamp | Page | Position | Start Position | End Position | Duration
Open Page | 1531640194337 | Page_Home | / | / | / | /
Tap | 1531640195116 | Page_Home | x:336.5, y:473.0 | / | / | /
Swipe | 1531640196199 | Page_Home | / | x:242.5, y:573.5 | x:234.0, y:558.5 | 34
Leave Page | 1531640197067 | Page_Home | / | / | / | /
(a) Swipe-interactive behavior

Event | Page | Button | Timestamp
2101 | Page_Detail | Page_SearchItem_Button-Status | 1531640197187
2001 | Page_Detail | / | 1531640197199
2101 | Page_Detail | Page_Detail_Button-Comments | 1531640197219
(b) Tap-interactive behavior

Table 2: An example of raw data in the touch-interactive behavior dataset.

2.3 Multi-task Learning
Multi-task learning has been used successfully across various applications of machine learning, from natural language processing [3, 5] and speech recognition [6] to computer vision [8] and recommender systems [14]. By sharing representations between related tasks, multi-task learning can enable the model to capture more underlying factors and generalize better on its original task. Ruder [16] presents an overview of multi-task learning in deep learning, where multi-task learning is typically done with either hard or soft parameter sharing of hidden layers. The hard parameter sharing method is the most commonly used multi-task learning approach, which shares the hidden layers between all tasks and keeps several task-specific output layers. Collobert et al. [5] simultaneously learn several NLP tasks using a language model with embedding lookup table sharing. In [8], multi-task learning is adopted to improve the performance of classifying object proposals using deep convolutional networks. Ni et al. [14] use deep multi-task representation learning to generate user representations for personalization in an e-commerce portal. In soft parameter sharing, each task has its own model with its own parameters where the distance between the parameters is regularized. Duong et al. [7] use the $\ell_2$ distance for regularization while Yang et al. [28] use the trace norm. Our model is related to the hard parameter sharing method. We propose a novel way by partitioning a user’s purchasing intent into three different phases and use multi-task learning to learn the unique behavior that would lead to different purchasing intent.

To the best of our knowledge, our work is the first study that uses the attention-based deep network with multi-task learning on multiple user behavior sequences for real-time purchasing intent prediction.

We build two types of user interactive behavior dataset, i.e., the new touch-interactive behavior and the traditional browse-interactive behavior. In the following, we describe each dataset in detail.

3.1 Touch-interactive Behavior
The touch-interactive behavior dataset contains normal users’ daily touch-interactive information when using the Taobao app, which is composed of the swipe-interactive and tap-interactive behavior.

The swipe-interactive behavior. This behavior includes four types of basic actions, i.e., Open Page, Leave Page, Swipe and Tap. Table 2a shows an example of raw data in the swipe-interactive behavior. A user’s swipe-interactive track is a time sequence of actions, consisting of these four basic types of actions. Each action has a timestamp and a page index to identify when and where the action occurs. In addition, the positional coordinates of the action on the touch screen are also recorded. The duration represents how long the action lasts. As shown in Table 3a, for each action at a data point, we extract 14 raw features. The time duration of a swipe, time gap between two actions and positional coordinates of actions are continuous variables. Page indices, action indices and swipe directions (i.e., left/right and up/down) are categorical variables. We conduct discretization on all the raw features to ensure unified inputs for DIPN. The discretization of the continuous variables is described as follows:

• Position. The positional coordinates of actions are continuous values, and are discretized according to the resolution of the touch screen. We divide the width of the screen into 17 uniform segments, and the length into 25 segments for one-hot vector encoding.
• Swipe Length. The length of a swipe is encoded into a one-hot vector, the length of which is twice the length of the one-hot vectors of position encoding. The reason for applying twice the length is that, for a swipe track, we consider the direction of the swipe.
• Time Gap and Duration. We apply a step function to encode time gaps between actions and swipe duration as follows:

$$
y = \begin{cases}
\lfloor x / f_s \rfloor, & x < f_b \\
\lfloor x / f_b + 9 \rfloor, & f_b \le x < 10 \times f_b \\
19, & x \ge 10 \times f_b
\end{cases}
$$

where $\{f_s = 100, f_b = 1000\}$ are used for time gap, and $\{f_s = 25, f_b = 250\}$ are used for time duration.

The tap-interactive behavior. This behavior records the information associated with the tap actions, as shown in Table 2b. A user’s tap-interactive track is a time sequence of tap actions. Each action has a timestamp and page index to identify when and where the action occurs. There is also an event id to identify whether a user taps on a page or a button. If a button is tapped, the button name is also recorded. As shown in Table 3b, we extract 3 raw features, all of which are categorical variables.
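The discretization rules of Section 3.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the screen size (`SCREEN_W`, `SCREEN_H`) and the assumption that timestamps and durations share the unit in which $f_s$ and $f_b$ are expressed are all hypothetical choices made for the example. The input values are taken from the Tap and Swipe rows of Table 2a.

```python
import math


def step_bucket(x, fs, fb):
    """Step function from Section 3.1, mapping x >= 0 to a bucket in 0..19.

    Fine steps of width fs below fb, coarse steps of width fb up to
    10 * fb, and one catch-all bucket beyond that.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    if x < fb:
        return math.floor(x / fs)
    if x < 10 * fb:
        return math.floor(x / fb + 9)
    return 19


def time_gap_bucket(gap):
    # Parameters {fs = 100, fb = 1000} are the ones given for the time gap.
    return step_bucket(gap, fs=100, fb=1000)


def swipe_duration_bucket(duration):
    # Parameters {fs = 25, fb = 250} are the ones given for the duration.
    return step_bucket(duration, fs=25, fb=250)


def position_bucket(coord, extent, n_segments):
    """Index of the uniform segment containing coord on an axis of size extent."""
    idx = int(coord * n_segments / extent)
    return min(max(idx, 0), n_segments - 1)  # clamp touches on the border


def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec


# Hypothetical screen size; the paper only says that the resolution is used.
SCREEN_W, SCREEN_H = 375.0, 667.0

# Tap and Swipe rows of Table 2a.
gap = 1531640196199 - 1531640195116      # time gap between the two actions
gap_vec = one_hot(time_gap_bucket(gap), 20)
dur_vec = one_hot(swipe_duration_bucket(34), 20)
tap_x = position_bucket(336.5, SCREEN_W, 17)
tap_y = position_bucket(473.0, SCREEN_H, 25)
```

The bucket counts line up with the Dictionary Dim column of Table 3a: 20 values for Time Gap and Swipe Duration, 17 and 25 for the X and Y positions.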


Feature | Dictionary Dim | Embedding Dim
Page Index | 224 | 32
Action Index | 4 | 4
Time Gap | 20 | 8
Tap Position X | 17 | 8
Tap Position Y | 25 | 8
Swipe Start Position X | 17 | 8
Swipe Start Position Y | 25 | 8
Swipe End Position X | 17 | 8
Swipe End Position Y | 25 | 8
Swipe Length on X | 34 | 8
Swipe Length on Y | 50 | 16
Swipe Right/Left | 2 | 2
Swipe Up/Down | 2 | 2
Swipe Duration | 20 | 8
(a) Swipe-interactive behavior

Feature | Dictionary Dim | Embedding Dim
Event Index | 2 | 2
Page Index | 200 | 16
Button Index | 500 | 32
(b) Tap-interactive behavior

Feature | Dictionary Dim | Embedding Dim
Type Index | 6 | 4
Leaf Category Index | 19011 | 32
Top Category Index | 102 | 16
Page Index | 181 | 16
Page Stay Time | 179 | 16
Timestamp | 25 | 4
(c) Browse-interactive behavior

Feature | Dictionary Dim | Embedding Dim
C.F. within one week | 20 | 8
... | ... | ...
C.F. within one year | 100 | 16
A.F. within one week | 20 | 8
... | ... | ...
A.F. within one year | 100 | 16
P.F. within one week | 20 | 8
... | ... | ...
P.F. within one year | 100 | 16
(d) User history feature

Feature | Dictionary Dim | Embedding Dim
Age Level | 9 | 4
Gender | 3 | 2
Buyer Star | 17 | 8
Tm Level | 6 | 4
VIP Level | 8 | 4
Phone Price | 11 | 4
... | ... | ...
(e) User profile feature

Table 3: Statistics of feature sets used in DIPN.

3.2 Browse-interactive Behavior
The browse-interactive behavior represents the typical behavior users conduct on products when browsing an e-commerce application. It includes five types of actions, i.e., browse a product, search a product, collect a product, add a product to cart and purchase a product. A user’s browse-interactive track is a time sequence of these actions. As shown in Table 3c, we extract 6 raw features for each action. The type index represents the type of an action. For the continuous variables, i.e., page stay time and timestamp, we perform a similar discretization operation as for the time gap in the swipe-interactive behavior.

A user’s purchase behavior has a high correlation with her historical behavior, which can represent the activeness of the user. Active users who behave more frequently in the history may keep this trend in the future, while those who are inactive may stay quiet for a long time. Therefore, we also collect some statistics for the historical behavior based on the browse-interactive behavior. In more detail, we count the frequency of three types of actions in different time windows. Table 3d shows the features extracted for each action, where C.F., A.F. and P.F. represent the frequency of collecting a product, adding a product to cart and purchasing a product, respectively. For each action, we count the frequency of the action within one week, two weeks, one month, three months, six months and one year. To improve personalization, we also collect the user profile dataset containing various users’ basic information, such as the age level and gender, as shown in Table 3e.

We propose a novel deep network, named Deep Intent Prediction Network (DIPN), for the real-time purchasing intent prediction. Figure 1 shows the model architecture of DIPN. It includes an embedding lookup layer which embeds the one-hot vectors of the raw action features into dense vectors, followed by a fully-connected layer. After that, a bidirectional recurrent layer is applied to each user behavior sequence to model the long-term dependencies between complex user actions. Then a hierarchical attention layer is applied to fuse the outputs from the recurrent layer. Finally, DIPN is trained with multi-task learning. In the following, we will introduce each component of DIPN in detail.

4.1 Embedding Layer
As introduced in Section 3, we use five groups of features in DIPN, i.e., User Swipe-interactive Behavior, User Tap-interactive Behavior, User Browse-interactive Behavior, User History and User Profile. In the discretization process, the raw values of every feature are encoded into one-hot vectors, the length of which is shown in the Dictionary Dim column of Table 3. Then the one-hot vectors are used as the inputs of DIPN. As the inputs are high dimensional binary vectors, we use the embedding layer to transform them into low dimensional dense representations. The embedding operation follows the table lookup mechanism. In more detail, each feature corresponds to one embedding matrix. For example, the embedding matrix of the Button Index feature in Table 3b is represented as $E_{button} = [e_1; e_2; \ldots; e_{n_b}] \in \mathbb{R}^{n_e \times n_b}$, where $e_i \in \mathbb{R}^{n_e}$ represents an embedding vector with dimension $n_e$, and $n_b$ represents the number of buttons that a user can tap. The embedding vector of the Button Index


feature can then be obtained as $E_{button} \cdot B_{button} \in \mathbb{R}^{n_e}$, where $B_{button} \in \mathbb{R}^{n_b}$ is the one-hot vector of the Button Index feature. The length of every embedded feature is shown in the Embedding Dim column of Table 3. At last, for each feature group, all the embedded features are concatenated into a vector and fed into a fully-connected layer for reshaping. During the training process, the embedding layer is trained at the same time as the model.

Figure 1: The model architecture of DIPN.

4.2 RNN Layer
The user interactive behaviors used in DIPN are all time sequences of actions. Therefore, we use RNN to model the long-term dependencies between actions. The adoption of RNN can eliminate the need for extensive feature engineering, which is very helpful because it is difficult to extract features from the touch-interactive behavior composed of swipe or tap actions with little semantic information. To avoid the vanishing gradient problem suffered by the standard RNN, LSTM and GRU are proposed to control the update of the information via gates. We take GRU to model the dependency because GRU is faster than LSTM and more suitable for e-commerce systems. The formulations of GRU are listed as follows:

$$
\begin{aligned}
r_t &= \sigma(W_{er} e_t + W_{hr} h_{t-1} + b_r) \\
z_t &= \sigma(W_{ez} e_t + W_{hz} h_{t-1} + b_z) \\
\tilde{h}_t &= \tanh(W_{eh} e_t + W_{hh} (r_t \odot h_{t-1}) + b_h) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
\tag{1}
$$

where $e_t$ is the embedding vector of the $t$-th action, $h_t$ is the $t$-th hidden state, $\sigma$ is the sigmoid function and $\odot$ is the element-wise product operator. To better capture the global information of the behavior sequences, we adopt a bidirectional recurrent layer composed of two GRU layers working in opposite directions. We obtain the representation of the $t$-th action by concatenating the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$, i.e., $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$. In this way, a behavior sequence is represented as $h = \{h_1, h_2, \ldots, h_n\} \in \mathbb{R}^{n \times 2d}$, where $d$ is the dimension of the hidden state.

In DIPN, we need to handle three types of behavior sequences, i.e., the swipe-interactive sequence, the tap-interactive sequence and the browse-interactive sequence. There are two ways to fuse these sequences: early fusion and late fusion. Early fusion refers to aligning the three sequences by timestamp before feeding them into a single GRU model, while late fusion refers to first feeding each sequence to a separate GRU model and then concatenating the output hidden features. One disadvantage of the early fusion method is that the behavior sequences usually have different densities, as shown in Table 1. When aligning the sequences by timestamp, a dense sequence could dominate the concatenated feature space and override the effects of a sparse but important sequence. In addition, since the length of the GRU model is limited, the early fusion method would result in information loss of the dense sequence when truncating the sessions. Therefore, we propose to use the late fusion method and feed the three behavior sequences to separate Bi-GRU models, as shown in Figure 1. After the RNN layer, we get three hidden outputs, i.e., $h_s = \{h_{s1}, h_{s2}, \ldots, h_{sn}\} \in \mathbb{R}^{n \times 2d}$, $h_t = \{h_{t1}, h_{t2}, \ldots, h_{tn}\} \in \mathbb{R}^{n \times 2d}$ and $h_b = \{h_{b1}, h_{b2}, \ldots, h_{bn}\} \in \mathbb{R}^{n \times 2d}$, corresponding to the swipe-interactive, tap-interactive and browse-interactive sequence, respectively.

4.3 Hierarchical Attention Layer
To better fuse the views extracted from different behavior sequences, we propose a hierarchical attention mechanism, where the bottom attention layer focuses on the inner parts of each behavior sequence while the top attention layer learns the inter-view relations between different behavior sequences, as shown in Fig. 1. In the following, we introduce the hierarchical attention mechanism in detail.
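The embedding lookup of Section 4.1 and the late-fusion Bi-GRU of Section 4.2 can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the deployed model: the parameter names mirror Eq. (1), but the random initialization, the toy sizes (`n`, `d`, `input_dims`) and the helper names (`embed`, `init_gru`, `gru_step`, `run_gru`, `bi_gru`) are hypothetical, and the fully-connected reshape layer and training are omitted.

```python
import numpy as np


def embed(E, index):
    """Table lookup: same result as E @ one_hot(index), without the product."""
    return E[:, index]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(rng, n_in, d):
    """Random GRU parameters (illustrative initialization only)."""
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "W_er": w(d, n_in), "W_hr": w(d, d), "b_r": np.zeros(d),
        "W_ez": w(d, n_in), "W_hz": w(d, d), "b_z": np.zeros(d),
        "W_eh": w(d, n_in), "W_hh": w(d, d), "b_h": np.zeros(d),
    }


def gru_step(p, e_t, h_prev):
    """One GRU update, term by term as in Eq. (1)."""
    r = sigmoid(p["W_er"] @ e_t + p["W_hr"] @ h_prev + p["b_r"])
    z = sigmoid(p["W_ez"] @ e_t + p["W_hz"] @ h_prev + p["b_z"])
    h_tilde = np.tanh(p["W_eh"] @ e_t + p["W_hh"] @ (r * h_prev) + p["b_h"])
    return z * h_prev + (1.0 - z) * h_tilde


def run_gru(p, seq, reverse=False):
    """Run a GRU over seq (n x n_in) and return all hidden states (n x d)."""
    n, d = len(seq), p["b_r"].shape[0]
    h = np.zeros(d)
    out = np.zeros((n, d))
    steps = range(n - 1, -1, -1) if reverse else range(n)
    for t in steps:
        h = gru_step(p, seq[t], h)
        out[t] = h
    return out


def bi_gru(p_fwd, p_bwd, seq):
    """Concatenate forward and backward states: h_t = [fwd_t, bwd_t]."""
    fwd = run_gru(p_fwd, seq)
    bwd = run_gru(p_bwd, seq, reverse=True)
    return np.concatenate([fwd, bwd], axis=1)


rng = np.random.default_rng(0)
n, d = 6, 4                                         # toy sequence length and hidden size
input_dims = {"swipe": 5, "tap": 3, "browse": 7}    # hypothetical embedded sizes

# Late fusion: each behavior sequence gets its own Bi-GRU.
hidden = {}
for name, n_in in input_dims.items():
    seq = rng.normal(size=(n, n_in))                # stand-in for embedded actions
    hidden[name] = bi_gru(init_gru(rng, n_in, d), init_gru(rng, n_in, d), seq)
```

Each entry of `hidden` has shape `(n, 2 * d)`, matching $h_s$, $h_t$ and $h_b$ above. Keeping one Bi-GRU per sequence means a dense sequence such as swipes cannot crowd a sparse one out of a shared, truncated input window, which is the argument the text gives for late fusion.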


Intra-view Attention. The intra-view attention mechanism at the bottom tries to identify the important actions within each sequence that contribute more to the purchasing intent prediction. Intuitively, the current behavior conducted by a user is most predictive of purchasing intent. To capitalize on this, we calculate the attention score between each action in a sequence and the current action, which is formulated as:

$$
a_t = \frac{\exp(h_t W_a \overrightarrow{h}_n)}{\sum_{i=1}^{n} \exp(h_i W_a \overrightarrow{h}_n)},
\tag{2}
$$

where $\overrightarrow{h}_n \in \mathbb{R}^{d}$ is the final output state of the forward GRU model, $h_t \in \mathbb{R}^{2d}$ is the $t$-th output state of the Bi-GRU model, $W_a \in \mathbb{R}^{2d \times d}$, and $a_t$ is the attention score calculated for $h_t$. The attention score can reflect the relationship between $h_t$ and the current action $\overrightarrow{h}_n$. Therefore, the action that is more related to the current action can get a larger attention score.

Different from the traditional attention mechanism, which conducts a weighted sum pooling operation to calculate the final output, the intra-view attention mechanism applies the element-wise product on the outputs of the Bi-GRU and the corresponding attention score vector as follows:

$$
v_s = h_s \odot a_s, \quad v_t = h_t \odot a_t, \quad v_b = h_b \odot a_b,
\tag{3}
$$

where $v_s \in \mathbb{R}^{n \times 2d}$, $v_t \in \mathbb{R}^{n \times 2d}$ and $v_b \in \mathbb{R}^{n \times 2d}$ are the outputs of the intra-view attention layer, corresponding to the swipe-interactive sequence, tap-interactive sequence and browse-interactive sequence, respectively.

Inter-view Attention. The swipe-interactive, tap-interactive and browse-interactive behavior depict a user from different views simultaneously. For example, a user interested in a product would browse some comments about the product and other similar products for comparison before she finally places the order. This process would generate some browse, swipe and tap actions. It is important to utilize these relationships to model the purchasing intent of a user. However, the actions from different views are usually asynchronous. The reasons are twofold. First, the synchronous actions become asynchronous due to the different densities of each behavior. Second, the related actions are originally asynchronous, e.g., some actions result in the occurrence of other actions. To this end, we propose a novel inter-view attention mechanism to better discover the asynchronous interactions between different views.

The inter-view attention mechanism takes two views as inputs, as shown in Figure 2. In particular, for each action in one view, we calculate its distance with all the actions in the other view. We borrow this idea from the self attention mechanism in Transformer [21]. In this way, the asynchronous interactions between actions can be captured effectively. The inter-view attention mechanism $IA(v_s, v_b)$ is formulated as follows, where we take the swipe-interactive view $v_s$ and the browse-interactive view $v_b$ for example:

$$
\begin{aligned}
IA(v_s, v_b) &= A_s(v_b, v_s, v_s) \odot A_b(v_s, v_b, v_b) \\
A_s(v_b, v_s, v_s) &= \mathrm{softmax}\left(\frac{v_b v_s^{T}}{\sqrt{2d}}\right) v_s \\
A_b(v_s, v_b, v_b) &= \mathrm{softmax}\left(\frac{v_s v_b^{T}}{\sqrt{2d}}\right) v_b,
\end{aligned}
\tag{4}
$$

where $A_s \in \mathbb{R}^{n \times 2d}$ and $A_b \in \mathbb{R}^{n \times 2d}$ are attentive representations of $v_s$ and $v_b$, respectively, and the element-wise product $\odot$ is used to model the interactions between $A_s$ and $A_b$. Note that $IA(v_s, v_b)$ is a symmetric operation and returns the interaction representation $r_{sb}$ between $v_s$ and $v_b$. We can calculate $r_{st}$ and $r_{tb}$ following the same procedure. At last, we can get three representations $r_{sb} \in \mathbb{R}^{n \times 2d}$, $r_{st} \in \mathbb{R}^{n \times 2d}$ and $r_{tb} \in \mathbb{R}^{n \times 2d}$ after the inter-view attention layer.

Figure 2: Inter-view attention mechanism.

4.4 Multi-task Layer
In this work, we propose to train DIPN with multi-task learning. The reasons are twofold. First, with multi-task learning, DIPN can capture common features that well depict the customers and unique features that would lead to different purchasing behavior. Second, a model that learns multiple tasks simultaneously is able to learn a more robust representation and improve the generalization.

As shown in Fig. 1, we use two tasks in this layer, i.e., the real-time purchasing intent prediction and the long-term purchasing intent prediction, which are defined as a predictive measure of subsequent purchasing behavior within one hour and one day, respectively. We define the period of the long-term purchasing intent as one day because the purchasing behavior conducted one day later has a relatively low correlation with the current behavior sequence. As a result, we partition user purchasing intent into three phases: real-time phase, long-term phase and irrelevant phase. Note that we use multi-task learning to handle the multi-class learning problem. By training the real-time intent and long-term intent with two separate tasks, we are able to distinguish between the subtle differences of user behavior.

The outputs of the hierarchical attention layer are first flattened and then concatenated with the user history feature, user profile feature and the concatenated last forward and backward state of the Bi-GRU models. The concatenated feature vectors are fed into two different branches. In each branch, fully connected layers are used to learn the combination of features automatically. The loss functions of the real-time purchasing intent prediction task and long-term purchasing intent prediction task are defined as follows:

$$
\begin{aligned}
L_{short} &= -\frac{1}{N} \sum_{(x,y) \in \mathcal{D}} \big( y \log p_s(x) + (1 - y) \log(1 - p_s(x)) \big) \\
L_{long} &= -\frac{1}{N} \sum_{(x,y) \in \mathcal{D}} \big( y \log p_l(x) + (1 - y) \log(1 - p_l(x)) \big)
\end{aligned}
\tag{5}
$$

where $\mathcal{D}$ is the training set with size $N$, $x$ is the input of the network and $y$ is the label, $p_s(x)$ and $p_l(x)$ represent the predicted


probability of sample $x$ being purchased within one hour and one day, respectively. The global loss used in our DIPN model is:

$$
L_{global} = L_{short} + L_{long}. \tag{6}
$$

5 SYSTEM OVERVIEW

Benefiting from the rapid development of hardware and software on mobile devices, we are able to collect real-time behavior features to improve the performance of purchasing intent prediction. However, this benefit comes at a price: the high frequency of the real-time features hinders DIPN from being deployed in the industrial environment. If DIPN is deployed at the server side following the traditional cloud-based computing architecture, the high frequency of the features used for prediction would result in a communication cost unbearable to both the servers and the smartphones. To solve this problem, we propose to deploy our DIPN model on the mobile devices following the idea of edge computing, which is defined as a distributed computing paradigm in which computation is largely or completely performed on distributed device nodes known as smart devices or edge devices.

The overall structure of our prediction system deployed in a large-scale e-commerce platform, namely Taobao, is illustrated in Figure 3. The procedure is as follows. We first train our DIPN model in the cloud on powerful servers, and then use AliNN, which is Alibaba's solution for deploying machine learning models on mobile devices, to compress DIPN (with a size of 2MB) and deploy the compressed model on the devices. After that, the compressed DIPN can directly use the collected real-time features on a mobile device for prediction. Only the prediction results are sent to the cloud, where they are stored in an online graph storage system and can be used to provide services for customers later. If we need to update DIPN, we only need to collect the training data from the devices and re-deploy the updated model to the devices.

Figure 3: System overview under edge computing.

There are several advantages to deploying DIPN on the mobile devices. First, it can reduce the communication cost between the cloud and the devices significantly. During the Alibaba 2018 Double 11 Shopping Festival, DIPN served more than 10 million customers without suffering from the traditional peak traffic problem of the platform on that day. Second, it can greatly increase the response speed of DIPN by moving the model to the data instead of the data to the model. The response time of DIPN is between 20 and 50 ms on different devices, which is immune to the influence of the network traffic and improves application performance. Third, it can well protect the user privacy, because only the prediction scores, rather than the features capturing behavior patterns, are sent to the cloud.

6 EXPERIMENTS

In this section, we present a comprehensive evaluation of the performance of DIPN. We first introduce the experimental setup and then present the experimental results under various settings. Finally, we share a case study for online serving.

6.1 Experimental Setup

Dataset Statistics. We conduct the experiments on a large-scale industrial dataset collected from Taobao. The dataset contains normal users' daily interaction information when using our app, and consists of four subsets: the swipe-interactive behavior, the tap-interactive behavior, the browse-interactive behavior and the user profiles. We collect 800,000 users' behavior for two weeks. For each user, we randomly truncate about 400 groups of samples on average. In total, we obtain 300 million groups of samples. Each group contains the user profile feature, the user history feature, and the three sequences with length 256 (padding 0 for short ones). Note that within each group, the timestamps of the last actions in the three sequences are the same. The real-time label and long-term label are then tagged for each group based on the timestamp of the last action. We use samples in the first 13 days for training and samples in the last day for evaluation.

Compared Methods. We compare DIPN to the state-of-the-art approaches in purchasing intent prediction. Besides, we conduct experiments to verify the effect of each component in DIPN. In the following, we introduce the compared methods briefly.

• GBDT [15]: A competitive gradient boosting model widely used in industrial environments. For a fair comparison, besides the session features used in [15], we also add the user history feature and user profile feature to the input of the model. Our goal is to see the benefit of using the new touch-interactive behavior in predicting user purchasing intent.
• RNN+DNN [19]: A bidirectional RNN is used to model the dependency between the browse-interactive actions. Similar to GBDT, we modify RNN by adding the user history feature and user profile feature with DNN.
• DIPN-early-fusion: Early fusion is applied in DIPN by first aligning the three sequences by timestamp and then feeding them into a single Bi-GRU layer. As a result, only the intra-view attention mechanism is adopted.
• DIPN-no-attention: A sub model of DIPN without using the hierarchical attention mechanism. The outputs from the RNN layer are simply concatenated.
• DIPN-no-inter-view-attention: A sub model of DIPN by removing the inter-view attention mechanism.
• DIPN-no-intra-view-attention: A sub model of DIPN by removing the intra-view attention mechanism.
• DIPN-no-multi-task: A sub model of DIPN without using multi-task learning.

Experimental Details. DIPN is trained with SGD, using the Adam optimizer [12] with initial hyper-parameters of $\epsilon = 10^{-3}$, $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The dimension of the hidden state in the Bi-GRU model is set to $d = 32$. We train DIPN using a distributed


TensorFlow with 1 parameter server and 100 workers. The metric used in our experiments is Area Under the Curve (AUC), which is insensitive to class imbalance and suitable for our experiments.

6.2 Experimental Results

Results of different models. Table 4 shows the performance of the evaluated models. We have the following observations. (1) DIPN outperforms the baseline methods GBDT and RNN+DNN by a significant margin of about 5.6% and 5.3% in absolute AUC, respectively. The improvement of DIPN over GBDT and RNN+DNN reveals the value of adopting the touch-interactive behavior to depict users from different views. (2) The early fusion manner is not appropriate for fusing views from different data sources. We can see that DIPN-early-fusion performs worst among the compared models. The reason is that the early fusion method could result in the imbalance of different views and information loss. (3) The hierarchical attention mechanism plays an important role in DIPN. As shown, DIPN-no-inter-view-attention and DIPN-no-intra-view-attention are superior to DIPN-no-attention but inferior to DIPN. This indicates that the intra-view attention and inter-view attention mechanisms are effective in identifying important actions within a view and discovering useful asynchronous interactions between views, respectively. (4) Prediction performance can be further improved by utilizing multi-task learning. Table 5 shows the results of DIPN with or without multi-task learning. As shown, the AUC of real-time and long-term purchasing intent prediction improves by about 0.6% and 0.7% (absolute), respectively, when using multi-task learning.

Table 4: Comparison of different models.

Model                           AUC
GBDT                            0.7871
RNN+DNN                         0.7902
DIPN-early-fusion               0.7708
DIPN-no-attention               0.8345
DIPN-no-inter-view-attention    0.8367
DIPN-no-intra-view-attention    0.8401
DIPN-no-multi-task              0.8371
DIPN                            0.8429

Table 5: Impact of multi-task learning.

Model                 AUC (real-time)    AUC (long-term)
DIPN-no-multi-task    0.8371             0.8204
DIPN                  0.8429             0.8276

Impact of different sources. In this paper, DIPN predicts the real-time purchasing intent by utilizing multiple data sources simultaneously. To better understand the role that different data sources play, we conduct two types of tasks using DIPN. The first task is to predict the purchasing intent with one data source removed at a time, while the second task uses only one single data source for prediction. The results are shown in Table 6. We can see that each data source has a positive impact on the performance of DIPN. The user profile feature performs worst in the second task, which is as expected because it only provides basic information about a user. However, it increases AUC by about 0.5% (absolute) when used together with other data sources, because it improves personalization in DIPN. The tap-interactive behavior plays a more significant role than the other behaviors. The reason is that it captures more real-time behavior patterns compared with the browse-interactive behavior and contains richer semantic information compared with the swipe-interactive behavior. It should be noted that the user history feature also contributes a lot to the performance of DIPN, demonstrating that the activeness of users has a great impact on their purchasing behavior. As shown, by utilizing all the data sources listed in our paper, DIPN improves AUC by about 18.96% in absolute terms (0.8429 vs. 0.6533 in Table 6) over the baseline using only traditional user behavior sequences.

Table 6: Impact of different sources.

Task 1 (one source removed)    AUC       Task 2 (single source only)    AUC
DIPN w/o profile               0.8381    DIPN w/ profile                0.5419
DIPN w/o history               0.7862    DIPN w/ history                0.7335
DIPN w/o browse                0.8303    DIPN w/ browse                 0.6533
DIPN w/o swipe                 0.8287    DIPN w/ swipe                  0.6742
DIPN w/o tap                   0.7978    DIPN w/ tap                    0.7418

6.3 Online A/B Testing

Coupon allocation is an important strategy for improving the Gross Merchandise Volume (GMV) on e-commerce platforms. In this section, we introduce a new coupon allocation strategy based on the real-time purchasing intent predicted by DIPN in the online traffic of Taobao. The online A/B testing was conducted at "Double 11" in 2018, which is a shopping festival in China, similar to "Black Friday" in America.

We choose a coupon with a nominal value of 10 RMB for our testing and set three coupon allocation strategies to compare the performance, defined as follows:

• All-allocation Strategy, where everyone in this bucket is selected to get this coupon.
• Non-allocation Strategy, where no one in this bucket is selected to get this coupon.
• Model-allocation Strategy, which uses the score predicted by DIPN and the fixed thresholds to decide the allocation.

The users selected to get this coupon are shown a popup in Taobao's mobile application.

We use the usage rate of coupons $R_c$ and the GMV improvement per coupon $I_{gmv}$ as evaluation metrics, defined as follows:

$$
R_c = \frac{N_{wb}}{N_b}, \tag{7}
$$

where $N_{wb}$ is the number of users who have used this coupon to buy something in a bucket $b$, and $N_b$ is the total number of users who have got this coupon in $b$.

$$
I_{gmv} = \frac{G_b - \frac{N_b}{N_{non}} G_{non}}{N_{wb}} = \frac{N_{non} G_b - N_b G_{non}}{N_{non} N_{wb}}, \tag{8}
$$

where $G_b$ is the total GMV of users in a bucket $b$, $N_b$ is the number of users in $b$, and $G_{non}$ and $N_{non}$ are the total GMV of users and the number of users in the non-allocation strategy bucket, respectively.
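As an illustration of the two evaluation metrics, the following minimal Python sketch computes the coupon usage rate $R_c$ of Eq. (7) and the GMV improvement per coupon $I_{gmv}$ of Eq. (8). The function names and all numbers in the usage example are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the online test.

```python
def coupon_usage_rate(n_wb: int, n_b: int) -> float:
    """Eq. (7): R_c = N_wb / N_b."""
    if n_b <= 0:
        raise ValueError("n_b must be positive")
    return n_wb / n_b


def gmv_improvement_per_coupon(
    g_b: float, n_b: int, n_wb: int, g_non: float, n_non: int
) -> float:
    """Eq. (8): I_gmv = (G_b - (N_b / N_non) * G_non) / N_wb."""
    if n_non <= 0 or n_wb <= 0:
        raise ValueError("n_non and n_wb must be positive")
    # GMV the bucket would be expected to produce without coupons,
    # scaled from the non-allocation bucket to the size of bucket b.
    expected_gmv = g_non * n_b / n_non
    return (g_b - expected_gmv) / n_wb


# Hypothetical bucket: 1000 users received the coupon, 200 of them used it.
r_c = coupon_usage_rate(n_wb=200, n_b=1000)

# Hypothetical GMV figures (in RMB) for bucket b and the non-allocation bucket.
i_gmv = gmv_improvement_per_coupon(
    g_b=60000.0, n_b=1000, n_wb=200, g_non=100000.0, n_non=2000
)

print(f"R_c = {r_c:.2f}, I_gmv = {i_gmv:.2f} RMB per used coupon")
```

Note that $I_{gmv}$ is undefined when no coupon is used in a bucket ($N_{wb} = 0$), which is why the sketch rejects that case instead of returning a value.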


