
Accurate and Robust Eye Contact Detection During Everyday Mobile Device Interactions

Mihai Bâce, Sander Staal (Department of Computer Science, ETH Zürich; {mbace, staals}@inf.ethz.ch)
Andreas Bulling (Institute for Visualisation and Interactive Systems, University of Stuttgart; [email protected])

arXiv:1907.11115v1 [cs.HC] 25 Jul 2019
ABSTRACT
Quantification of human attention is key to several tasks in mobile human-computer interaction (HCI), such as predicting user interruptibility, estimating noticeability of user interface content, or measuring user engagement. Previous works to study mobile attentive behaviour required special-purpose eye tracking equipment or constrained users' mobility. We propose a novel method to sense and analyse visual attention on mobile devices during everyday interactions. We demonstrate the capabilities of our method on the sample task of eye contact detection that has recently attracted increasing research interest in mobile HCI. Our method builds on a state-of-the-art method for unsupervised eye contact detection and extends it to address challenges specific to mobile interactive scenarios. Through evaluation on two current datasets, we demonstrate significant performance improvements for eye contact detection across mobile devices, users, or environmental conditions. Moreover, we discuss how our method enables the calculation of additional attention metrics that, for the first time, enable researchers from different domains to study and quantify attention allocation during mobile interactions in the wild.

Figure 1. We present a method to quantify users' attentive behaviour during everyday interactions with mobile devices using their integrated front-facing cameras. We evaluate our method on the sample task of eye contact detection and discuss several advanced attention metrics enabled by our method, such as the number of glances, the number of attention shifts, or the average duration of attention span.
ACM Classification Keywords
H.5.m. Information Interfaces and Presentation (e.g. HCI): Miscellaneous

Author Keywords
Mobile Phone; Eye Contact Detection; Appearance-Based Gaze Estimation; Attentive User Interfaces

INTRODUCTION
With an ever-increasing number of devices competing for it, developing attentive user interfaces that adapt to users' limited visual attention has emerged as a key challenge in human-computer interaction (HCI) [42, 5]. This challenge has become particularly important in mobile HCI, i.e. for mobile devices used on the go, in which attention allocation is subject to a variety of external influences and highly fragmented [28, 39].

Consequently, the ability to robustly sense attentive behaviour has emerged as a fundamental requirement for predicting interruptibility (i.e. identifying opportune moments to interrupt a user) [6, 12, 29], estimating the noticeability of user interface content such as notifications [31], and for measuring fatigue, boredom [32], or user engagement [24].

Previous methods to sense mobile user attention either required special-purpose eye tracking equipment [39] or limited users' mobility [33, 7], thereby limiting the ecological validity of the obtained findings. The need to study attention during everyday interactions has triggered research on using device interactions or events as a proxy to user attention, i.e. assuming that attention is on the device whenever the screen is on (e.g. Apple Screen Time) or whenever touch events [44, 26], notifications [30], or messages [10] occur. While proxy methods facilitate daily-life recordings, it is impossible to know whether users actually looked at their device, and resulting attention metrics are therefore unreliable. One solution to this problem is manual annotation of attentive behaviour using video recordings [28], but this approach is tedious, time-consuming, and impractical for large-scale studies.

In this work we instead study mobile attention sensing using off-the-shelf cameras and appearance-based gaze estimation based on machine learning. This approach has a number of advantages. First, it does not require special-purpose eye tracking equipment, given that front-facing cameras are readily integrated into an ever-increasing number of mobile devices and offer increasingly high resolution images. Second, our approach enables recordings of attention allocation in-situ, i.e. during interactions that users naturally perform in their daily life. Third, in combination with recent advances in machine learning methods for appearance-based gaze estimation [55, 52] and device-specific model adaptation [49], our approach not only promises a new generation of mobile gaze-based interfaces [18] but also a direct (no proxy required) and fully automatic (no manual annotation required) means to sense user attention during mobile interactions.
We extend a recent method for unsupervised eye contact detection in stationary settings [50] and address challenges specific to mobile interaction scenarios: We use a multi-task CNN for robust face detection [47] even on partially visible faces, which is a key challenge in mobile settings [18]. We further combine a state-of-the-art hourglass neural network architecture [8] with a Kalman filter for more accurate facial landmark detection and head pose estimation. Reliable head pose estimates are particularly critical in mobile settings given the large variability in head poses. We finally normalize the images [51] and train an appearance-based gaze estimator on the large-scale GazeCapture dataset [20].

The specific contributions of this work are threefold. First, we present the first method to quantify human attention allocation during everyday mobile phone interactions. Our method addresses key challenges specific to mobile settings. Second, we evaluate our method on the sample use case of eye contact detection and show that our method significantly outperforms the state of the art and is robust to the significant variability caused by mobile settings with respect to users, mobile devices, and daily-life situations on two publicly available datasets [19, 13]. Third, we present a set of attention metrics enabled by our method and discuss how our method can be used as a general tool to study and quantify attention allocation on mobile devices in-situ.

RELATED WORK
Our work is related to previous works on (1) user behavior modeling on mobile devices, (2) attention analysis in mobile settings, and (3) eye contact detection.

User Behavior Modeling on Mobile Devices
Over the years, smartphones have become more powerful, feature-rich, miniaturised computers. Having such devices with us all the time has implications and our usage patterns have changed significantly. A study shows that the nature of attentional resources on mobile devices has become highly fragmented and can last for as little as 4 seconds [28]. This conclusion is similar to what Karlson et al. identified by looking at task disruption and the barriers faced when performing tasks on their mobile device [17]. Smartphone overuse can have negative consequences in young adults and could lead to sleep deprivation and attention deficits [21]. With such changes in the interaction patterns, it has become highly relevant to study and model user behaviour and visual attention.

Sensor-rich mobile devices enable us to collect data and build models with applications in many different domains. A significant area of research is concerned with interruptibility or predicting the opportune moments to deliver messages and notifications. Mehrotra et al. investigated people's receptivity to mobile notifications [25]. A different study measured the effects of interrupting a user while performing a task and then evaluated task performance, emotional state, and social attribution [2]. While many approaches only look at the immediate past for predicting interruptibility, Choy et al. proposed a method which also looks at a longer history of up to one day [6]. In a Wizard of Oz study, Hudson et al. analysed which sensors are useful in predicting interruptibility [15]. The way users interact with a certain device does not only depend on the content or the application used, but could be affected by the environment. Smartphone usage and interruptibility can also depend on the user's location or social context [12, 11].

Besides looking at interruptibility, others have used attention to model different behavioural traits. Pielot et al. tried to predict attentiveness to mobile instant messages [31]. User engagement can be analysed by collecting, for example, EEG data [23] or by looking at visual saliency and how this affects different engagement metrics [24]. Toker et al. investigate engagement metrics in visualisation systems which can adapt to individual user characteristics [40]. Alertness, another indicator for attention, can be monitored continuously and unobtrusively [1]. Such characteristics can be used to better understand user attention patterns.

Attention Analysis in Mobile Settings
Previous research on user behaviour, human attention, or modelling behavioural traits has not focused on mobile devices. Given their popularity, understanding, detecting, modelling, and predicting human attention has developed into a new area of research. VADS explored the possibility of smartphone-based detection of the user's visual attention [16]. Users had to look at the intended object and hold the device so that the object, as well as the user's face, could be simultaneously captured by the front and rear cameras. In an analysis task, knowing where users direct their attention might be sufficient; however, to fulfill the vision of pervasive attentive user interfaces [5], a system needs to predict where the user's attention will be. Steil et al. proposed an approach to forecast visual attention by leveraging a wearable eye tracker and device-integrated sensors [39]. Another approach is to anticipate the user's gaze with Generative Adversarial Networks (GANs) applied to egocentric videos [48]. Attention allocation and modelling user attention go beyond research and lab studies. With iOS version 12, Apple has released a built-in feature called Screen Time which measures the amount of time the screen is on and presents usage statistics. A similar app from Google for Android is Digital Wellbeing. Such applications provide interesting insights into one's own usage; however, they are rather naive and always assume the users' attention when the screen is on.

Eye Contact Detection
Unlike gaze estimation, which regresses the gaze direction, eye contact detection is a binary classification task, i.e. detecting whether someone is looking at a target object or not. The first works in this direction used LEDs attached to the target object to detect whether users were looking at the camera or not [42, 37, 9]. Selker et al. proposed a glass-mounted device which transmitted the user ID to the gaze target object [36]. These methods require dedicated eye contact sensors and cannot be used with unmodified mobile devices.

Recent works focused on using only off-the-shelf cameras for eye contact detection. Smith et al. proposed GazeLocking [38], a simple supervised appearance-based classification method for sensing eye contact.
Ye et al. used head-mounted wearable cameras and a learning-based approach to detect eye contact [46]. With recent advances in appearance-based gaze estimation [20, 54, 34, 53, 55], Zhang et al. proposed a full-face gaze estimation method [54] and introduced an unsupervised approach to eye contact detection in stationary settings based on it [50]. In their approach, during training, the gaze samples in the camera plane were clustered to automatically infer eye contact labels. Extending this method, Mueller et al. [27] proposed an eye contact detection method which additionally correlates people's gaze with their speaking behaviour by leveraging the fact that people often tend to look at the person who is speaking. All these methods were limited to stationary settings and assumed that the camera always has a clear view of the user. Only few previous works focused on gaze estimation and interaction on mobile devices, but either their performance and robustness was severely limited [45, 14] or they were studied in highly controlled and simplified laboratory settings [41]. As demonstrated in previous works [19], these assumptions no longer hold when using the front-facing camera of mobile devices.

METHOD
To detect whether users are looking at their mobile device or not, we extended the unsupervised eye contact detection method proposed by Zhang et al. [50] to address challenges specific to mobile interactive scenarios. The main advantage of this method is the ability to automatically detect the gaze target in an unsupervised fashion, which eliminates the need for manual data annotation. The only assumption of this approach is that the camera needs to be mounted next to the target object. This assumption is still valid in our use case, since the front-facing camera is always next to the device's display.

Figure 2 illustrates our method. During training, the pipeline first detects the face and facial landmarks of the input image. Afterwards, the image is normalized by warping it to a normalized space with fixed camera parameters and is fed into an appearance-based gaze estimation CNN. The CNN infers the 3D gaze direction and the on-plane gaze location. By clustering the gaze locations of the different images, the samples belonging to the cluster closest to the origin of the camera coordinate system are labeled with positive eye contact labels and all other samples are labeled with negative non-eye contact labels. The labeled samples are then used to train a binary support vector machine (SVM), which uses 4096-dimensional face-feature vectors to predict eye contact. For inference, the gaze estimation CNN extracts features from the normalized images, which are then classified by the trained SVM.

Face Detection and Alignment
Images taken from the front-facing camera of mobile devices in the wild often contain large variation in head pose, and only parts of the face or facial landmarks may be visible [19]. To address this challenge specific to mobile scenarios, we use a more robust face detection approach which consists of three multi-task deep convolutional networks [47]. In case of multiple faces, we only keep the face with the largest bounding box, since we assume that only one user at a time is using the mobile device. If the detector fails to detect any face, we automatically predict this image to have no eye contact. After detecting the face bounding box, it is particularly important to accurately locate the facial landmarks since these are used for head pose estimation and image normalization. For additional robustness, we use a state-of-the-art hourglass model [8] which estimates the 2D positions of 68 different facial landmarks.

Head Pose Estimation and Data Normalization
The facial landmarks obtained from the previous step are used to estimate the 3D head pose of the detected face by fitting a generic 3D facial shape model. In contrast to Zhang et al., who used a facial shape model with six 3D points (four from the two eye corners and two from the mouth), we instead used a model with all 68 3D points [4], which is more robust for extreme head poses, often the case in mobile settings. We first estimate an initial solution by fitting the model using the EPnP algorithm [22] and then further refine this solution with a Levenberg-Marquardt optimization. The final estimate is stabilized with a Kalman filter. The PnP problem typically assumes that the camera which captured the image is calibrated. However, since we do not know the calibration parameters of the different front-facing cameras of the mobile devices (nor do we want to enforce this requirement due to the overhead of calibrating every camera), we approximated the intrinsic camera parameters with default values.

Once the 3D head pose is estimated, the face image is warped and cropped as proposed by Zhang et al. [51] to a normalized space with fixed parameters. The benefit of this normalization is the ability to handle variations in hardware setups as well as variations due to different shapes and appearance of the face. For this, we define the head coordinate system in the same way as proposed by the authors: The head is defined based on a triangle connecting the three midpoints of the eyes and mouth. The x-axis is defined to be the direction of the line connecting the midpoint of the left eye with the midpoint of the right eye. The y-axis is defined to be the direction from the eyes to the midpoint of the mouth and lies perpendicular to the x-axis within the triangle plane. The remaining z-axis is perpendicular to the triangle plane and points towards the back of the face. In our implementation, we chose the focal length of the normalized camera to be 960, the normalized distance to the camera to be 300 mm, and the normalized face image size to be 448 x 488 pixels.

Gaze Estimation
We use a state-of-the-art gaze estimator based on a convolutional neural network (CNN) [54] to estimate the 3D gaze direction. Besides the gaze vector, the CNN also outputs a 4096-dimensional feature vector, which comes from the last fully-connected layer of the CNN. This face feature vector will later be used as input for the eye contact detector. Given that our method was designed for robustness on images captured with mobile devices, we trained our model on the large-scale GazeCapture dataset [20]. This dataset consists of 1,474 different users and around 2.5 million images captured using smartphone and tablet devices. Our trained model achieves a within-dataset angular error of 4.3° and a cross-dataset angular error of 5.3° on the MPIIFaceGaze dataset [54], which is comparable to current gaze estimation approaches.
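As a concrete illustration of the head pose estimation step described above, the following sketch fits a 68-point face model with EPnP and refines the solution with Levenberg-Marquardt using OpenCV. It is a minimal sketch under assumptions, not the exact implementation: the landmark array and 3D face model are assumed inputs, the intrinsics are a default approximation of the kind mentioned above, `solvePnPRefineLM` requires OpenCV 4.1 or newer, and the Kalman filter used for temporal stabilization is omitted.

```python
import numpy as np
import cv2

def estimate_head_pose(landmarks_2d, face_model_3d, image_size):
    """Rough head pose from 68 detected landmarks (illustrative sketch).

    landmarks_2d: (68, 2) float array of detected landmark positions in pixels.
    face_model_3d: (68, 3) float array of a generic 3D facial shape model.
    image_size: (width, height) of the input frame.
    """
    w, h = image_size
    # Approximate intrinsics with default values for the uncalibrated phone camera:
    # focal length ~ image width, principal point at the image centre.
    camera_matrix = np.array([[w, 0, w / 2.0],
                              [0, w, h / 2.0],
                              [0, 0, 1.0]], dtype=np.float64)
    dist_coeffs = np.zeros(5)  # assume no lens distortion

    # Initial solution with EPnP ...
    ok, rvec, tvec = cv2.solvePnP(face_model_3d, landmarks_2d, camera_matrix,
                                  dist_coeffs, flags=cv2.SOLVEPNP_EPNP)
    # ... refined with Levenberg-Marquardt (OpenCV >= 4.1).
    rvec, tvec = cv2.solvePnPRefineLM(face_model_3d, landmarks_2d, camera_matrix,
                                      dist_coeffs, rvec, tvec)
    return rvec, tvec  # head rotation (Rodrigues vector) and translation
```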
Figure 2. Method overview. Taking images from the front-facing camera of a mobile device, our method first uses a multi-task CNN for face detection (a) and a state-of-the-art hourglass NN to detect 68 facial landmarks (b). Then, we estimate the head pose using a 68 3D-point facial landmark model and stabilize it with a Kalman filter (c), a requirement in mobile settings. We then normalize and crop the image (d) and feed it into an appearance-based gaze estimator to infer the gaze direction (e). If the estimated head pose exceeds a certain threshold, we use the head pose instead of the gaze direction. We then cluster the gaze locations in the camera plane (f) and create the training labels (g) for the eye contact detector. The weighted SVM eye contact detector is trained with features extracted from the gaze estimation CNN.
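To make the label-generation and classification stages of this pipeline concrete, the sketch below mirrors them with scikit-learn. It is an illustrative sketch under assumptions, not the authors' released code: the on-plane gaze locations and 4096-dimensional CNN features are assumed to come from the earlier stages, and the OPTICS parameters, PCA dimensionality, and SVM settings are example choices.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_eye_contact_detector(gaze_xy, cnn_features):
    """Unsupervised label generation followed by weighted SVM training (sketch).

    gaze_xy: (N, 2) estimated on-plane gaze locations in the camera plane.
    cnn_features: (N, 4096) face features from the gaze estimation CNN.
    """
    # Cluster the 2D gaze locations; each cluster corresponds to one gaze target.
    clusters = OPTICS(min_samples=20).fit_predict(gaze_xy)

    # Pick the cluster whose centroid is closest to the camera (origin):
    # with the camera mounted next to the screen, it corresponds to the device.
    valid_ids = [c for c in np.unique(clusters) if c >= 0]  # -1 marks noise
    if not valid_ids:
        raise ValueError("clustering found no gaze target")
    centroids = {c: gaze_xy[clusters == c].mean(axis=0) for c in valid_ids}
    eye_contact_id = min(centroids, key=lambda c: np.linalg.norm(centroids[c]))
    labels = (clusters == eye_contact_id).astype(int)  # 1 = eye contact

    # Reduce the 4096-d features with PCA (keep 95% of the variance), then train
    # a class-weighted SVM so the rarer non-eye-contact class is not ignored.
    detector = make_pipeline(PCA(n_components=0.95),
                             SVC(kernel="linear", class_weight="balanced"))
    detector.fit(cnn_features, labels)
    return detector

# Inference: detector.predict(features) returns 1 (eye contact) or 0 (non eye contact).
```

The class weighting matters because, in everyday recordings, eye contact frames heavily outnumber non-eye-contact frames (for example, 83% vs. 17% in the UFEV data used later).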

To overcome inaccurate or incorrect gaze estimates caused by extreme head poses, we propose the following adaptive thresholding mechanism: Whenever the pitch of the estimated head pose is outside the range [−θ, θ], or the yaw outside [−φ, φ], we use the head pose instead of the estimated gaze vector as a proxy for gaze direction. More specifically, we assume that the gaze direction is the z-axis of the head pose. In practice, we set a value of 40° for both θ and φ.

Together with the estimated 3D head pose, the gaze direction can be converted to a 2D gaze location in the camera image plane. We assume that each gaze vector in the scene originates from the midpoint of the two eyes. This midpoint can easily be computed in the camera coordinate system, since the 3D head pose has already been estimated in an earlier step of the pipeline. Given that the image plane is equivalent to the xy-plane of the camera coordinate system, the on-plane gaze location can be calculated by intersecting the gaze direction with the image plane.

Clustering and Eye Contact Detection
After estimating the on-plane gaze locations for the provided face images, these 2D locations are sampled for clustering. Similarly, as proposed by Zhang et al. [50], we assume that each cluster corresponds to a different eye contact target object. In our case, the cluster closest to the camera (i.e., closest to the origin) corresponds to looking at the mobile device. To filter out unreliable samples, we skip images for which the confidence value reported by the face detector is below a threshold of 0.9. Clustering of the remaining samples is done using the OPTICS algorithm [3]. As a result of clustering, all the images which belong to the cluster closest to the camera are labeled as positive eye contact samples.

Finally, taking the labeled samples from the clustering step, we train a weighted SVM based on the feature vector extracted from the gaze estimation CNN. To reduce the dimensionality of these high-dimensional feature vectors, we first apply a principal component analysis (PCA) to the entire training data and reduce the dimension so that the new subspace still retains 95% of the variance of the data. At test time, the clustering phase is no longer necessary. In this case, the 4096-dimensional feature vector is extracted from the appearance-based gaze estimation model and projected into the low-dimensional PCA subspace. With the trained SVM, we can then classify the resulting feature vector as eye contact or non eye contact.
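The conversion from a 3D gaze (or head pose proxy) direction to a 2D location in the camera plane described above reduces to a ray-plane intersection. A minimal sketch, assuming the gaze origin is the eye midpoint in camera coordinates and the camera plane is the xy-plane at z = 0:

```python
import numpy as np

def on_plane_gaze_location(eye_midpoint, gaze_dir):
    """Intersect the gaze ray with the camera xy-plane (z = 0).

    eye_midpoint: (3,) gaze origin in camera coordinates (z > 0 in front of the camera).
    gaze_dir: (3,) gaze direction; it must point towards the camera plane (negative z).
    Returns the (x, y) gaze location in the camera plane, or None if the ray
    never reaches the plane.
    """
    if abs(gaze_dir[2]) < 1e-6:
        return None  # ray (almost) parallel to the camera plane
    t = -eye_midpoint[2] / gaze_dir[2]
    if t < 0:
        return None  # the ray points away from the plane
    hit = eye_midpoint + t * np.asarray(gaze_dir)
    return hit[:2]
```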
Figure 3. Sample results for eye contact detection on images from the two datasets, MFV and UFEV. The first row shows the input image; the second row the detected face (in yellow) and facial landmarks (in red); the third row shows the estimated head pose and gaze direction (purple); the fourth row shows the eye contact detection result, green for eye contact, red for non eye contact. Columns (1-9) illustrate how our method works across different users, head pose angles, illumination conditions, or when the face is partially visible. Our method can fail if the gaze estimates are inaccurate (10), the eyes are closed (11), or if the face detector fails (12).

EVALUATION
We evaluated our method on the sample task of eye contact detection, a long-standing challenge in attentive user interfaces that has recently received renewed attention in the research community [50, 27, 46, 9] but so far remains largely unexplored in mobile HCI. We conducted experiments on two challenging, publicly available datasets with complementary characteristics in terms of users, devices, and environmental conditions (see Figure 3): the Mobile Face Video (MFV) [13] and the Understanding Face and Eye Visibility (UFEV) dataset [19]. We aimed to investigate the performance of our method on both datasets and to compare it to the state-of-the-art method for eye contact detection in stationary human-object and human-human interactions [50].

Mobile Face Video Dataset (MFV)
This dataset includes 750 face videos from 50 users captured using the front-facing camera of an iPhone 5s. During data collection, users had to perform five different tasks under different lighting conditions (well-lit, dim light, and daylight). From the five tasks available in the dataset, we selected the "enrollment" task where users were asked to turn their heads in four different directions (left, right, up, and down). We picked this task because it enabled us to collect both eye contact and non eye contact data. From this subset (1 video per task × 3 sessions × 50 users), we randomly sampled 4,363 frames that we manually annotated with positive eye contact or negative non eye contact labels. 58% of the frames were labeled as positive and 42% were labeled as negative samples. This dataset is challenging because it contains large variations between users, head pose angles, and illumination conditions.

Understanding Face and Eye Visibility Dataset (UFEV)
This dataset consists of 25,726 in-the-wild images taken using the front-facing camera of different smartphones of ten participants. The images were collected during everyday activities in an unobtrusive way using an application running in the background. We randomly sampled 5,065 images from this dataset and manually annotated them with eye contact labels. We only sampled images where at least parts of the face were visible (which was the case for 14,833 photos). Around 17% of the frames were labeled as negative and 83% were labeled as positive eye contact samples. In contrast to the previous dataset, these samples exhibit a class imbalance between positive and negative labels. This dataset is challenging because the full face is only visible in about 29% of the images.

Baselines
There are different ways to detect eye contact, such as GazeLocking [38] which is fully supervised, or methods that infer the coarse gaze direction [35] or leverage head orientation for visual attention estimation [43]. However, all of these methods are inferior to the state-of-the-art eye contact detector proposed by Zhang et al. [50]. We therefore opted to only compare our method (Ours) to two variants of the latter:

(1) Zhang et al. [50]. Here, we used the dlib¹ CNN face detector, the dlib 68 landmark detector, and we trained a full-face appearance-based gaze estimator on the MPIIFaceGaze dataset [54]. We replicate the original method proposed by the authors.

¹ http://dlib.net/
(2) Zhang et al. + FA. Here, we replace the dlib face and landmark detector. For face detection, we used the more robust approach which leverages three multi-task CNNs [47] and which can detect partially visible faces, a challenge and a requirement in mobile gaze estimation. Similarly, we replaced the landmark detector with a newer approach which uses a state-of-the-art hourglass model [8] to estimate the 2D locations of the facial landmarks. The CNN architecture and trained model were the same as in the first baseline.

In all experiments that follow, we evaluated performance in terms of the Matthews Correlation Coefficient (MCC). The MCC score is commonly used as a performance measure for binary (two-class) classification problems. The MCC is more informative than other metrics (such as accuracy) because it takes into account the balance ratios of the four confusion matrix categories (true positives TP, true negatives TN, false positives FP, false negatives FN). This is particularly important for eye contact detection on mobile devices. For example, in the UFEV dataset, 83% of the manually annotated images are positive eye contact and only 17% represent non eye contact. An MCC of +1.0 indicates perfect predictions, -1.0 indicates total disagreement between predictions and observations, and 0 is the equivalent of random guessing.

Eye Contact Detection Performance
Figure 4 shows the performance comparison of the three methods. Our evaluation was conducted on the two datasets using a leave-one-participant-out cross validation. The bars represent the mean MCC value and the error bars represent the standard deviation across the different runs. As can be seen from the figure, on the MFV dataset our method (MCC 0.84) significantly outperforms both baselines (MCC 0.52 and 0.41). The same holds for the UFEV dataset where Ours (MCC 0.75) shows significantly increased robustness in comparison to Zhang et al. + FA (0.18) and Zhang et al. (0.42). The differences between Ours and the other baselines are significant (t-test, p < 0.01).

Figure 4. Classification performance of the different methods on the two datasets. The bars are the MCC value and the error bars represent the standard deviation from a leave-one-participant-out cross-validation. The transparent bars illustrate the potential performance improvements when assuming perfect clustering.

To better understand the limitations of the clustering and the potential for further improvements, we also analysed the impact of the unsupervised clustering approach on the eye contact classification performance. To eliminate the influence of wrong labels resulting from incorrect clustering, we replaced the estimated labels with the manual ground truth annotations. As such, this defines an upper bound on the classification accuracy given perfect labels.

The transparent bars in Figure 4 show the result of this analysis, i.e. the potential performance increase when using ground truth labels. Despite the improvement of the two baselines, our proposed method is still able to outperform them (an MCC score of 0.88 in comparison to 0.72 for both baselines on the MFV dataset, and 0.73 in comparison to 0.45 and 0.56 on the UFEV dataset). Furthermore, our proposed method is close to the upper bound performance with ground truth information. We believe this difference can be attributed to our gaze estimation pipeline. Due to the improved training steps of the gaze estimator (face and landmark detection, head pose estimation, and data normalization) combined with the GazeCapture [20] dataset, our model can extract more meaningful features from the last fully connected layer of the CNN which, in turn, improves the weighted SVM binary classifier.

Performance of Detecting Non-Eye Contact
The complementary problem to eye contact detection is to identify non eye contact, or when the users look away from the device. In some datasets, there are only a few non eye contact samples (e.g. only 17% in the UFEV dataset). Accurately detecting such events is equally, if not more, important and at the same time significantly more challenging than detecting eye contact events due to their sparsity. One performance indicator in such cases is the true negative rate (TNR). These events are critical in determining whether there was an attention shift from the device to the environment or the other way around. As seen in previous work [39], these events are not only relevant attention metrics but they can be used as part of approaches to forecast user attention (i.e. predict an attention shift before it actually happens).

Table 1 summarises the results of comparing the TNR of the three methods. The TNR measures the proportion of non eye contact (negative) samples correctly identified. On the MFV dataset, our method is able to outperform the two baselines and identify more than twice as many non eye contact samples (21.5% in comparison to 7.6% or 8.8%) and more accurately (TNR of 88%). On the UFEV dataset the number of predicted samples is comparable for all three methods, but our method again significantly outperforms the other two in terms of robustness of identifying non eye contact events (TNR of 74% compared to 51% and 41% achieved by the other methods).
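Both reported quantities can be computed directly from the confusion matrix. A small reference sketch (frame-level 0/1 labels assumed; scikit-learn is only used to tally the confusion matrix):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def mcc_and_tnr(y_true, y_pred):
    """Matthews Correlation Coefficient and true negative rate for 0/1 labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0.0
    return mcc, tnr
```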
Table 1. Classification performance as true negative rate (TNR) for non eye contact detection. A comparison between the ground truth negative (GT) labels distribution and the predicted negative labels distribution (Pred). The number of images used in the evaluation is dependent on the performance of the face detector. On both datasets, our method is able to outperform the two baselines and can correctly detect more non eye contact events.

MFV dataset          # images   GT      Pred    TNR
Zhang et al.         3,663      32.0%   7.6%    40%
Zhang et al. + FA    3,960      36.4%   8.8%    50%
Ours                 3,960      36.4%   21.5%   88%

UFEV dataset         # images   GT      Pred    TNR
Zhang et al.         3,517      16.4%   4.8%    51%
Zhang et al. + FA    4,909      16.3%   9.1%    41%
Ours                 4,909      16.7%   8.9%    74%

Cross-Dataset Performance
In order to realistically assess performance for eye contact detection with a view to practical applications and actual deployments, it is particularly interesting to evaluate the cross-dataset performance. Cross-dataset performance evaluations have only recently started to be investigated in gaze estimation research [55] and, to the best of our knowledge, never for the eye contact detection task. To this end, we first trained on one dataset, either UFEV or MFV, and then evaluated on the other one.

Figure 5 summarises the results of this analysis and shows that our method is able to outperform both baselines by a significant margin, both when training on UFEV and testing on MFV, and vice versa. When training on the UFEV dataset, our method (MCC 0.83) performs better than the two baselines (MCC 0.38 and 0.47). The other way around, training on MFV and testing on UFEV, Ours (0.57) still outperforms Zhang et al. + FA (0.04) and Zhang et al. (0.14). Taken together, these results demonstrate that our method, which we specifically optimised for mobile interaction scenarios, is able to abstract away dataset-specific biases and to generalize well to other datasets. As such, this result is particularly important for HCI practitioners who want to use such a method for real-world experiments on unseen data.

Figure 5. Cross-dataset classification performance of the different methods on the two datasets. In both cases, the three methods were trained on one entire dataset (UFEV or MFV) and tested on the other. The bars represent the MCC value. Our method is able to better abstract away data-specific biases which is important for in-the-wild studies.

The Influence of Head Pose Thresholding
In order to reduce the impact of incorrect or inaccurate gaze estimates on eye contact detection performance, in our method we introduced a thresholding step based on the head pose angle. Current datasets [20, 54] have improved the state of the art in appearance-based gaze estimation significantly; however, they offer limited head pose variability when compared to data collected in the wild (see Figure 6). Like in many other areas in computer vision, this fundamentally limits the performance of learning-based methods. In our method, we train our model on the GazeCapture dataset which, currently, is the largest publicly available dataset for gaze estimation. Still, both the MFV and UFEV datasets show larger head pose variability.

Figure 6. The distribution of the head pose angles (pitch and yaw) in the normalized space for the GazeCapture, MFV, and UFEV datasets. Images collected during everyday activities (MFV and UFEV) exhibit a larger head pose variability than gaze estimation datasets like GazeCapture [20].

To overcome the above limitation, we apply the following adaptive thresholding technique: Whenever the horizontal or vertical head pose angle is below or above a certain threshold, we replace the gaze estimates by the head pose angles. This adaptive thresholding technique happens in the normalized space [51], thus only two threshold values are necessary, one vertical and one horizontal.

Given the distribution of the GazeCapture training data (see Figure 6), in our approach we empirically determined a threshold of 40° for the head pose in the normalized camera space. That is, whenever the head pose angle of either component (vertical or horizontal) is above or below this threshold, we use the head pose angles as a proxy for gaze estimates.

Table 2 shows the results of an ablation study with two other versions of our pipeline: The Gaze only (MCC 0.74 on MFV and 0.30 on UFEV) baseline does not use any thresholding. The Head pose only (MCC 0.37 on MFV and 0.73 on UFEV) baseline replaces all the gaze estimates with head pose estimates. The results show that Gaze only or Head pose only can yield reasonable performance for individual datasets. However, only our method (MCC 0.86 on MFV and 0.76 on UFEV) is able to perform well on both datasets, outperforming both baselines. This result also shows that, since this value is set in the normalized camera space, the same threshold value is effective across datasets.
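A minimal sketch of this adaptive thresholding rule as evaluated above, applied per image in the normalized camera space. The 40° threshold follows the text; the inputs and the sign convention for the head z-axis are assumptions of this sketch rather than the authors' exact code:

```python
import numpy as np

PITCH_YAW_THRESHOLD = np.radians(40.0)

def select_gaze_direction(gaze_dir, head_rotation, head_pitch, head_yaw):
    """Fall back to the head pose when the normalized-space head angles are extreme.

    gaze_dir: (3,) gaze direction from the appearance-based estimator.
    head_rotation: (3, 3) head rotation matrix in the normalized space.
    head_pitch, head_yaw: head pose angles (radians) in the normalized space.
    """
    if abs(head_pitch) > PITCH_YAW_THRESHOLD or abs(head_yaw) > PITCH_YAW_THRESHOLD:
        # Use the head z-axis as a gaze proxy. With the head coordinate system
        # defined in the normalization step, the z-axis points towards the back
        # of the head, so it is negated here to point out of the face.
        return -head_rotation[:, 2]
    return gaze_dir
```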
Table 2. Performance (Matthews Correlation Coefficient) of the three different head pose thresholding techniques on both datasets. Gaze only uses no thresholding, Head pose only replaces all the gaze estimates by head pose estimates, and Ours replaces the gaze estimates by head pose estimates whenever the pitch or the yaw is below or above a threshold.

                           MFV    UFEV
Gaze only                  0.74   0.30
Head pose only             0.37   0.73
Ours (Pitch = Yaw = 40°)   0.86   0.76

Robustness to Variability in Illumination
Given that unconstrained mobile eye contact detection implies different environments and conditions, we analysed how varying illumination affected our method's performance in comparison to the two baselines (see Figure 7). In this evaluation, we trained all three methods on the UFEV dataset and evaluated their performance in three different scenarios on a subset from the MFV dataset for which we had both eye contact and illumination labels: dim light (986 images), well-lit (2,157 images), and daylight (1,221 images). Our method clearly outperforms the two baselines in all three scenarios (0.67 vs. 0.50 for dim light, 0.88 vs. 0.46 for well-lit, and 0.86 vs. 0.47 for daylight against the best baseline, Zhang et al. [50]). The Zhang et al. + FA baseline is inferior to Zhang et al. because it uses the improved face and landmark detector which detects more challenging images otherwise skipped in the evaluation with the Zhang et al. baseline.

Figure 7. Robustness evaluation of the three methods across different illumination conditions. All three methods were trained on the UFEV dataset and evaluated on the MFV dataset (cross-dataset) in 3 different lighting conditions: dim light, well-lit, and daylight. The bars represent the MCC value. While there is a certain performance drop in dim lighting conditions, our method is consistently more robust and outperforms the two other baselines.

DISCUSSION
Our evaluations show that our method not only significantly outperforms the state of the art in terms of mobile eye contact detection performance both within- and cross-dataset (see Figure 4, Figure 5, and Table 1) but also in terms of robustness to variability in illumination conditions (see Figure 7). These results, combined with the evaluations on head pose thresholding, also demonstrate the unique challenges of the mobile setting as well as the effectiveness of the proposed improvements to the method by Zhang et al. [50].

One of the most important applications enabled by our eye contact detection method on mobile devices is attention quantification. In contrast to previous works that leveraged device interactions or other events as a proxy to user attention, our method can quantify attention allocation unobtrusively and robustly, only requiring the front-facing cameras readily integrated in an ever-increasing number of devices. Being able to accurately and robustly sense when users look at their device, or when they look away, is a key building block for future pervasive attentive user interfaces (see Figure 8).

As a first step, in this work we focused on the sample task of eye contact detection. It is important to note that our method allows us to automatically calculate additional mobile attention metrics (see Figure 8) that pave the way for a number of exciting new applications in mobile HCI. The first metric that can be calculated is the number of glances, which indicates how often a user has looked briefly at their mobile device. A metric which considers how long users look at their device is the average attention span. In Figure 8, the average attention span towards the device is given by the average duration of the black boxes and the average attention span towards the environment is given by the duration of the white boxes. Other attention metrics were recently introduced by Steil et al. [39] in the context of attention forecasting. One such metric is the primary attentional focus: By aggregating and comparing the duration of all attention spans towards the mobile device as well as the environment, we can decide whether the users' attention during the analyzed time interval is primarily towards the device or towards the environment. Besides aggregating, the shortest or the longest attention span might also reveal insights into users' behaviour. Finally, the number of attention shifts can capture the users' interaction with the environment. An attention shift occurs when users shift their attention from the device to the environment or the other way around.

An analysis which quantifies attentive behaviour with some of the metrics described previously is only the first step. Mobile devices are powerful sensing platforms equipped with a wide range of sensors besides the front-facing camera, and a user's context might provide additional behavioral insights. Future work could compare attention allocation relative to the application running in the foreground on the mobile device. Such an analysis could reveal, for example, differences (or similarities) in attentive behaviour when messaging, when using social media, or when browsing the internet. A different analysis could factor in the user's current activity (attention allocation while taking the train, walking, or standing) or the user's location. Going beyond user context, attention allocation could even be conditionally analysed on demographic factors such as age, sex, profession, or ethnicity.
Figure 8. Our eye contact detection method enables studying and quantifying attention allocation during everyday mobile interactions. Knowing when users look at their device (black blocks) and when they look away (white blocks) is a key component in deriving attention metrics such as the number of glances (in yellow), the number of attention shifts (in green from the environment to the device and in purple from the device towards the environment), the duration of attention span (total duration of attention towards the device or the environment in a time interval), or the primary attentional focus.

Limitations and Future Work
While we have demonstrated significant improvements in terms of performance and robustness for mobile eye contact detection, our method also has several limitations.

One of the key components in our pipeline is the appearance-based gaze estimator, and our method's performance is directly influenced by it. In our experiments, we highlighted a limitation of current gaze estimation datasets, namely the limited variability in head pose angles in comparison to data collected in the wild. As a result, gaze estimates tend to be inaccurate and harm the performance of our method. We addressed this limitation by introducing adaptive thresholding which, for extreme head poses, uses the head pose as a proxy for the unreliable gaze estimates. Overall, this improved performance but may miss cases when the head is turned away from the device but users still look at it. One possibility to address this problem is to collect new gaze estimation datasets with more realistic head pose distributions to improve model training.

Besides further improved performance, runtime improvements will broaden our method's applicability and practical usefulness. In its current implementation, our approach is only suited for offline attention analysis, i.e. for processing image data post-hoc. While such post-hoc analysis will already be sufficient for many applications, real-time eye contact detection on mobile devices will pave the way for a whole new range of applications completely unthinkable today. In particular, we see significant potential of real-time eye contact detection for mobile HCI tasks such as predicting user interruptibility, estimating noticeability of user interface content, or measuring user engagement. Additionally, a real-time algorithm could process the recorded video directly on the device and would not require storing the recordings externally, potentially even in the cloud, which would likely raise serious privacy concerns.

CONCLUSION
In this work, we proposed a novel method to sense and analyse users' visual attention on mobile devices during everyday interactions. Through in-depth evaluations on two current datasets, we demonstrated significant performance improvements for the sample task of eye contact detection across mobile devices, users, or environmental conditions compared to the state of the art. We further discussed a number of additional attention metrics that can be extracted using our method and that have wide applicability for a range of applications in attentive user interfaces and beyond. Taken together, these results are significant in that they, for the first time, enable researchers and practitioners to unobtrusively study and robustly quantify attention allocation during mobile interactions in daily life.

REFERENCES
1. Saeed Abdullah, Elizabeth L. Murnane, Mark Matthews, Matthew Kay, Julie A. Kientz, Geri Gay, and Tanzeem Choudhury. 2016. Cognitive Rhythms: Unobtrusive and Continuous Sensing of Alertness Using a Mobile Phone. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp '16). ACM, New York, NY, USA, 178–189. DOI: http://dx.doi.org/10.1145/2971648.2971712
2. Piotr D. Adamczyk and Brian P. Bailey. 2004. If Not Now, When?: The Effects of Interruption at Different Moments Within Task Execution. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04). ACM, New York, NY, USA, 271–278. DOI: http://dx.doi.org/10.1145/985692.985727
3. Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Joerg Sander. 1999. OPTICS: Ordering Points to Identify the Clustering Structure. Sigmod Record 28 (6 1999), 49–60. DOI: http://dx.doi.org/10.1145/304182.304187
4. Tadas Baltrusaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: An open source facial behavior analysis toolkit. In WACV. IEEE Computer Society, 1–10.
5. Andreas Bulling. 2016. Pervasive Attentive User Interfaces. IEEE Computer 1 (2016), 94–98.
6. Minsoo Choy, Daehoon Kim, Jae-Gil Lee, Heeyoung Kim, and Hiroshi Motoda. 2016. Looking Back on the Current Day: Interruptibility Prediction Using Daily Behavioral Features. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp '16). ACM, New York, NY, USA, 1004–1015. DOI: http://dx.doi.org/10.1145/2971648.2971649
7. Alexandre De Masi and Katarzyna Wac. 2018. You're Using This App for What?: A mQoL Living Lab Study. In Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers (UbiComp '18). ACM, New York, NY, USA, 612–617. DOI: http://dx.doi.org/10.1145/3267305.3267544
8. Jiankang Deng, Yuxiang Zhou, Shiyang Cheng, and Stefanos P. Zafeiriou. 2018. Cascade Multi-View Hourglass Model for Robust 3D Face Alignment. 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) (2018), 399–403.
9. Connor Dickie, Roel Vertegaal, Jeffrey S. Shell, Changuk Sohn, Daniel Cheng, and Omar Aoudeh. 2004. Eye Contact Sensing Glasses for Attention-sensitive Wearable Video Blogging. In CHI '04 Extended Abstracts on Human Factors in Computing Systems (CHI EA '04). ACM, New York, NY, USA, 769–770. DOI: http://dx.doi.org/10.1145/985921.985927
10. Tilman Dingler and Martin Pielot. 2015. I'll Be There for You: Quantifying Attentiveness Towards Mobile Messaging. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI '15). ACM, New York, NY, USA, 1–5. DOI: http://dx.doi.org/10.1145/2785830.2785840
11. Trinh Minh Tri Do, Jan Blom, and Daniel Gatica-Perez. 2011. Smartphone Usage in the Wild: A Large-scale Analysis of Applications and Context. In Proceedings of the 13th International Conference on Multimodal Interfaces (ICMI '11). ACM, New York, NY, USA, 353–360. DOI: http://dx.doi.org/10.1145/2070481.2070550
12. Anja Exler, Marcel Braith, Andrea Schankin, and Michael Beigl. 2016. Preliminary Investigations About Interruptibility of Smartphone Users at Specific Place Types. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct (UbiComp '16). ACM, New York, NY, USA, 1590–1595. DOI: http://dx.doi.org/10.1145/2968219.2968554
13. Mohammed E. Fathy, Vishal M. Patel, and Rama Chellappa. 2015. Face-based Active Authentication on mobile devices. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015), 1687–1691.
14. Corey Holland and Oleg Komogortsev. 2012. Eye tracking on unmodified common tablets: challenges and solutions. In Proceedings of the Symposium on Eye Tracking Research and Applications. ACM, 277–280.
15. Scott Hudson, James Fogarty, Christopher Atkeson, Daniel Avrahami, Jodi Forlizzi, Sara Kiesler, Johnny Lee, and Jie Yang. 2003. Predicting Human Interruptibility with Sensors: A Wizard of Oz Feasibility Study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '03). ACM, New York, NY, USA, 257–264. DOI: http://dx.doi.org/10.1145/642611.642657
16. Zhiping Jiang, Jinsong Han, Chen Qian, Wei Xi, Kun Zhao, Han Ding, Shaojie Tang, Jizhong Zhao, and Panlong Yang. 2016. VADS: Visual attention detection with a smartphone. In Computer Communications, IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on. IEEE, 1–9.
17. Amy K. Karlson, Shamsi T. Iqbal, Brian Meyers, Gonzalo Ramos, Kathy Lee, and John C. Tang. 2010. Mobile Taskflow in Context: A Screenshot Study of Smartphone Usage. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '10). ACM, New York, NY, USA, 2009–2018. DOI: http://dx.doi.org/10.1145/1753326.1753631
18. Mohamed Khamis, Florian Alt, and Andreas Bulling. 2018a. The Past, Present, and Future of Gaze-enabled Handheld Mobile Devices: Survey and Lessons Learned. In Proc. International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI). 38:1–38:17. DOI: http://dx.doi.org/10.1145/3229434.3229452
19. Mohamed Khamis, Anita Baier, Niels Henze, Florian Alt, and Andreas Bulling. 2018b. Understanding Face and Eye Visibility in Front-Facing Cameras of Smartphones Used in the Wild. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). ACM, New York, NY, USA, Article 280, 12 pages. DOI: http://dx.doi.org/10.1145/3173574.3173854
20. Kyle Krafka, Aditya Khosla, Petr Kellnhofer, Harini Kannan, Suchendra Bhandarkar, Wojciech Matusik, and Antonio Torralba. 2016. Eye Tracking for Everyone. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
21. Uichin Lee, Joonwon Lee, Minsam Ko, Changhun Lee, Yuhwan Kim, Subin Yang, Koji Yatani, Gahgene Gweon, Kyong-Mee Chung, and Junehwa Song. 2014. Hooked on Smartphones: An Exploratory Study on Smartphone Overuse Among College Students. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '14). ACM, New York, NY, USA, 2327–2336. DOI: http://dx.doi.org/10.1145/2556288.2557366
22. Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. 2009. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision 81 (2 2009). DOI: http://dx.doi.org/10.1007/s11263-008-0152-6
23. Akhil Mathur, Nicholas D. Lane, and Fahim Kawsar. 2016. Engagement-aware Computing: Modelling User Engagement from Mobile Contexts. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp '16). ACM, New York, NY, USA, 622–633. DOI: http://dx.doi.org/10.1145/2971648.2971760
24. Lori McCay-Peet, Mounia Lalmas, and Vidhya Navalpakkam. 2012. On Saliency, Affect and Focused Attention. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12). ACM, New York, NY, USA, 541–550. DOI: http://dx.doi.org/10.1145/2207676.2207751
25. Abhinav Mehrotra, Veljko Pejovic, Jo Vermeulen, Robert Hendley, and Mirco Musolesi. 2016. My Phone and Me: Understanding People's Receptivity to Mobile Notifications. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16). ACM, New York, NY, USA, 1021–1032. DOI: http://dx.doi.org/10.1145/2858036.2858566
26. Philipp Müller, Daniel Buschek, Michael Xuelin Huang, and Andreas Bulling. 2019. Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage. In Proc. International Symposium on Eye Tracking Research and Applications (ETRA). DOI: http://dx.doi.org/10.1145/3314111.3319918
27. Philipp Müller, Michael Xuelin Huang, Xucong Zhang, and Andreas Bulling. 2018. Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour. In Proc. International Symposium on Eye Tracking Research and Applications (ETRA). 31:1–31:10. DOI: http://dx.doi.org/10.1145/3204493.3204549
28. Antti Oulasvirta, Sakari Tamminen, Virpi Roto, and Jaana Kuorelahti. 2005. Interaction in 4-second Bursts: The Fragmented Nature of Attentional Resources in Mobile HCI. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '05). ACM, New York, NY, USA, 919–928. DOI: http://dx.doi.org/10.1145/1054972.1055101
29. Martin Pielot, Bruno Cardoso, Kleomenis Katevas, Joan Serrà, Aleksandar Matic, and Nuria Oliver. 2017. Beyond Interruptibility: Predicting Opportune Moments to Engage Mobile Phone Users. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1, 3, Article 91 (Sept. 2017), 25 pages. DOI: http://dx.doi.org/10.1145/3130956
30. Martin Pielot, Karen Church, and Rodrigo de Oliveira. 2014a. An In-situ Study of Mobile Phone Notifications. In Proceedings of the 16th International Conference on Human-computer Interaction with Mobile Devices & Services (MobileHCI '14). ACM, New York, NY, USA, 233–242. DOI: http://dx.doi.org/10.1145/2628363.2628364
31. Martin Pielot, Rodrigo de Oliveira, Haewoon Kwak, and Nuria Oliver. 2014b. Didn't You See My Message?: Predicting Attentiveness to Mobile Instant Messages. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '14). ACM, New York, NY, USA, 3319–3328. DOI: http://dx.doi.org/10.1145/2556288.2556973
32. Martin Pielot, Tilman Dingler, Jose San Pedro, and Nuria Oliver. 2015. When Attention is Not Scarce - Detecting Boredom from Mobile Phone Usage. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp '15). ACM, New York, NY, USA, 825–836. DOI: http://dx.doi.org/10.1145/2750858.2804252
33. Qing-Xing Qu, Le Zhang, Wen-Yu Chao, and Vincent Duffy. 2017. User Experience Design Based on Eye-Tracking Technology: A Case Study on Smartphone APPs. In Advances in Applied Digital Human Modeling and Simulation, Vincent G. Duffy (Ed.). Springer International Publishing, Cham, 303–315.
34. Rajeev Ranjan, Shalini De Mello, and Jan Kautz. 2018. Light-Weight Head Pose Invariant Gaze Tracking. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2018), 2237–22378.
35. Adria Recasens, Aditya Khosla, Carl Vondrick, and Antonio Torralba. 2015. Where are they looking? In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 199–207. http://papers.nips.cc/paper/5848-where-are-they-looking.pdf
36. Ted Selker, Andrea Lockerd, and Jorge Martinez. 2001. Eye-R, a Glasses-mounted Eye Motion Detection Interface. In CHI '01 Extended Abstracts on Human Factors in Computing Systems (CHI EA '01). ACM, New York, NY, USA, 179–180. DOI: http://dx.doi.org/10.1145/634067.634176
37. Jeffrey S. Shell, Roel Vertegaal, and Alexander W. Skaburskis. 2003. EyePliances: Attention-seeking Devices That Respond to Visual Attention. In CHI '03 Extended Abstracts on Human Factors in Computing Systems (CHI EA '03). ACM, New York, NY, USA, 770–771. DOI: http://dx.doi.org/10.1145/765891.765981
38. Brian A. Smith, Qi Yin, Steven K. Feiner, and Shree K. Nayar. 2013. Gaze Locking: Passive Eye Contact Detection for Human-object Interaction. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology (UIST '13). ACM, New York, NY, USA, 271–280. DOI: http://dx.doi.org/10.1145/2501988.2501994
39. Julian Steil, Philipp Müller, Yusuke Sugano, and Andreas Bulling. 2018. Forecasting User Attention During Everyday Mobile Interactions Using Device-integrated and Wearable Sensors. In Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI '18). ACM, New York, NY, USA, Article 1, 13 pages. DOI: http://dx.doi.org/10.1145/3229434.3229439
40. Dereck Toker, Cristina Conati, Ben Steichen, and Giuseppe Carenini. 2013. Individual User Characteristics and Information Visualization: Connecting the Dots Through Eye Tracking. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13). ACM, New York, NY, USA, 295–304. DOI: http://dx.doi.org/10.1145/2470654.2470696
41. Vytautas Vaitukaitis and Andreas Bulling. 2012. Eye gesture recognition on portable devices. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing. ACM, 711–714.
42. Roel Vertegaal and others. 2003. Attentive user interfaces. Commun. ACM 46, 3 (2003), 30–33.
43. Michael Voit and Rainer Stiefelhagen. 2008. Deducing the Visual Focus of Attention from Head Pose Estimation in Dynamic Multi-view Meeting Scenarios. In Proceedings of the 10th International Conference on Multimodal Interfaces (ICMI '08). ACM, New York, NY, USA, 173–180. DOI: http://dx.doi.org/10.1145/1452392.1452425
44. Pierre Weill-Tessier, Jayson Turner, and Hans Gellersen. 2016. How do you look at what you touch?: a study of touch interaction and gaze correlation on tablets. In Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications. ACM, 329–330.
45. Erroll Wood and Andreas Bulling. 2014. Eyetab: Model-based gaze estimation on unmodified tablet computers. In Proceedings of the Symposium on Eye Tracking Research and Applications. ACM, 207–210.
46. Z. Ye, Y. Li, Y. Liu, C. Bridges, A. Rozga, and J. M. Rehg. 2015. Detecting bids for eye contact using a wearable camera. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 1. 1–8. DOI: http://dx.doi.org/10.1109/FG.2015.7163095
47. K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. 2016. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23, 10 (Oct. 2016), 1499–1503. DOI: http://dx.doi.org/10.1109/LSP.2016.2603342
48. Mengmi Zhang, Keng Teck Ma, Joo-Hwee Lim, Qi Zhao, and Jiashi Feng. 2017a. Deep Future Gaze: Gaze Anticipation on Egocentric Videos Using Adversarial Networks. In CVPR. 3539–3548.
49. Xucong Zhang, Michael Xuelin Huang, Yusuke Sugano, and Andreas Bulling. 2018. Training Person-Specific Gaze Estimators from Interactions with Multiple Devices. In Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). 624:1–624:12. DOI: http://dx.doi.org/10.1145/3173574.3174198
50. Xucong Zhang, Yusuke Sugano, and Andreas Bulling. 2017b. Everyday eye contact detection using unsupervised gaze target discovery. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology. ACM, 193–203.
51. Xucong Zhang, Yusuke Sugano, and Andreas Bulling. 2018. Revisiting Data Normalization for Appearance-based Gaze Estimation. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications (ETRA '18). ACM, New York, NY, USA, Article 12, 9 pages. DOI: http://dx.doi.org/10.1145/3204493.3204548
52. Xucong Zhang, Yusuke Sugano, and Andreas Bulling. 2019. Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications. In Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). DOI: http://dx.doi.org/10.1145/3290605.3300646
53. X. Zhang, Y. Sugano, M. Fritz, and A. Bulling. 2015. Appearance-based gaze estimation in the wild. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4511–4520. DOI: http://dx.doi.org/10.1109/CVPR.2015.7299081
54. Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. 2017. It's Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation. In 1st International Workshop on Deep Affective Learning and Context Modelling. IEEE, 2299–2308.
55. Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. 2019. MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41, 1 (2019), 162–175. DOI: http://dx.doi.org/10.1109/TPAMI.2017.2778103