
Fast and Robust Dynamic Hand Gesture Recognition via Key Frames Extraction and Feature Fusion

Hao Tang1, Hong Liu2∗, Wei Xiao3, Nicu Sebe1


1 Department of Information Engineering and Computer Science, University of Trento, Trento, Italy
2 Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, Beijing, China
3 Lingxi Artificial Intelligence Co., Ltd, Shenzhen, China

E-mail: {hao.tang, niculae.sebe}@unitn.it; [email protected]; [email protected]
∗ Corresponding author.

Abstract
Gesture recognition is a hot topic in computer vision and pattern recognition, and it plays a vitally important role in natural human-computer interaction. Although great progress has been made recently, fast and robust hand gesture recognition remains an open problem, since existing methods have not balanced performance and efficiency well. To bridge this gap, this work combines image entropy and density clustering to extract key frames from hand gesture videos for further feature extraction, which improves the efficiency of recognition. Moreover, a feature fusion strategy is also proposed to further improve the feature representation, which elevates the recognition performance. To validate our approach in a "wild" environment, we also introduce two new datasets called HandGesture and Action3D. Experiments consistently demonstrate that our strategy achieves competitive results on the Northwestern University, Cambridge, HandGesture and Action3D hand gesture datasets. Our code and datasets will be released at https://github.com/Ha0Tang/HandGestureRecognition.
Keywords: Hand gesture recognition; Key frames extraction; Feature fusion; Fast; Robust.

1. Introduction

Gesture recognition aims to recognize category labels from an image or a video that contains gestures made by the user. Gestures are expressive, meaningful body motions involving physical movements of the fingers, hands, arms, head, face, or body, made with the intent of conveying meaningful information or interacting with the environment. Hand gestures are one of the most expressive, natural and common types of body language for conveying attitudes and emotions in human interactions. For example, in a television control system, hand gestures can carry commands such as "Pause", "Play", "Next Channel", "Previous Channel", "Volume Up", "Volume Down" and "Menu Item", while in a recommendation system, hand gestures can express the "Like" or "Dislike" emotions of users. Thus, hand gesture recognition is one of the most fundamental problems in computer vision and pattern recognition, and it has a wide range of applications such as virtual reality systems [1], interactive gaming platforms [2], sign language recognition [3, 4, 5], enabling very young children to interact with computers [6], robot control [7, 8], music conducting practice [9], television control [10, 11], automotive interfaces [12, 13], learning and teaching assistance [14, 15], and hand gesture generation [16].

There has been significant progress in hand gesture recognition; however, some key problems, e.g., speed and robustness, remain challenging. Prior work usually puts the emphasis on using the whole data series, which always contains redundant information, resulting in degraded performance. For example, Wang et al. [1] present a superpixel-based hand gesture recognition system based on a novel superpixel earth mover's distance metric. Ren et al. [2] focus on building a robust part-based hand gesture recognition system. Hikawa and Kaida [3] propose a posture recognition system with a hybrid network. Moreover, many approaches have also been proposed for action or video recognition tasks, such as [17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Liu and Shao [17] introduce an adaptive learning methodology to extract spatio-temporal features, simultaneously fusing RGB and depth information, from RGB-D video data for visual recognition tasks. Liu et al. [26] propose to combine the Salient Depth Map (SDM) and the Binary Shape Map (BSM) for the human action recognition task. Simonyan and Zisserman [27] propose a two-stream ConvNet architecture which incorporates spatial and temporal networks to extract spatial and temporal features. Feichtenhofer et al. [28] study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. In sum, all these efforts endeavor to decrease the computation burden within each single frame, while overlooking the fact that processing all frames incurs a far greater computation burden than processing only a few selected representative frames, which is a fundamental way to greatly decrease the computation burden.



This paper is devoted to bridging the gap between fast and robust hand gesture recognition using only a single popular cue, e.g., RGB, which ensures great potential for practical use.

Key frames, also known as representative frames, capture the main content of a data series and can greatly reduce the amount of data to be processed. In [29], the key frames of a video sequence are selected by their discriminative power and are represented by the local motion features detected in them and integrated from their temporal neighbors. Carlsson and Sullivan [30] demonstrate that specific actions can be recognized in long video sequences by matching shape information extracted from individual frames to stored prototypes representing key frames of the action. In contrast, we regard every frame in a video as a point in a 2-D coordinate space. Since we are focusing on distinguishing dynamic gestures from a data series rather than reconstructing it, we simply introduce a measure to find which frames are more important for distinguishing and which are not. Considering that information entropy [31, 32, 33] is a useful measure to quantify the information each frame contains, we introduce frame entropy as a quantitative feature to describe each frame and then map these values into a 2-D coordinate space. How to describe this 2-D space is a hard nut to crack because of its uneven distribution. Therefore, we further propose an integrated strategy to extract key frames using local extreme points and density clustering. Local extreme points include the local maximum and local minimum points, which represent the most discriminative points of the frame entropy. Shao and Ji [34] also propose a key frame extraction method based on entropy. However, the differences between [34] and the proposed method are two-fold: (i) the entropy in [34] is calculated on motion histograms of each frame, while the proposed method calculates it directly on each frame; (ii) [34] simply finds peaks in the entropy curve and uses histogram intersection to output the final key frames, while the proposed method first selects the local peaks of the entropy and then uses density clustering to calculate the cluster centers as the final key frames. Density clustering [35] is an approach based on the local density of feature points, which is able to detect local clusters, while previous clustering approaches such as dynamic Delaunay clustering [36], k-means clustering [37], spectral clustering [38] and graph clustering [39] cannot detect local clusters because they only rely on the distances between feature points.

In order to promote the accuracy, we also present a novel feature fusion method that combines appearance and motion cues. After extracting key frames, we replace the original video sequence with the key frames sequence, which greatly enhances the time efficiency at the cost of accuracy. The feature fusion strategy takes advantage of both the motion and the appearance information in the spatio-temporal activity context under a hierarchical model. The experimental results show that the proposed method is accurate and effective for dynamic hand gesture recognition on four datasets. To summarize, the main contributions of this paper are:

• A novel key frames extraction method is proposed, which improves the efficiency of hand gesture processing.

• A feature fusion strategy is presented in which appearance and motion cues are fused to elevate the accuracy of recognition.

• Experiments demonstrate that our method achieves a balance between efficiency and accuracy on four datasets.

2. Key Frames Extraction and Feature Fusion Strategy for Hand Gesture Recognition

In this section, we introduce the proposed key frames extraction and feature fusion strategy.

2.1. Key Frames Extraction

Key frames extraction is the key technology for video abstraction, which can greatly remove the redundant information in a video. The key frames extraction algorithm also affects how well the video content can be reconstructed. Let a frame in video V be represented by f_i, where i = 1, 2, ..., n and n is the total number of frames in video V. The key frames set S_{Keyframes} is then defined as follows:

S_{Keyframes} = f_{Keyframes}(V),   (1)

where f_{Keyframes} denotes the key frames extraction procedure.

In this paper, a key frames extraction method based on image entropy and density clustering is proposed, as shown in Figure 1. Our key frames extraction method is mainly divided into three steps, namely, 1) calculating the image entropy, 2) finding the local extreme points, and 3) executing density clustering. The following sections expand upon each step.

2.1.1. Image Entropy.

In this section, we try to find a proper descriptive index to evaluate each frame in a video, facilitating key frame extraction. Informative frames could better summarize the whole video in which they reside, but how to quantify the information each frame contains is a hard nut to crack. Firstly, we calculate the image entropy of each frame, and then map the values into a two-dimensional coordinate space, as shown in Figure 1(b). Entropy is a good way of representing the impurity or unpredictability of a set of data, since it depends on the context in which the measurement is taken. For a single video frame, the gray-scale intensity distribution of this frame can be seen as a probability distribution p = {p_1, p_2, ..., p_n}.
Figure 1: The framework of the proposed key frames extraction method. (a) A hand gesture sequence sample from the Northwestern University hand gesture dataset, which contains 26 frames; the key frames obtained by our method (frames 2, 9, 14, 20 and 26) are shown in green boxes. (b) Calculate the image entropy. (c) Select the local peak points. (d) Calculate ρ and δ. (e) Select the number of clusters; in this case, we choose 5 clusters. (f) The final results of clustering: points 2, 5, 8, 9 and 10 are the cluster centers, so the corresponding frames (2, 9, 14, 20 and 26) are the key frames of the original sequence. (g) The key frames (frames 2, 9, 14, 20 and 26), which replace the original sequence in the next step.

For an image frame f_i, its image entropy can then be defined as:

E(f_i) = -\sum_{j} p_{f_i}(j) \log p_{f_i}(j),   (2)

where p_{f_i}(j) denotes the probability density function of frame f_i, which can be obtained by normalizing its histogram of gray-scale pixel intensities. Next, we map the value E(f_i) to a two-dimensional coordinate space (the E(f_i) vs. i plot).
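To make Eq. (2) concrete, the following is a minimal Python sketch (not the authors' released Matlab implementation) that computes the gray-scale entropy of every frame and produces the (i, E(f_i)) curve of Figure 1(b). It assumes OpenCV and NumPy are available, and the function names are ours; the base of the logarithm only rescales the curve, so it does not change which frames are extrema.

import cv2
import numpy as np

def frame_entropy(gray):
    """Entropy of one gray-scale frame, Eq. (2): -sum_j p(j) log p(j)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()          # normalized gray-level histogram
    p = p[p > 0]                   # skip empty bins (0 log 0 := 0)
    return float(-np.sum(p * np.log2(p)))   # log base is a free choice

def video_entropy_curve(path):
    """Map each frame f_i of a video to the point (i, E(f_i))."""
    cap = cv2.VideoCapture(path)
    curve = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        curve.append(frame_entropy(gray))
    cap.release()
    return np.asarray(curve)       # curve[i] = E(f_i)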
2.1.2. Local Extreme Points.

Secondly, we pick the local extreme points in the two-dimensional coordinate space, as illustrated in Figure 1(c). Local extreme points include the local maximum points and the local minimum points. Local maximum points are calculated as follows:

P_{max} = \begin{cases} E(f_i), & \text{if } E(f_i) > E(f_{i+1}) \text{ and } E(f_i) > E(f_{i-1}), \\ \text{remove}, & \text{otherwise}. \end{cases}   (3)

Local minimum points are calculated by the following formula:

P_{min} = \begin{cases} E(f_i), & \text{if } E(f_{i+1}) > E(f_i) \text{ and } E(f_{i-1}) > E(f_i), \\ \text{remove}, & \text{otherwise}, \end{cases}   (4)

where i = 1, 2, ..., n. The local extreme points P_{extreme} are then obtained as the union:

P_{extreme} = P_{max} ∪ P_{min}.   (5)

The local extreme points further extract representative frames from the original video sequence. This procedure can be viewed as finding local representatives that roughly describe the original sequence.
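Eqs. (3)-(5) amount to a simple scan over the entropy curve; a sketch under the same assumptions (NumPy, hypothetical helper name) is given below.

import numpy as np

def local_extreme_points(entropy):
    """Return indices of local maxima and minima of the entropy curve,
    i.e. P_extreme = P_max ∪ P_min from Eqs. (3)-(5)."""
    e = np.asarray(entropy, dtype=np.float64)
    maxima, minima = [], []
    for i in range(1, len(e) - 1):      # endpoints have only one neighbor
        if e[i] > e[i - 1] and e[i] > e[i + 1]:
            maxima.append(i)            # Eq. (3): strict local maximum
        elif e[i] < e[i - 1] and e[i] < e[i + 1]:
            minima.append(i)            # Eq. (4): strict local minimum
    return sorted(maxima + minima)      # Eq. (5): union of both sets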
the proposed key frames extraction method is summarized
in Algorithm 1. Note that the proposed density cluster-
2.1.3. Density Clustering.
ing cannot handle the situation where the entropy of the
After obtaining the extreme points, as shown in Figure video sequence is monotone increasing or decreasing since
1(c), we try to cluster these points into N (N is a pre- we need to select the local extreme points. While in our ex-
defined constant for all the datasets) categories, as, {1, 2}, periments, we observer that there is no one video sequence
{3, 4, 5, 6, 7}, {8}, {9} and {10}. The distribution of these which frame entropy is monotone increasing or decreasing
extreme points have the characteristics, the cluster centers all the time, it means we can always obtain the local ex-
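A compact sketch of the density-peak step of Eqs. (6)-(8), applied to the 2-D points (i, E(f_i)) at the local extrema, is shown below. The cutoff d_c and the kernel choice are user parameters; the convention of assigning the maximum distance as δ to the highest-density point follows [35]; and this is an illustrative re-implementation with our own naming, not the authors' code.

import numpy as np

def density_peaks(points, d_c, n_centers, gaussian=True):
    """points: (m, 2) array of (frame index, entropy) for the extreme points.
    Returns the row indices of the n_centers chosen cluster centers."""
    pts = np.asarray(points, dtype=np.float64)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)  # pairwise distances

    if gaussian:                                   # Eq. (7), Gaussian kernel
        rho = np.exp(-(d / d_c) ** 2).sum(axis=1) - 1.0   # drop the self-term
    else:                                          # Eq. (6), cutoff kernel
        rho = (d < d_c).sum(axis=1) - 1.0

    # Eq. (8): delta_k = min distance to any point of higher density;
    # the highest-density point gets the maximum distance (convention of [35]).
    delta = np.empty(len(pts))
    for k in range(len(pts)):
        higher = rho > rho[k]
        delta[k] = d[k, higher].min() if higher.any() else d[k].max()

    return np.argsort(delta)[::-1][:n_centers]     # the N largest delta values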
Algorithm 1 The proposed key frames extraction method.
Require: The original hand gesture video V, as shown in Figure 1(a), and the number of key frames N (N is a pre-defined constant for all the datasets).
Ensure: The key frames S_{Keyframes} of the original video V, as shown in Figure 1(g).
1: Calculate the image entropy E(f_i) of each frame in V using Formula (2);
2: Map E(f_i) to a two-dimensional coordinate space;
3: Find the local maximum points P_{max} in the two-dimensional coordinate space using Formula (3);
4: Find the local minimum points P_{min} in the two-dimensional coordinate space using Formula (4);
5: Obtain P_{extreme} by uniting the local maximum points P_{max} and local minimum points P_{min} using Formula (5);
6: Calculate ρ for each point in P_{extreme} using Formula (6) or (7);
7: Calculate δ for each point in P_{extreme} using Formula (8);
8: Draw a decision graph like Figure 1(d);
9: Choose the N largest δ values as the clustering centers, as shown in Figure 1(e);
10: The corresponding x-coordinates of the N clustering centers are the key frames S_{Keyframes};
11: return S_{Keyframes}.

2.2. Feature Fusion Strategy

Considering each key frame and the relationships between key frames, in order to better represent each key frame sequence we describe not only each key frame itself, but also the variation between the key frames. That is, we not only extract the most representative frames and map them into a simple 2-D space, but also describe how these frames move in this space. We believe this two-phase strategy can set up a "holographic" description of each hand gesture sequence. Hadid and Pietikäinen [40] also demonstrate that excellent results can be obtained by combining appearance and motion features for dynamic video analysis. Jain et al. [41] propose a two-stream fully convolutional neural network which fuses together motion and appearance in a unified framework, and substantially improve the state-of-the-art results for segmenting unseen objects in videos. Xu et al. [42] consider exploiting the appearance and motion information residing in the video with an attention mechanism for the video question answering task. For this purpose, we propose a feature fusion strategy to capture these two phases: an appearance-based approach can only be applied to each frame, and thus represents merely the spatial differences, while a motion-based method can describe the evolution over time. Thus we combine appearance and motion features to better describe an image sequence. Meanwhile, to better weight these two features, we also introduce an efficient strategy to balance them. Tang et al. [16] also propose a feature fusion method which fuses features extracted from different sleeves to boost the recognition performance.

Figure 2 shows the whole proposed feature fusion procedure for the obtained key frames. After extracting key frames, we take the key frames sequence in place of the original sequence. We begin by extracting key frames from the original sequence, and then extract appearance and motion features (hist1 and hist2 in Figure 2) from the key frames sequence, respectively. To further increase the importance of the useful features, we add weights to the appearance and motion features. By feeding hist1 and hist2 to the SVM classifier separately, we obtain two classification accuracies R = {R_a, R_m}. Based on the assumption that the higher the rate is, the better the representation becomes, we compute the weights as follows:

T = (R - \min(R)) / ((100 - \min(R)) / 10).   (9)

Finally, considering that the weight of the lowest rate is 1, the other weights can be obtained according to a linear relationship of their differences to that with the lowest rate. The final step is written as:

T1 = round(T),
T2 = T × (\max(T1) - 1) / \max(T) + 1,   (10)
W = round(T2),

in which W = {α, β} is the weight vector corresponding to hist1 and hist2.

There are many existing descriptors available to extract hist1 and hist2. In other words, the fusion strategy does not depend on specific descriptors, which guarantees its great potential in applications. In terms of hist1, we can use Gist [43], rootSIFT [44], HSV, SURF [45], HOG [46], LBP [47] or its variation CLBP [48] to extract the appearance cue of each image. As for hist2, LBP-TOP, VLBP [49] and SIFT 3D [50] are used to extract motion cues from the whole key frames sequence. We also use Bag-of-Features (BoF) [24, 51] to represent these appearance and motion cues. At the end of the procedure, we concatenate the weighted hist1 and hist2 to obtain the final representation hist (as shown in Figure 2).

2.3. Hand Gesture Recognition Framework

The hand gesture recognition framework based on key frames and feature fusion is composed of two stages, training and testing, which are summarized in Algorithm 2. In the training stage, we first extract the key frames S_{Keyframes} using Algorithm 1 (step 3). Then we extract appearance features from each key frame using descriptors such as SURF, LBP, etc. After obtaining the appearance features, we employ BoF [24, 51] to represent these features as hist1_q (step 4). Next, we use LBP-TOP, VLBP or SIFT 3D to extract motion features from the whole key frames sequence, producing the corresponding histogram hist2_q (step 5). After that, hist1_q and hist2_q are fed to separate classifiers to obtain R = {R_a, R_m} (step 6). We then calculate α and β by Formulas (9) and (10) (step 7), and the training histogram hist_q is constructed from α hist1_q and β hist2_q (step 8). At the end of the iteration, we obtain the training representation vectors hist. Then hist and the corresponding labels L_label are fed to an SVM classifier (step 10). During the testing stage, the testing hand gesture representation is obtained in the same way as in the training stage (step 12). Thereby the trained SVM classifier is used to predict the gesture label t_label (step 13).
Figure 2: The framework of the proposed feature extraction and fusion methods.
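Putting the fusion stage of Section 2.2 and the classification stage of Section 2.3 together, a hedged Python sketch (our naming, with scikit-learn standing in for the SVM; the paper's own implementation is in Matlab) looks roughly as follows. The accuracies R_a and R_m are assumed to be given in percent.

import numpy as np
from sklearn.svm import SVC

def fusion_weights(r_appearance, r_motion):
    """Eqs. (9)-(10): turn the two classification accuracies (in percent)
    into the integer weight vector W = {alpha, beta}."""
    R = np.array([r_appearance, r_motion], dtype=np.float64)
    T = (R - R.min()) / ((100.0 - R.min()) / 10.0)   # Eq. (9)
    T1 = np.round(T)
    T2 = T * (T1.max() - 1.0) / T.max() + 1.0        # Eq. (10)
    return np.round(T2)                              # W = round(T2)

def fuse(hist1, hist2, alpha, beta):
    """Final representation: concatenation of the weighted histograms."""
    return np.concatenate([alpha * np.asarray(hist1), beta * np.asarray(hist2)])

def train_and_predict(train_hists, train_labels, test_hist):
    """Steps 10-13 of Algorithm 2 with a linear SVM on the fused histograms."""
    clf = SVC(kernel="linear")
    clf.fit(np.asarray(train_hists), np.asarray(train_labels))
    return clf.predict(np.asarray(test_hist).reshape(1, -1))[0]

# Example: the Cambridge accuracies of SURF (92.37%) and LBP-TOP (60.78%)
# from Table 4 give fusion_weights(92.37, 60.78) -> [8., 1.], matching the
# alpha = 8, beta = 1 setting used in Section 3.3.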

Algorithm 2 The proposed hand gesture recognition framework.
Require: L hand gesture videos for training, as shown in Figure 1(a), with the corresponding gesture labels L_label; a testing hand gesture video t.
Ensure: The hand gesture label t_label.
1: TRAINING STAGE:
2: for q = 1 to L do
3:   S^q_{Keyframes} ← Algorithm 1;
4:   hist1_q ← (SURF or GIST, etc.) ∪ BoF;
5:   hist2_q ← (VLBP or LBP-TOP or SIFT 3D, etc.) ∪ BoF;
6:   R = {R_a, R_m} ← training a classifier using hist1_q and hist2_q;
7:   α and β ← Formulas (9) and (10);
8:   hist_q ← {α hist1_q, β hist2_q};
9: end for
10: Classifier ← hist ∪ L_label;
11: TESTING STAGE:
12: Obtain the hand gesture representation hist_t for the testing video t using the same method as in the training stage;
13: Obtain t_label from the trained classifier;
14: return t_label.

3. Experiments and Analysis

3.1. Datasets and Settings

To evaluate the effectiveness of the proposed method, we conduct experiments on two publicly available datasets (the Cambridge [52] and Northwestern University [53] hand gesture datasets) and two collected datasets (the HandGesture and Action3D hand gesture datasets, both of which will be released after the paper is accepted). Some characteristics of these datasets are listed in Table 1.

Cambridge Hand Gesture dataset is a commonly used benchmark gesture dataset with 900 video clips of 9 hand gesture classes defined by 3 primitive hand shapes (i.e., flat, spread, V-shape) and 3 primitive motions (i.e., leftward, rightward, contract). Each class includes 100 sequences captured with 5 different illuminations, 10 arbitrary motions and 2 subjects. Each sequence is recorded in front of a fixed camera, with the gestures coarsely isolated in the spatial and temporal dimensions.

Northwestern University Hand Gesture dataset is a more diverse dataset which contains 10 categories of dynamic hand gestures in total: move right, move left, rotate up, rotate down, move downright, move right-down, clockwise circle, counterclockwise circle, "Z" and cross. This dataset is performed by 15 subjects, and each subject contributes 70 sequences of these ten categories with seven postures (i.e., Fist, Fingers extended, "OK", Index, Side Hand, Side Index and Thumb).

The two datasets mentioned above both have clear backgrounds, with sequences snipped tightly around the gestures. However, how well will this method work on videos from "the wild", with significant clutter, extraneous motion and continuously running video without pre-snipping? To validate our approach, we introduce two new datasets, called the HandGesture and Action3D datasets.

HandGesture dataset consists of 132 video sequences at 640 by 360 resolution, each of which is recorded from a different subject (7 males and 4 females) with 12 different gestures ("0"-"9", "NO" and "OK").

Action3D dataset, which we also acquired, consists of 1,620 image sequences of 6 hand gesture classes (box, high wave, horizontal wave, curl, circle and hand up), which are defined by 2 different hands (right and left) and 5 situations (sit, stand, with a pillow, with a laptop and with a person). Each class contains 270 image sequences (5 different situations × 2 different hands × 3 times × 9 subjects). Each sequence was recorded in front of a fixed camera, with the gestures roughly isolated in space and time. All video sequences were uniformly resized to 320 × 240 in our method.

3.2. Parameter Analysis

Two parameters are involved in our framework: the number of key frames N and the dictionary size D in BoF. Firstly, we extract N = 3, 4, ..., 9 key frames from the original video, respectively, and then extract SURF features from each key frame. Every key point detected by SURF provides a 64-D vector describing its local texture. Finally, we adopt BoF to represent each key frame with dictionary sizes D = 1, 2, 4, ..., 4096, respectively.
Table 1: Characteristics of the datasets used in our hand gesture recognition experiments.
Dataset # categories # videos # training # validation # testing
Cambridge 9 900 450 225 225
Northwestern 10 1,050 550 250 250
HandGesture 12 132 66 33 33
Action3D 6 1,620 810 405 405

For the numbers of training, validation and testing samples, please refer to Table 1. We repeat all the experiments 20 times with different random splits of the training and testing samples to obtain reliable results. The final classification accuracy is reported as the average over the runs. Figure 3 presents the accuracy results on the four datasets. From Figure 3 (a) and (c), the accuracy first rises until it peaks at D = 64 and then drops. However, as shown in Figure 3 (b) and (d), the accuracy reaches its peak at D = 16. Thus, we set D = 64 on the Cambridge and Northwestern datasets, and D = 16 on our two proposed datasets. We also observe that the more key frames we use, the more time is consumed. Thus, to balance accuracy and efficiency, we set N = 5 on all four datasets in the following experiments.
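For reference, the repeated random-split protocol used above (20 runs, accuracy averaged over runs) can be sketched as follows with scikit-learn; the feature matrix stands for the fused histograms of Section 2.2, the split size follows Table 1, and the validation split is omitted for brevity. Names and the choice of a linear SVM are our assumptions for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def repeated_random_splits(features, labels, test_size, runs=20, seed=0):
    """Average test accuracy over several random train/test splits,
    mirroring the 20-repetition protocol of Section 3.2."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(runs):
        x_tr, x_te, y_tr, y_te = train_test_split(
            features, labels, test_size=test_size,
            stratify=labels, random_state=rng.randint(1 << 30))
        clf = SVC(kernel="linear").fit(x_tr, y_tr)
        scores.append(clf.score(x_te, y_te))
    return float(np.mean(scores)), float(np.std(scores))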
3.3. Experiment Evaluation

To evaluate the necessity and efficiency of the proposed strategy, we test it from multiple aspects: (1) the necessity of key frames extraction; (2) different kernel tricks; (3) different fusion strategies; (4) different clustering methods; (5) performance comparisons with the state of the art; (6) efficiency. Items (1)-(4) demonstrate the rationality and validity of the methods, (5) compares the proposed method with others, and (6) shows its efficiency.

(1) Comparison with Different Key Frames Extraction Methods. We discuss here whether our key frames method is necessary or not. For the Cambridge and Action3D datasets, we only extract LBP-TOP and then concatenate the three orthogonal planes of LBP-TOP. For the Northwestern and HandGesture datasets, 200 points are randomly selected from each video using SIFT 3D. As we can see from Table 2, our approach outperforms the other four methods in both accuracy and efficiency; therefore our approach is not only a theoretical improvement but also has an empirical advantage.

(2) Gaussian Kernel vs. Cutoff Kernel. We also compare the Gaussian and cutoff kernels, adopting SURF to extract features from each key frame. The comparison results are shown in Table 3; we can observe that there is only a small difference between using the Gaussian and the cutoff kernel.

(3) Comparison with Different Feature Fusion Strategies. We demonstrate here that combining appearance (hand posture) and motion (the way the hand is moving) boosts the hand gesture recognition task. Moreover, we also compare different schemes based on appearance and motion, respectively. For feature fusion, we set α and β to 8 and 1, respectively. As shown in Table 4, the fusion strategies are much better than the appearance- or motion-based methods, which demonstrates the necessity of our feature fusion strategy. The motion-based methods achieve the worst results, which illustrates that in our task spatial cues are more important than temporal cues. The spatial cues capture the differences between different gesture classes, also called inter-class differences, while the temporal cues capture the differences among frames within the same gesture sequence, also called intra-class differences. Inter-class differences are always greater than intra-class differences, which means the spatial cues can represent more discriminative features than the temporal cues. In our task, we observe that the differences between different types of gestures are much greater than the differences within the same gesture sequence, which means the spatial cues are more discriminative than the temporal cues. However, if the hand gesture moves fast and changes greatly within one sequence, the temporal cues could become more important.

(4) Comparison with Different Clustering Methods. We also compare different clustering methods for the key frames extraction. As shown in Table 5, density clustering is much better than K-means, OPTICS [54] and DBSCAN [55].

(5) Comparison with the State of the Art. For the Cambridge and Northwestern datasets, we compare our results with the state-of-the-art methods in Tables 6 and 7. We achieve 98.23% ± 0.84% and 96.89% ± 1.08% recognition accuracy on the Cambridge and Northwestern datasets, respectively, both of which exceed the other baseline methods.

(6) Efficiency. Finally, we investigate our approach in terms of the computation time for classifying one test video. We run our experiments on a 3.40-GHz i7-3770 CPU with 8 GB memory; the proposed method is implemented in Matlab, and the code and datasets will be released after the paper is accepted. As we can see from Table 8, the time for classifying one test sequence is 4.31s, 10.89s, 13.06s and 4.26s for the Cambridge, Northwestern, HandGesture and Action3D datasets, respectively. We observe that the proposed key frame extraction steps, including the entropy calculation and the density clustering, can be finished within around 1s on the Cambridge, Northwestern and Action3D datasets, while for the HandGesture dataset, which contains about 200 frames per video, they only cost about 3s per video. We also note that the most time-consuming part is feature extraction, and we have two ways to improve it: (i) we can reduce the size of the images; we note that Cambridge and Action3D only consume about 4s, while Northwestern and HandGesture cost about 11s and 13s respectively, and the reason is that the image size of Cambridge and Action3D is 320 × 240, while the image size of Northwestern and HandGesture is 640 × 480; (ii) we can further reduce the time for feature extraction by using a GPU, as most deep learning methods do. Moreover, we also compare two methods ([29] and [59], which currently achieve the best recognition results on the Cambridge and Northwestern datasets, respectively) in terms of the time for classifying a test sequence on the Cambridge, Northwestern, HandGesture and Action3D datasets. We re-implement both methods with the same running settings for a fair comparison, including the hardware platform and programming language. The results are shown in Table 8, and we can see that the proposed method achieves better results than both methods.
Figure 3: Parameters N and D selection on the four datasets.

Table 2: Comparison between different key frames extraction methods on the Cambridge, Northwestern, HandGesture and Action3D datasets.

Method                            Cambridge Accuracy   Time(s)   Northwestern Accuracy   Time(s)
Original Sequence                 35.26% ± 3.15%       20,803    21.65% ± 1.23%          108,029
5 evenly-spaced frames (in time)  56.13% ± 5.46%       1,189     58.79% ± 2.64%          26,303
Zhao and Elgammal [29]            58.14% ± 3.36%       1,432     61.45% ± 3.45%          27,789
Carlsson and Sullivan [30]        50.57% ± 4.78%       1,631     51.27% ± 3.86%          29,568
Our key frames method             60.78% ± 2.21%       1,152     64.24% ± 2.15%          25,214

Method                            HandGesture Accuracy Time(s)   Action3D Accuracy       Time(s)
Original Sequence                 42.54% ± 4.61%       8,549     34.56% ± 2.65%          284,489
5 evenly-spaced frames (in time)  58.32% ± 3.88%       1,689     52.13% ± 2.31%          18,430
Zhao and Elgammal [29]            60.45% ± 4.56%       1,895     54.56% ± 1.97%          20,143
Carlsson and Sullivan [30]        50.78% ± 4.06%       2,154     46.34% ± 2.78%          23,768
Our key frames method             65.18% ± 3.62%       1,645     56.13% ± 1.89%          16,294

Table 3: Comparison between the Gaussian kernel and the cutoff kernel on the Cambridge, Northwestern, HandGesture and Action3D datasets.
Dataset        Gaussian Kernel      Cutoff Kernel
Cambridge      92.37% ± 1.67%       90.33% ± 2.78%
Northwestern   81.31% ± 1.49%       80.25% ± 1.86%
HandGesture    96.32% ± 3.35%       94.54% ± 2.78%
Action3D       96.26% ± 1.39%       93.65% ± 1.23%

Table 4: Comparison with different feature fusion strategies on the Cambridge, Northwestern, HandGesture and Action3D datasets. N and D denote the number of key frames and the dictionary size.
Appearance-based Dimension Cambridge Northwestern HandGesture Action3D
SURF N ∗D 92.37% ± 1.67% 81.31% ± 1.49% 96.32% ± 3.35% 96.26% ± 1.39%
GIST N ∗D 88.15% ± 1.65% 78.41% ± 1.91% 92.64% ± 2.89% 91.23% ± 1.84%
Motion-based - Cambridge Northwestern HandGesture Action3D
LBP-TOP 177 60.78% ± 2.21% 51.36% ± 2.16% 60.84% ± 2.61% 56.13% ± 1.89%
VLBP 16,386 50.36% ± 3.56% 42.11% ± 3.04% 49.78% ± 3.51% 44.26% ± 4.23%
SIFT 3D N ∗D 68.94% ± 4.81% 64.24% ± 2.15% 65.18% ± 3.62% 62.04% ± 2.89%
Appearance + Motion - Cambridge Northwestern HandGesture Action3D
SURF + LBP-TOP N ∗ D+177 95.75% ± 0.79% 93.54% ± 1.36% 97.25% ± 0.79% 98.53% ± 1.31%
SURF + VLBP N ∗ D+16,386 92.52% ± 1.27% 91.22% ± 0.95% 96.82% ± 0.95% 97.21% ± 0.94%
SURF + SIFT 3D 2∗N ∗D 98.23% ± 0.84% 96.89% ± 1.08% 99.21% ± 0.88% 98.98% ± 0.65%
GIST + LBP-TOP N ∗ D+177 91.87% ± 1.65% 86.26% ± 0.94% 93.56% ± 1.35% 93.11% ± 0.89%
GIST + VLBP N ∗ D+16,386 90.56% ± 0.87% 82.87% ± 1.84% 92.88% ± 1.21% 92.63% ± 0.64%
GIST + SIFT 3D 2∗N ∗D 93.52% ± 0.63% 88.54% ± 1.62% 94.16% ± 0.67% 94.21% ± 0.61%

Table 5: Comparison with different clustering methods on the Cambridge, Northwestern, HandGesture and Action3D datasets.
Clustering Method OPTICS [54] DBSCAN [55] K-means Density Clustering [35]
Cambridge 88.15% ± 1.51% 90.34% ± 1.78% 86.26% ± 2.51% 98.23% ± 0.84%
Northwestern 86.34% ± 2.45% 88.35% ± 1.67% 83.65% ± 1.06% 96.89% ± 1.08%
HandGesture 84.56% ± 1.89% 85.98% ± 1.76% 84.69% ± 1.98% 99.21% ± 0.88%
Action3D 83.56% ± 1.56% 87.43% ± 1.63% 82.36% ± 1.46% 98.98% ± 0.65%

Table 6: Comparison with the state-of-the-art methods on the Cambridge dataset.


Cambridge Methods Accuracy
Wong and Cipolla [56] Sparse Bayesian Classifier 44%
Niebles et al. [57] Spatial-Temporal Words 67%
Kim et al. [52] Tensor Canonical Correlation Analysis 82%
Kim and Cipolla [58] Canonical Correlation Analysis 82%
Liu and Shao [59] Genetic Programming 85%
Lui et al. [60] High Order Singular Value Decomposition 88%
Lui and Beveridge [61] Tangent Bundle 91%
Wong et al. [62] Probabilistic Latent Semantic Analysis 91.47%
Sanin et al. [63] Spatio-Temporal Covariance Descriptors 93%
Baraldi et al. [64] Dense Trajectories + Hand Segmentation 94%
Zhao and Elgammal [29] Information Theoretic 96.22%
Ours Key Frames + Feature Fusion 98.23% ± 0.84%

Table 7: Comparison between the state-of-the-art methods and our method on the Northwestern University dataset.
Northwestern Methods Accuracy
Liu and Shao [59] Genetic Programming 96.1%
Shen et al. [53] Motion Divergence fields 95.8%
Our method Key Frames + Feature Fusion 96.89% ± 1.08%

Table 8: Computation time for classifying a test sequence on the Cambridge, Northwestern, HandGesture and Action3D datasets.
Time Cambridge Northwestern HandGesture Action3D
Entropy Calculation 0.93s 0.84s 3.21s 0.75s
Density Clustering 0.31s 0.34s 0.43s 0.38s
Feature Extraction 3.07s 9.71s 9.42s 3.13s
SVM Classification 0.60ms 0.51ms 0.46ms 0.65ms
Our Full Model 4.31s 10.89s 13.06s 4.26s
Liu and Shao [59] 6.45s 13.32s 15.32s 6.43s
Zhao and Elgammal [29] 5.34s 11.78s 14.98s 5.21s

4. Conclusion

In order to build a fast and robust gesture recognition system, we present in this paper a novel key frames extraction method and a feature fusion strategy. Considering the speed of recognition, we propose a new key frames extraction method based on image entropy and density clustering, which can greatly reduce the redundant information of the original video. Moreover, we further propose an efficient feature fusion strategy which combines appearance and motion cues for robust hand gesture recognition. Experimental results show that the proposed approach outperforms the state-of-the-art methods on the Cambridge (98.23% ± 0.84%) and Northwestern (96.89% ± 1.08%) datasets. To evaluate our method on videos from "the wild" with significant clutter, extraneous motion and no pre-snipping, we introduce two new datasets, namely HandGesture and Action3D, on which we achieve accuracies of 99.21% ± 0.88% and 98.98% ± 0.65%, respectively. In terms of recognition speed, we also achieve better results than the state-of-the-art approaches for classifying one test sequence on the Cambridge, Northwestern, HandGesture and Action3D datasets.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (NSFC, U1613209), the Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467), and the Scientific Research Project of Shenzhen City (JCYJ20170306164738129).

References

[1] C. Wang, Z. Liu, S.-C. Chan, Superpixel-based hand gesture recognition with kinect depth camera, IEEE TMM 17 (1) (2015) 29–39.
[2] Z. Ren, J. Yuan, J. Meng, Z. Zhang, Robust part-based hand gesture recognition using kinect sensor, IEEE TMM 15 (5) (2013) 1110–1120.
[3] H. Hikawa, K. Kaida, Novel fpga implementation of hand sign recognition system with som–hebb classifier, IEEE TCSVT 25 (1) (2015) 153–166.
[4] G. Marin, F. Dominio, P. Zanuttigh, Hand gesture recognition with leap motion and kinect devices, in: ICIP, 2014.
[5] A. Kuznetsova, L. Leal-Taixé, B. Rosenhahn, Real-time sign language recognition using a consumer depth camera, in: ICCVW, 2013.
[6] Y. Yao, Y. Fu, Contour model based hand-gesture recognition using kinect sensor, IEEE TCSVT 24 (11) (2014) 1935–1944.
[7] L. Prasuhn, Y. Oyamada, Y. Mochizuki, H. Ishikawa, A hog-based hand gesture recognition system on a mobile device, in: ICIP, 2014.
[8] P. Neto, D. Pereira, J. Norberto Pires, A. P. Moreira, Real-time and continuous hand gesture spotting: an approach based on artificial neural networks, in: ICRA, 2013.
[9] R. Schramm, C. Rosito Jung, E. Miranda, Dynamic time warping for music conducting gestures evaluation, IEEE TMM 17 (2) (2014) 243–255.
[10] S. Lian, W. Hu, K. Wang, Automatic user state recognition for hand gesture based low-cost television control system, IEEE TCE 60 (1) (2014) 107–115.
[11] W. T. Freeman, C. Weissman, Television control by hand gestures, in: AFGRW, 1995.
[12] E. Ohn-Bar, M. M. Trivedi, Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations, IEEE TITS 15 (6) (2014) 2368–2377.
[13] E. Ohn-Bar, M. M. Trivedi, The power is in your hands: 3d analysis of hand gestures in naturalistic video, in: CVPRW, 2013.
[14] S. Sathayanarayana, R. K. Satzoda, A. Carini, M. Lee, L. Salamanca, J. Reilly, D. Forster, M. Bartlett, G. Littlewort, Towards automated understanding of student-tutor interactions using visual deictic gestures, in: CVPRW, 2014.
[15] S. Sathyanarayana, G. Littlewort, M. Bartlett, Hand gestures for intelligent tutoring systems: Dataset, techniques & evaluation, in: ICCVW, 2013.
[16] H. Tang, W. Wang, D. Xu, Y. Yan, N. Sebe, Gesturegan for hand gesture-to-gesture translation in the wild, in: ACM MM, 2018.
[17] L. Liu, L. Shao, Learning discriminative representations from rgb-d video data, in: IJCAI, 2013.
[18] M. Yu, L. Liu, L. Shao, Structure-preserving binary representations for rgb-d action recognition, IEEE TPAMI 38 (8) (2016) 1651–1664.
[19] H. Tang, H. Liu, W. Xiao, Gender classification using pyramid segmentation for unconstrained back-facing video sequences, in: ACM MM, 2015.
[20] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3d convolutional networks, in: ICCV, 2015.
[21] L. Shao, L. Liu, M. Yu, Kernelized multiview projection for robust action recognition, Springer IJCV 118 (2) (2016) 115–129.
[22] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, L. Van Gool, Temporal segment networks: Towards good practices for deep action recognition, in: ECCV, 2016.
[23] S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition, IEEE TPAMI 35 (1) (2013) 221–231.
[24] H. Liu, H. Tang, W. Xiao, Z. Guo, L. Tian, Y. Gao, Sequential bag-of-words model for human action classification, CAAI Transactions on Intelligence Technology 1 (2) (2016) 125–136.
[25] J. Qin, L. Liu, L. Shao, F. Shen, B. Ni, J. Chen, Y. Wang, Zero-shot action recognition with error-correcting output codes, in: CVPR, 2017.
[26] H. Liu, L. Tian, M. Liu, H. Tang, Sdm-bsm: A fusing depth scheme for human action recognition, in: ICIP, 2015.
[27] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: NIPS, 2014.
[28] C. Feichtenhofer, A. Pinz, A. Zisserman, Convolutional two-stream network fusion for video action recognition, in: CVPR, 2016.
[29] Z. Zhao, A. M. Elgammal, Information theoretic key frame selection for action recognition, in: BMVC, 2008.
[30] S. Carlsson, J. Sullivan, Action recognition by shape matching to key frames, in: Workshop on Models versus Exemplars in Computer Vision, 2001.
[31] A. Brink, Using spatial information as an aid to maximum entropy image threshold selection, Elsevier PRL 17 (1) (1996) 29–36.

[32] B. S. Min, D. K. Lim, S. J. Kim, J. H. Lee, A novel method of determining parameters of clahe based on image entropy, International Journal of Software Engineering and Its Applications 7 (5) (2013) 113–120.
[33] X. Wang, C. Chen, Ship detection for complex background sar images based on a multiscale variance weighted image entropy method, IEEE Geoscience and Remote Sensing Letters 14 (2) (2017) 184–187.
[34] L. Shao, L. Ji, Motion histogram analysis based key frame extraction for human action/activity representation, in: CRV, 2009.
[35] A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
[36] S. K. Kuanar, R. Panda, A. S. Chowdhury, Video key frame extraction through dynamic delaunay clustering with a structural constraint, Elsevier JVCIP 24 (7) (2013) 1212–1227.
[37] S. E. F. De Avila, A. P. B. Lopes, A. da Luz, A. de Albuquerque Araújo, Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method, Elsevier PRL 32 (1) (2011) 56–68.
[38] R. Vázquez-Martín, A. Bandera, Spatio-temporal feature-based keyframe detection from video shots using spectral clustering, Elsevier PRL 34 (7) (2013) 770–779.
[39] R. Panda, S. K. Kuanar, A. S. Chowdhury, Scalable video summarization using skeleton graph and random walk, in: ICPR, 2014.
[40] A. Hadid, M. Pietikäinen, Combining appearance and motion for face and gender recognition from videos, Elsevier PR 42 (11) (2009) 2818–2827.
[41] S. D. Jain, B. Xiong, K. Grauman, Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos, in: CVPR, 2017.
[42] D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, Y. Zhuang, Video question answering via gradually refined attention over appearance and motion, in: ACM MM, 2017.
[43] A. Oliva, A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, Springer IJCV 42 (3) (2001) 145–175.
[44] R. Arandjelovic, A. Zisserman, Three things everyone should know to improve object retrieval, in: CVPR, 2012.
[45] H. Bay, T. Tuytelaars, L. Van Gool, Surf: Speeded up robust features, in: ECCV, 2006.
[46] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR, 2005.
[47] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE TPAMI 24 (7) (2002) 971–987.
[48] Z. Guo, L. Zhang, D. Zhang, A completed modeling of local binary pattern operator for texture classification, IEEE TIP 19 (6) (2010) 1657–1663.
[49] G. Zhao, M. Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE TPAMI 29 (6) (2007) 915–928.
[50] P. Scovanner, S. Ali, M. Shah, A 3-dimensional sift descriptor and its application to action recognition, in: ACM MM, 2007.
[51] N. H. Dardas, N. D. Georganas, Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques, IEEE TIM 60 (11) (2011) 3592–3607.
[52] T.-K. Kim, K.-Y. K. Wong, R. Cipolla, Tensor canonical correlation analysis for action classification, in: CVPR, 2007.
[53] X. Shen, G. Hua, L. Williams, Y. Wu, Dynamic hand gesture recognition: An exemplar-based approach from motion divergence fields, Elsevier IVC 30 (3) (2012) 227–235.
[54] M. Ankerst, M. M. Breunig, H.-P. Kriegel, J. Sander, Optics: ordering points to identify the clustering structure, ACM Sigmod Record 28 (2) (1999) 49–60.
[55] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, 1996.
[56] S.-F. Wong, R. Cipolla, Real-time interpretation of hand motions using a sparse bayesian classifier on motion gradient orientation images, in: BMVC, 2005.
[57] J. C. Niebles, H. Wang, L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, Springer IJCV 79 (3) (2008) 299–318.
[58] T.-K. Kim, R. Cipolla, Canonical correlation analysis of video volume tensors for action categorization and detection, IEEE TPAMI 31 (8) (2009) 1415–1428.
[59] L. Liu, L. Shao, Synthesis of spatio-temporal descriptors for dynamic hand gesture recognition using genetic programming, in: FGW, 2013.
[60] Y. M. Lui, J. R. Beveridge, M. Kirby, Action classification on product manifolds, in: CVPR, 2010.
[61] Y. M. Lui, J. R. Beveridge, Tangent bundle for human action recognition, in: FG, 2011.
[62] S.-F. Wong, T.-K. Kim, R. Cipolla, Learning motion categories using both semantic and structural information, in: CVPR, 2007.
[63] A. Sanin, C. Sanderson, M. T. Harandi, B. C. Lovell, Spatio-temporal covariance descriptors for action and gesture recognition, in: WACV, 2013.
[64] L. Baraldi, F. Paci, G. Serra, L. Benini, R. Cucchiara, Gesture recognition in ego-centric videos using dense trajectories and hand segmentation, in: CVPRW, 2014.

