Fast and Robust Dynamic Hand Gesture Recognition Via Key Frames Extraction and Feature Fusion
Abstract
Gesture recognition is a hot topic in computer vision and pattern recognition, which plays a vitally important role
in natural human-computer interfaces. Although great progress has been made recently, fast and robust hand gesture recognition remains an open problem, since existing methods have not balanced accuracy and efficiency well. To bridge this gap, this work combines image entropy and density clustering to extract key frames from hand gesture videos for further feature extraction, which improves the efficiency of recognition. Moreover, a feature fusion strategy is also proposed to further improve the feature representation, which elevates the recognition performance. To validate our approach in a "wild" environment, we also introduce two new datasets called HandGesture and Action3D. Experiments consistently demonstrate that our strategy achieves competitive results on the Northwestern University, Cambridge, HandGesture and Action3D hand gesture datasets. Our code and datasets will be released at https://github.com/Ha0Tang/HandGestureRecognition.
Keywords: Hand gesture recognition; Key frames extraction; Feature fusion; Fast; Robust.
Figure 1: The framework of the proposed key frames extraction method. (a) The original hand gesture video. (b) Calculate the image entropy. (c) Select the local peak points. (d) Calculate ρ and δ. (e) Select the number of clusters (5 clusters in this case). (f) The final clustering result: points 2, 5, 8, 9 and 10 are the cluster centers, so the corresponding frames (2, 9, 14, 20 and 26) are the key frames of the original sequence. (g) The key frames are frames 2, 9, 14, 20 and 26; this sequence replaces the original sequence in the following steps.
p = {p_1, p_2, ..., p_n}. For an image frame f_i, its image entropy can be defined as:

    E(f_i) = − Σ_j p_{f_i}(j) log p_{f_i}(j),    (2)

where p_{f_i}(j) denotes the probability density function of frame f_i, which can be obtained by normalizing its histogram of gray-scale pixel intensities. Next we map the value E(f_i) to a two-dimensional coordinate space (the E(f_i) vs. i plot).
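As a concrete illustration of Formula (2), the per-frame entropy can be computed from a normalized gray-level histogram. The following is a minimal Python/OpenCV sketch under our own naming (`frame_entropy` and `entropy_curve` are illustrative helpers, not the authors' released Matlab code):

```python
import cv2
import numpy as np

def frame_entropy(frame_bgr):
    """Image entropy of one frame, following Formula (2):
    E(f_i) = -sum_j p(j) * log(p(j)), where p(j) is the normalized
    gray-level histogram of the frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()        # probability of each gray level
    p = p[p > 0]                 # drop empty bins (0 * log 0 = 0)
    return float(-np.sum(p * np.log(p)))

def entropy_curve(video_path):
    """Map E(f_i) to the 2-D space (i, E(f_i)) for a whole video."""
    cap = cv2.VideoCapture(video_path)
    curve = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        curve.append(frame_entropy(frame))
    cap.release()
    return np.asarray(curve)
```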
2.1.2. Local Extreme Points.

Secondly, we pick the local extreme points in the two-dimensional coordinate space, as illustrated in Figure 1(c). Local extreme points include the local maximum points and the local minimum points. Local maximum points are calculated as follows:

    P_max = { E(f_i),   if E(f_i) > E(f_{i+1}) and E(f_i) > E(f_{i-1});
            { removed,  otherwise.                                        (3)

Local minimum points are calculated by the following formula:

    P_min = { E(f_i),   if E(f_{i+1}) > E(f_i) and E(f_{i-1}) > E(f_i);
            { removed,  otherwise,                                        (4)

where i = 1, 2, ..., n. Therefore, the local extreme points P_extreme are obtained by uniting the two sets:

    P_extreme = P_max ∪ P_min.    (5)

The local extreme points already extract representative frames from the original video sequence. This procedure can be viewed as finding local representatives that roughly describe the original sequence.
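A minimal sketch of Formulas (3)-(5) applied to the entropy curve is given below; `entropy_curve` refers to the hypothetical helper from the previous sketch, and restricting the test to interior frames is our own assumption, since the formulas are only defined for frames with two neighbors:

```python
import numpy as np

def local_extreme_points(E):
    """Indices i whose entropy E[i] is a local maximum (Formula (3))
    or a local minimum (Formula (4)); their union is P_extreme (Formula (5))."""
    E = np.asarray(E)
    i = np.arange(1, len(E) - 1)                      # interior frames only
    is_max = (E[i] > E[i + 1]) & (E[i] > E[i - 1])
    is_min = (E[i] < E[i + 1]) & (E[i] < E[i - 1])
    p_max = i[is_max]
    p_min = i[is_min]
    return np.sort(np.union1d(p_max, p_min))          # P_extreme = P_max ∪ P_min

# e.g.: extreme_idx = local_extreme_points(entropy_curve("gesture.avi"))
```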
2.1.3. Density Clustering.

After obtaining the extreme points, as shown in Figure 1(c), we try to cluster these points into N categories (N is a pre-defined constant for all the datasets), e.g., {1, 2}, {3, 4, 5, 6, 7}, {8}, {9} and {10}. The distribution of these extreme points has the following characteristics: the cluster centers are surrounded by neighbors with a lower local density, and they are at a relatively large distance from any points with a higher local density.

Therefore, we adopt density clustering [35] to further cluster these extreme points P_extreme, as shown in Figure 1(d-f). Density clustering can capture the delicate structure of the 2-D space where the extreme points reside better than traditional clustering strategies such as K-means. First, we search for a local density maximum point as a cluster center, and then spread the cluster label from high-density to low-density points sequentially. For each data point P_k, we compute two quantities: its local density ρ_{P_k} (it is the neighborhood that has a density, not the data point itself) and its distance δ_{P_k} from points of higher density. Both quantities depend only on the distances d_{P_k P_l} between data points, which are assumed to satisfy the triangular inequality. The local density ρ_{P_k} of data point P_k is defined as:

    ρ_{P_k} = Σ_{P_l} χ(d_{P_k P_l} − d_c),    (6)

where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, and d_c is a cutoff distance. Basically, ρ_{P_k} is equal to the number of points that are closer than d_c to point P_k. The algorithm is sensitive only to the relative magnitude of ρ_{P_k} at different points, which implies that the results of the analysis are robust with respect to the choice of d_c for large datasets. A different way of defining ρ_{P_k} is:

    ρ_{P_k} = Σ_{P_l} exp(−(d_{P_k P_l} / d_c)^2),    (7)

which uses a Gaussian kernel. Comparing the two kernels, the cutoff kernel produces discrete values while the Gaussian kernel produces continuous values, which guarantees a smaller probability of conflicts (ties). δ_{P_k} is measured by finding the minimum distance between the point P_k and any other point with higher density:

    δ_{P_k} = min_{P_l : ρ_{P_l} > ρ_{P_k}} d_{P_k P_l}.    (8)

As shown in Figure 1(d), we calculate ρ and δ using Formulas (6) and (8). We then select the number of cluster centers N, namely the points with the N largest δ values; e.g., in Figure 1(e), we select 5 cluster centers. Figure 1(f) illustrates the final clustering result, in which points 2, 5, 8, 9 and 10 are the cluster centers; therefore, the corresponding x-coordinates (frames 2, 9, 14, 20 and 26, shown in Figure 1(g)) are the key frames S_Keyframes of the original video V (shown in Figure 1(a)). The pipeline of the proposed key frames extraction method is summarized in Algorithm 1. Note that the proposed density clustering cannot handle the situation where the entropy of the video sequence is monotonically increasing or decreasing, since we need to select local extreme points. However, in our experiments we observe that no video sequence has monotonically increasing or decreasing frame entropy throughout, which means we can always obtain local extreme points.
Algorithm 1 The proposed key frames extraction method.
Require: The original hand gesture video V, as shown in Figure 1(a), and the number of key frames N (N is a pre-defined constant for all the datasets).
Ensure: The key frames S_Keyframes in the original video V, as shown in Figure 1(g).
1: Calculate the image entropy E(f_i) of each frame in V using Formula (2);
2: Map E(f_i) to a two-dimensional coordinate space;
3: Find local maximum points P_max in the two-dimensional coordinate space using Formula (3);
4: Find local minimum points P_min in the two-dimensional coordinate space using Formula (4);
5: Obtain P_extreme by uniting the local maximum points P_max and local minimum points P_min using Formula (5);
6: Calculate ρ for each point in P_extreme using Formula (6) or (7);
7: Calculate δ for each point in P_extreme using Formula (8);
8: Draw a decision graph like Figure 1(d);
9: Choose the N largest δ values as the clustering centers, as shown in Figure 1(e);
10: The corresponding x-coordinates of the N clustering centers are the key frames S_Keyframes;
11: return S_Keyframes.
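To make the density-peak step of Algorithm 1 (Formulas (6)-(8) plus picking the N largest δ values) concrete, a possible Python sketch is shown below. The points are the 2-D coordinates (i, E(f_i)) of the extreme points; the heuristic for the cutoff distance d_c and all function names are our own assumptions rather than the authors' implementation:

```python
import numpy as np

def key_frames_by_density_peaks(extreme_idx, E, n_keyframes=5, dc=None, gaussian=True):
    """Cluster the extreme points in the (i, E(f_i)) plane with the
    density-peak rule [35] and return the frame indices of the
    n_keyframes points with the largest delta (Algorithm 1, steps 6-10)."""
    pts = np.stack([extreme_idx.astype(float), np.asarray(E)[extreme_idx]], axis=1)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)   # pairwise distances
    if dc is None:
        # assumed heuristic for d_c; the paper does not fix a value
        dc = np.percentile(d[d > 0], 20) if np.any(d > 0) else 1.0

    if gaussian:
        rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0               # Formula (7), Gaussian kernel
    else:
        rho = (d < dc).sum(axis=1) - 1.0                             # Formula (6), cutoff kernel

    delta = np.empty(len(pts))
    for k in range(len(pts)):
        higher = rho > rho[k]
        # Formula (8): distance to the nearest point of higher density;
        # the densest point gets the largest distance by convention.
        delta[k] = d[k, higher].min() if np.any(higher) else d[k].max()

    centers = np.argsort(-delta)[:n_keyframes]                       # N largest delta values
    return np.sort(extreme_idx[centers])                             # key frame indices
```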
2.2. Feature Fusion Strategy

In view of each key frame and the relationships between key frames, and in order to better represent each key frame sequence, we try to describe not only each key frame itself but also the variation between the key frames. That is, we not only extract the most representative frames and map them into a simple 2-D space, but also describe how these frames move in this space. We believe this two-phase strategy can set up a "holographic" description of each hand gesture sequence. Hadid and Pietikäinen [40] also demonstrate that excellent results can be obtained by combining appearance and motion features for dynamic video analysis. Jain et al. [41] propose a two-stream fully convolutional neural network which fuses motion and appearance in a unified framework and substantially improves the state-of-the-art results for segmenting unseen objects in videos. Xu et al. [42] exploit the appearance and motion information residing in videos with an attention mechanism for the video question answering task. For this purpose, we propose a feature fusion strategy to capture these two aspects: an appearance-based approach is applied to each frame individually and merely represents spatial differences, while a motion-based method describes the evolution over time. Thus we combine appearance and motion features to better describe the image sequence. Meanwhile, to better weight these two features, we also introduce an efficient strategy to balance them. Tang et al. [16] also propose a feature fusion method which fuses features extracted from different sleeves to boost the recognition performance.
Figure 2 shows the whole proposed feature fusion procedure for the obtained key frames. After extracting the key frames, we use the key frame sequence in place of the original sequence. We begin by extracting key frames from the original sequence, and then extract appearance and motion features (hist1 and hist2 in Figure 2) from the key frame sequence, respectively. To further increase the importance of the more useful feature, we add weights to the appearance and motion features. By feeding hist1 and hist2 to the SVM classifier separately, we obtain two classification accuracies R = {R_a, R_m}. Based on the assumption that the higher the rate is, the better the representation becomes, we compute the weights as follows:

    T = (R − min(R)) / ((100 − min(R)) / 10).    (9)

Finally, considering that the weight of the lowest rate is 1, the other weights can be obtained according to a linear relationship of their differences to that with the lowest rate. The final step is written as:

    T1 = round(T),
    T2 = T × (max(T1) − 1) / max(T) + 1,    (10)
    W = round(T2),

in which W = {α, β} is the weight vector corresponding to hist1 and hist2.
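As a small worked sketch of Formulas (9) and (10), assuming the two accuracies R_a and R_m are given in percent and are not equal (the names below are ours, not from the authors' code):

```python
import numpy as np

def fusion_weights(acc_appearance, acc_motion):
    """Weights (alpha, beta) for hist1/hist2 from the two validation
    accuracies R = {R_a, R_m} (in percent), following Formulas (9)-(10).
    Assumes R_a != R_m, so that max(T) is non-zero."""
    R = np.array([acc_appearance, acc_motion], dtype=float)
    T = (R - R.min()) / ((100.0 - R.min()) / 10.0)     # Formula (9)
    T1 = np.round(T)
    T2 = T * (T1.max() - 1.0) / T.max() + 1.0          # Formula (10)
    W = np.round(T2).astype(int)                       # W = {alpha, beta}
    return tuple(W)

# e.g. fusion_weights(92.4, 60.8) returns (8, 1): the lower rate gets weight 1,
# consistent with the setting alpha = 8, beta = 1 used in Section 3.3.
```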
There are many existing descriptors with which hist1 and hist2 can be extracted. In other words, the fusion strategy does not depend on specific descriptors, which guarantees its great potential in applications. In terms of hist1, we can use Gist [43], rootSIFT [44], HSV, SURF [45], HOG [46], LBP [47] or its variation CLBP [48] to extract the appearance cue of each image. As for hist2, LBP-TOP, VLBP [49] and SIFT 3D [50] are used to extract motion cues from the whole key frame sequence. We also use Bag-of-Features (BoF) [24, 51] to represent these appearance and motion cues. At the end of the procedure, we concatenate the weighted hist1 and hist2 to obtain the final representation hist (as shown in Figure 2).

2.3. Hand Gesture Recognition Framework

The hand gesture recognition framework based on key frames and feature fusion is composed of two stages, training and testing, which are summarized in Algorithm 2. In the training stage, we first extract the key frames S_Keyframes using Algorithm 1 (step 3). Then we extract appearance features from each key frame using descriptors such as SURF, LBP, etc., and employ BoF [24, 51] to represent these features as hist1_q (step 4). Next we use LBP-TOP, VLBP or SIFT 3D to extract motion features from the whole key frame sequence, producing the corresponding histogram hist2_q (step 5). After that, hist1_q and hist2_q are fed to separate classifiers to obtain R = {R_a, R_m} (step 6). We then calculate α and β by Formulas (9) and (10) (step 7), and the training histogram hist_q is constructed from α·hist1_q and β·hist2_q (step 8). At the end of the iterations, we obtain the training representation vectors hist, which are fed together with the corresponding labels L_label to an SVM classifier (step 10). During the testing stage, the testing hand gesture representation is obtained in the same way as in the training stage (step 12), and the trained SVM classifier is then used to predict the gesture label t_label (step 13).
Figure 2: The framework of the proposed feature extraction and fusion methods.
Algorithm 2 The proposed hand gesture recognition framework.
Require: L hand gesture videos for training, as shown in Figure 1(a), with corresponding gesture labels L_label; a testing hand gesture video t.
Ensure: The hand gesture label t_label.
1: TRAINING STAGE:
2: for q = 1 to L do
3:   S^q_Keyframes ← Algorithm 1;
4:   hist1_q ← (SURF or GIST, etc.) ∪ BoF;
5:   hist2_q ← (VLBP or LBP-TOP or SIFT 3D, etc.) ∪ BoF;
6:   R = {R_a, R_m} ← training a classifier using hist1_q and hist2_q;
7:   α and β ← Formula (9) and (10);
8:   hist_q ← {α·hist1_q, β·hist2_q};
9: end for
10: Classifier ← hist ∪ L_label;
11: TESTING STAGE:
12: Obtain the hand gesture representation hist_t for the testing video t using the same method as the training stage;
13: Obtain t_label by the classifier after calculation;
14: return t_label.
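For illustration only, the fusion-and-SVM part of Algorithm 2 might be sketched as follows with scikit-learn; `extract_hist1`, `extract_hist2` and the weights (α, β) stand in for the BoF-encoded appearance/motion descriptors and the weighting rule of Formulas (9)-(10), and none of these names come from the authors' released code (which is implemented in Matlab):

```python
import numpy as np
from sklearn.svm import SVC

def fuse(hist1, hist2, alpha, beta):
    """Final representation hist = [alpha*hist1, beta*hist2] (Figure 2)."""
    return np.concatenate([alpha * hist1, beta * hist2])

def train_recognizer(videos, labels, extract_hist1, extract_hist2, alpha, beta):
    """Training stage of Algorithm 2 (steps 2-10), given per-video
    appearance (hist1) and motion (hist2) descriptor extractors."""
    X = [fuse(extract_hist1(v), extract_hist2(v), alpha, beta) for v in videos]
    clf = SVC(kernel="linear")            # SVM on the fused histograms
    clf.fit(np.stack(X), labels)
    return clf

def predict_gesture(clf, video, extract_hist1, extract_hist2, alpha, beta):
    """Testing stage (steps 12-13): same representation, then SVM prediction."""
    hist_t = fuse(extract_hist1(video), extract_hist2(video), alpha, beta)
    return clf.predict(hist_t[None, :])[0]
```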
3. Experiments and Analysis

3.1. Datasets and Settings

To evaluate the effectiveness of the proposed method, we conduct experiments on two publicly available datasets (the Cambridge [52] and Northwestern University [53] hand gesture datasets) and two collected datasets (the HandGesture and Action3D hand gesture datasets, both of which will be released after the paper is accepted). Some characteristics of these datasets are listed in Table 1.

Cambridge Hand Gesture dataset is a commonly used benchmark gesture dataset with 900 video clips of 9 hand gesture classes defined by 3 primitive hand shapes (i.e., flat, spread, V-shape) and 3 primitive motions (i.e., leftward, rightward, contract). Each class includes 100 sequences captured with 5 different illuminations, 10 arbitrary motions and 2 subjects. Each sequence is recorded in front of a fixed camera with gestures coarsely isolated in the spatial and temporal dimensions.

Northwestern University Hand Gesture dataset is a more diverse dataset which contains 10 categories of dynamic hand gestures in total: move right, move left, rotate up, rotate down, move downright, move right-down, clockwise circle, counterclockwise circle, "Z" and cross. This dataset is performed by 15 subjects, and each subject contributes 70 sequences of these ten categories with seven postures (i.e., Fist, Fingers extended, "OK", Index, Side Hand, Side Index and Thumb).

The two datasets mentioned above both have clear backgrounds, with sequences snipped tightly around the gestures. However, how well does the method work on videos from "the wild", with significant clutter, extraneous motion and continuously running video without pre-snipping? To validate our approach under such conditions, we introduce two new datasets, called HandGesture and Action3D.

The HandGesture dataset consists of 132 video sequences of 640 × 360 resolution, recorded from 11 different subjects (7 males and 4 females), each performing 12 different gestures ("0"-"9", "NO" and "OK").

We also acquired the Action3D dataset, which consists of 1,620 image sequences of 6 hand gesture classes (box, high wave, horizontal wave, curl, circle and hand up), defined by 2 different hands (right and left hand) and 5 situations (sit, stand, with a pillow, with a laptop and with a person). Each class contains 270 image sequences (5 different situations × 2 different hands × 3 times × 9 subjects). Each sequence was recorded in front of a fixed camera with gestures roughly isolated in space and time. All video sequences were uniformly resized to 320 × 240 in our method.

3.2. Parameter Analysis

Two parameters are involved in our framework: the number of key frames N and the dictionary size D in BoF. Firstly, we extract N = 3, 4, ..., 9 key frames from the original video, respectively, and then extract SURF features from each key frame. Every key point detected by SURF provides a 64-D vector describing its local texture. Finally, we adopt BoF to represent each key frame with dictionary sizes D = 1, 2, 4, ..., 4096, respectively.
Table 1: Characteristics of the datasets used in our hand gesture recognition experiments.
Dataset # categories # videos # training # validation # testing
Cambridge 9 900 450 225 225
Northwestern 10 1,050 550 250 250
HandGesture 12 132 66 33 33
Action3D 6 1,620 810 405 405
The numbers of training, validation and testing samples are listed in Table 1. We repeat all the experiments 20 times with different random splits of the training and testing samples to obtain reliable results, and the final classification accuracy is reported as the average over all runs. Figure 3 presents the accuracy results on the four datasets. As shown in Figure 3 (a) and (c), the accuracy first rises to a peak at D = 64 and then drops; however, as shown in Figure 3 (b) and (d), the accuracy peaks at D = 16. Thus, we set D = 64 on the Cambridge and Northwestern datasets, and D = 16 on our two proposed datasets. We also observe that the more key frames we use, the more time is consumed. Thus, to balance accuracy and efficiency, we set N = 5 on all four datasets in the following experiments.
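As an illustration of the BoF representation used in this parameter study, the 64-D SURF descriptors of a key frame can be quantized against a k-means dictionary of size D and pooled into a D-bin histogram. The sketch below uses OpenCV and scikit-learn; note that SURF lives in the opencv-contrib package and may be unavailable in some builds, and all names here are our own:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(descriptor_sets, D=64):
    """k-means codebook with D visual words from pooled SURF descriptors."""
    all_desc = np.vstack([d for d in descriptor_sets if d is not None])
    return KMeans(n_clusters=D, n_init=10, random_state=0).fit(all_desc)

def bof_histogram(gray_keyframe, codebook, D=64):
    """BoF encoding of one key frame: detect SURF key points (64-D each),
    assign them to the nearest visual word, and L1-normalize the counts."""
    surf = cv2.xfeatures2d.SURF_create()       # requires opencv-contrib-python
    _, desc = surf.detectAndCompute(gray_keyframe, None)
    hist = np.zeros(D)
    if desc is not None:
        words = codebook.predict(desc.astype(np.float64))
        np.add.at(hist, words, 1.0)
        hist /= hist.sum()
    return hist
```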
3.3. Experiment Evaluation

To evaluate the necessity and efficiency of the proposed strategy, we test it from multiple aspects: (1) necessity of key frames extraction; (2) different kernel tricks; (3) different fusion strategies; (4) different clustering methods; (5) performance comparisons with the state-of-the-art; (6) efficiency. Aspects (1)-(4) demonstrate the rationality and validity of the methods, (5) compares the proposed method with others, and (6) shows its efficiency.

(1) Comparison with Different Key Frames Extraction Methods. We discuss here whether our key frames method is necessary. For the Cambridge and Action3D datasets, we only extract LBP-TOP and then concatenate the three orthogonal planes of LBP-TOP. For the Northwestern and HandGesture datasets, 200 points are randomly selected from each video using SIFT 3D. As we can see from Table 2, our approach outperforms the other four methods on both accuracy and efficiency; it is therefore not only a theoretical improvement but also has an empirical advantage.

(2) Gaussian Kernel vs. Cutoff Kernel. We also compare the Gaussian and cutoff kernels. We adopt SURF to extract features from each key frame. Comparison results are shown in Table 3. We observe that there is only a small difference between using the Gaussian and the cutoff kernel.

(3) Comparison with Different Feature Fusion Strategies. We demonstrate here that combining appearance (hand posture) and motion (the way the hand is moving) boosts the hand gesture recognition task. Moreover, we also compare different schemes based on appearance and motion, respectively. For feature fusion, we set α and β to 8 and 1, respectively. As shown in Table 4, fusion strategies are much better than appearance- or motion-based methods alone, which demonstrates the necessity of our feature fusion strategy. Motion-based methods achieve the worst results, which illustrates that in our task spatial cues are more important than temporal cues. The spatial cues represent the differences between different gesture classes, also called inter-class differences, while the temporal cues capture the differences among different frames of the same gesture sequence, also called intra-class differences. Inter-class differences are always greater than intra-class differences, which means the spatial cues can represent more discriminative features than the temporal cues. In our task, we observe that the differences between different types of gestures are much greater than the differences within the same gesture sequence, which means the spatial cues are more discriminative than the temporal cues. However, if the hand gesture moves fast and changes greatly within one sequence, the temporal cues could become more important.

(4) Comparison with Different Clustering Methods. We also compare different clustering methods for the key frames extraction. As shown in Table 5, density clustering is much better than K-means, OPTICS [54] and DBSCAN [55].

(5) Comparison with the State-of-the-Art. For the Cambridge and Northwestern datasets, we compare our results with the state-of-the-art methods in Tables 6 and 7. We achieve 98.23% ± 0.84% and 96.89% ± 1.08% recognition accuracy on the Cambridge and Northwestern datasets, both of which exceed the other baseline methods.

(6) Efficiency. Finally, we investigate our approach in terms of the computation time for classifying one test video. We run our experiments on a 3.40-GHz i7-3770 CPU with 8 GB memory. The proposed method is implemented in Matlab; code and datasets will be released after the paper is accepted. As we can see from Table 8, the time for classifying one test sequence is 4.31s, 10.89s, 13.06s and 4.26s for the Cambridge, Northwestern, HandGesture and Action3D datasets, respectively. We observe that the proposed key frame extraction method, including entropy calculation and density clustering, can be finished within around 1s on the Cambridge, Northwestern and Action3D datasets, while for the HandGesture dataset, which contains about 200 frames per video, it only costs about 3s per video. We also note that the most time-consuming part is feature extraction, and we have two solutions to improve it: (i) we can reduce the size of the images; we note that Cambridge and Action3D only consume about 4s, while Northwestern and HandGesture cost about 11s and 13s, respectively.
Figure 3: Parameters N and D selection on the four datasets.
Table 2: Comparison between different key frames extraction methods on the Cambridge, Northwestern, HandGesture and Action3D datasets.
Method | Accuracy (Cambridge) | Time(s) (Cambridge) | Accuracy (Northwestern) | Time(s) (Northwestern)
Original Sequence 35.26% ± 3.15% 20,803 21.65% ± 1.23% 108,029
5 evenly-spaced frames (in time) 56.13% ± 5.46% 1,189 58.79% ± 2.64% 26,303
Zhao and Elgammal [29] 58.14% ± 3.36% 1,432 61.45% ± 3.45% 27,789
Carlsson and Sullivan [30] 50.57% ± 4.78% 1,631 51.27% ± 3.86% 29,568
Our key frames method 60.78% ± 2.21% 1,152 64.24% ± 2.15% 25,214
Method | Accuracy (HandGesture) | Time(s) (HandGesture) | Accuracy (Action3D) | Time(s) (Action3D)
Original Sequence 42.54% ± 4.61% 8,549 34.56% ± 2.65% 284,489
5 evenly-spaced frames (in time) 58.32% ± 3.88% 1,689 52.13% ± 2.31% 18,430
Zhao and Elgammal [29] 60.45% ± 4.56% 1,895 54.56% ± 1.97% 20,143
Carlsson and Sullivan [30] 50.78% ± 4.06% 2,154 46.34% ± 2.78% 23,768
Our key frames method 65.18% ± 3.62% 1,645 56.13% ± 1.89% 16,294
Table 4: Comparison with different features fusion strategies on the Cambridge, Northwestern, HandGesture and Action3D datasets. N and
D denote the number of key frames and the dictionary number.
Appearance-based Dimension Cambridge Northwestern HandGesture Action3D
SURF N ∗D 92.37% ± 1.67% 81.31% ± 1.49% 96.32% ± 3.35% 96.26% ± 1.39%
GIST N ∗D 88.15% ± 1.65% 78.41% ± 1.91% 92.64% ± 2.89% 91.23% ± 1.84%
Motion-based - Cambridge Northwestern HandGesture Action3D
LBP-TOP 177 60.78% ± 2.21% 51.36% ± 2.16% 60.84% ± 2.61% 56.13% ± 1.89%
VLBP 16,386 50.36% ± 3.56% 42.11% ± 3.04% 49.78% ± 3.51% 44.26% ± 4.23%
SIFT 3D N ∗D 68.94% ± 4.81% 64.24% ± 2.15% 65.18% ± 3.62% 62.04% ± 2.89%
Appearance + Motion - Cambridge Northwestern HandGesture Action3D
SURF + LBP-TOP N ∗ D+177 95.75% ± 0.79% 93.54% ± 1.36% 97.25% ± 0.79% 98.53% ± 1.31%
SURF + VLBP N ∗ D+16,386 92.52% ± 1.27% 91.22% ± 0.95% 96.82% ± 0.95% 97.21% ± 0.94%
SURF + SIFT 3D 2∗N ∗D 98.23% ± 0.84% 96.89% ± 1.08% 99.21% ± 0.88% 98.98% ± 0.65%
GIST + LBP-TOP N ∗ D+177 91.87% ± 1.65% 86.26% ± 0.94% 93.56% ± 1.35% 93.11% ± 0.89%
GIST + VLBP N ∗ D+16,386 90.56% ± 0.87% 82.87% ± 1.84% 92.88% ± 1.21% 92.63% ± 0.64%
GIST + SIFT 3D 2∗N ∗D 93.52% ± 0.63% 88.54% ± 1.62% 94.16% ± 0.67% 94.21% ± 0.61%
Table 5: Comparison with different clustering methods on the Cambridge, Northwestern, HandGesture and Action3D datasets.
Clustering Method OPTICS [54] DBSCAN [55] K-means Density Clustering [35]
Cambridge 88.15% ± 1.51% 90.34% ± 1.78% 86.26% ± 2.51% 98.23% ± 0.84%
Northwestern 86.34% ± 2.45% 88.35% ± 1.67% 83.65% ± 1.06% 96.89% ± 1.08%
HandGesture 84.56% ± 1.89% 85.98% ± 1.76% 84.69% ± 1.98% 99.21% ± 0.88%
Action3D 83.56% ± 1.56% 87.43% ± 1.63% 82.36% ± 1.46% 98.98% ± 0.65%
Table 7: Comparison between the state-of-the-art methods and our method on the Northwestern University dataset.
Northwestern Methods Accuracy
Liu and Shao [59] Genetic Programming 96.1%
Shen et al. [53] Motion Divergence fields 95.8%
Our method Key Frames + Feature Fusion 96.89% ± 1.08%
Table 8: Computation time for classifying a test sequence on the Cambridge, Northwestern, HandGesture and Action3D datasets.
Time Cambridge Northwestern HandGesture Action3D
Entropy Calculation 0.93s 0.84s 3.21s 0.75s
Density Clustering 0.31s 0.34s 0.43s 0.38s
Feature Extraction 3.07s 9.71s 9.42s 3.13s
SVM Classification 0.60 ms 0.51ms 0.46ms 0.65 ms
Our Full Model 4.31s 10.89s 13.06s 4.26s
Liu and Shao [59] 6.45s 13.32s 15.32s 6.43s
Zhao and Elgammal [29] 5.34s 11.78s 14.98s 5.21s
... Action3D datasets. We re-implement both methods (Zhao and Elgammal [29] and Liu and Shao [59]) with the same running settings for fair comparison, including the hardware platform and programming language. The results are shown in Table 8, and we can see that the proposed method achieves better results than both methods.

4. Conclusion

In order to build a fast and robust gesture recognition system, in this paper we present a novel key frames extraction method and a feature fusion strategy. Considering the speed of recognition, we propose a new key frames extraction method based on image entropy and density clustering, which can greatly reduce the redundant information of the original video. Moreover, we further propose an efficient feature fusion strategy which combines appearance and motion cues for robust hand gesture recognition. Experimental results show that the proposed approach outperforms the state-of-the-art methods on the Cambridge (98.23% ± 0.84%) and Northwestern (96.89% ± 1.08%) datasets. To evaluate our method on videos from "the wild" with significant clutter, extraneous motion and no pre-snipping, we introduce two new datasets, namely HandGesture and Action3D, on which we achieve accuracies of 99.21% ± 0.88% and 98.98% ± 0.65%, respectively. In terms of recognition speed, we also achieve better results than the state-of-the-art approaches for classifying one test sequence on the Cambridge, Northwestern, HandGesture and Action3D datasets.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (NSFC, U1613209), the Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467), and the Scientific Research Project of Shenzhen City (JCYJ20170306164738129).

References

[1] C. Wang, Z. Liu, S.-C. Chan, Superpixel-based hand gesture recognition with kinect depth camera, IEEE TMM 17 (1) (2015) 29–39.
[2] Z. Ren, J. Yuan, J. Meng, Z. Zhang, Robust part-based hand gesture recognition using kinect sensor, IEEE TMM 15 (5) (2013) 1110–1120.
[3] H. Hikawa, K. Kaida, Novel fpga implementation of hand sign recognition system with som–hebb classifier, IEEE TCSVT 25 (1) (2015) 153–166.
[4] G. Marin, F. Dominio, P. Zanuttigh, Hand gesture recognition with leap motion and kinect devices, in: ICIP, 2014.
[5] A. Kuznetsova, L. Leal-Taixé, B. Rosenhahn, Real-time sign language recognition using a consumer depth camera, in: ICCVW, 2013.
[6] Y. Yao, Y. Fu, Contour model based hand-gesture recognition using kinect sensor, IEEE TCSVT 24 (11) (2014) 1935–1944.
[7] L. Prasuhn, Y. Oyamada, Y. Mochizuki, H. Ishikawa, A hog-based hand gesture recognition system on a mobile device, in: ICIP, 2014.
[8] P. Neto, D. Pereira, J. Norberto Pires, A. P. Moreira, Real-time and continuous hand gesture spotting: an approach based on artificial neural networks, in: ICRA, 2013.
[9] R. Schramm, C. Rosito Jung, E. Miranda, Dynamic time warping for music conducting gestures evaluation, IEEE TMM 17 (2) (2014) 243–255.
[10] S. Lian, W. Hu, K. Wang, Automatic user state recognition for hand gesture based low-cost television control system, IEEE TCE 60 (1) (2014) 107–115.
[11] W. T. Freeman, C. Weissman, Television control by hand gestures, in: AFGRW, 1995.
[12] E. Ohn-Bar, M. M. Trivedi, Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations, IEEE TITS 15 (6) (2014) 2368–2377.
[13] E. Ohn-Bar, M. M. Trivedi, The power is in your hands: 3d analysis of hand gestures in naturalistic video, in: CVPRW, 2013.
[14] S. Sathayanarayana, R. K. Satzoda, A. Carini, M. Lee, L. Salamanca, J. Reilly, D. Forster, M. Bartlett, G. Littlewort, Towards automated understanding of student-tutor interactions using visual deictic gestures, in: CVPRW, 2014.
[15] S. Sathyanarayana, G. Littlewort, M. Bartlett, Hand gestures for intelligent tutoring systems: Dataset, techniques & evaluation, in: ICCVW, 2013.
[16] H. Tang, W. Wang, D. Xu, Y. Yan, N. Sebe, Gesturegan for hand gesture-to-gesture translation in the wild, in: ACM MM, 2018.
[17] L. Liu, L. Shao, Learning discriminative representations from rgb-d video data, in: IJCAI, 2013.
[18] M. Yu, L. Liu, L. Shao, Structure-preserving binary representations for rgb-d action recognition, IEEE TPAMI 38 (8) (2016) 1651–1664.
[19] H. Tang, H. Liu, W. Xiao, Gender classification using pyramid segmentation for unconstrained back-facing video sequences, in: ACM MM, 2015.
[20] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3d convolutional networks, in: ICCV, 2015.
[21] L. Shao, L. Liu, M. Yu, Kernelized multiview projection for robust action recognition, Springer IJCV 118 (2) (2016) 115–129.
[22] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, L. Van Gool, Temporal segment networks: Towards good practices for deep action recognition, in: ECCV, 2016.
[23] S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition, IEEE TPAMI 35 (1) (2013) 221–231.
[24] H. Liu, H. Tang, W. Xiao, Z. Guo, L. Tian, Y. Gao, Sequential bag-of-words model for human action classification, CAAI Transactions on Intelligence Technology 1 (2) (2016) 125–136.
[25] J. Qin, L. Liu, L. Shao, F. Shen, B. Ni, J. Chen, Y. Wang, Zero-shot action recognition with error-correcting output codes, in: CVPR, 2017.
[26] H. Liu, L. Tian, M. Liu, H. Tang, Sdm-bsm: A fusing depth scheme for human action recognition, in: ICIP, 2015.
[27] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: NIPS, 2014.
[28] C. Feichtenhofer, A. Pinz, A. Zisserman, Convolutional two-stream network fusion for video action recognition, in: CVPR, 2016.
[29] Z. Zhao, A. M. Elgammal, Information theoretic key frame selection for action recognition, in: BMVC, 2008.
[30] S. Carlsson, J. Sullivan, Action recognition by shape matching to key frames, in: Workshop on Models versus Exemplars in Computer Vision, 2001.
[31] A. Brink, Using spatial information as an aid to maximum entropy image threshold selection, Elsevier PRL 17 (1) (1996) 29–36.
[32] B. S. Min, D. K. Lim, S. J. Kim, J. H. Lee, A novel method of determining parameters of clahe based on image entropy, International Journal of Software Engineering and Its Applications 7 (5) (2013) 113–120.
[33] X. Wang, C. Chen, Ship detection for complex background sar images based on a multiscale variance weighted image entropy method, IEEE Geoscience and Remote Sensing Letters 14 (2) (2017) 184–187.
[34] L. Shao, L. Ji, Motion histogram analysis based key frame extraction for human action/activity representation, in: CRV, 2009.
[35] A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
[36] S. K. Kuanar, R. Panda, A. S. Chowdhury, Video key frame extraction through dynamic delaunay clustering with a structural constraint, Elsevier JVCIP 24 (7) (2013) 1212–1227.
[37] S. E. F. De Avila, A. P. B. Lopes, A. da Luz, A. de Albuquerque Araújo, Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method, Elsevier PRL 32 (1) (2011) 56–68.
[38] R. Vázquez-Martín, A. Bandera, Spatio-temporal feature-based keyframe detection from video shots using spectral clustering, Elsevier PRL 34 (7) (2013) 770–779.
[39] R. Panda, S. K. Kuanar, A. S. Chowdhury, Scalable video summarization using skeleton graph and random walk, in: ICPR, 2014.
[40] A. Hadid, M. Pietikäinen, Combining appearance and motion for face and gender recognition from videos, Elsevier PR 42 (11) (2009) 2818–2827.
[41] S. D. Jain, B. Xiong, K. Grauman, Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos, in: CVPR, 2017.
[42] D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, Y. Zhuang, Video question answering via gradually refined attention over appearance and motion, in: ACM MM, 2017.
[43] A. Oliva, A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, Springer IJCV 42 (3) (2001) 145–175.
[44] R. Arandjelovic, A. Zisserman, Three things everyone should know to improve object retrieval, in: CVPR, 2012.
[45] H. Bay, T. Tuytelaars, L. Van Gool, Surf: Speeded up robust features, in: ECCV, 2006.
[46] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR, 2005.
[47] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE TPAMI 24 (7) (2002) 971–987.
[48] Z. Guo, L. Zhang, D. Zhang, A completed modeling of local binary pattern operator for texture classification, IEEE TIP 19 (6) (2010) 1657–1663.
[49] G. Zhao, M. Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE TPAMI 29 (6) (2007) 915–928.
[50] P. Scovanner, S. Ali, M. Shah, A 3-dimensional sift descriptor and its application to action recognition, in: ACM MM, 2007.
[51] N. H. Dardas, N. D. Georganas, Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques, IEEE TIM 60 (11) (2011) 3592–3607.
[52] T.-K. Kim, K.-Y. K. Wong, R. Cipolla, Tensor canonical correlation analysis for action classification, in: CVPR, 2007.
[53] X. Shen, G. Hua, L. Williams, Y. Wu, Dynamic hand gesture recognition: An exemplar-based approach from motion divergence fields, Elsevier IVC 30 (3) (2012) 227–235.
[54] M. Ankerst, M. M. Breunig, H.-P. Kriegel, J. Sander, Optics: ordering points to identify the clustering structure, ACM Sigmod Record 28 (2) (1999) 49–60.
[55] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, 1996.
[56] S.-F. Wong, R. Cipolla, Real-time interpretation of hand motions using a sparse bayesian classifier on motion gradient orientation images, in: BMVC, 2005.
[57] J. C. Niebles, H. Wang, L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, Springer IJCV 79 (3) (2008) 299–318.
[58] T.-K. Kim, R. Cipolla, Canonical correlation analysis of video volume tensors for action categorization and detection, IEEE TPAMI 31 (8) (2009) 1415–1428.
[59] L. Liu, L. Shao, Synthesis of spatio-temporal descriptors for dynamic hand gesture recognition using genetic programming, in: FGW, 2013.
[60] Y. M. Lui, J. R. Beveridge, M. Kirby, Action classification on product manifolds, in: CVPR, 2010.
[61] Y. M. Lui, J. R. Beveridge, Tangent bundle for human action recognition, in: FG, 2011.
[62] S.-F. Wong, T.-K. Kim, R. Cipolla, Learning motion categories using both semantic and structural information, in: CVPR, 2007.
[63] A. Sanin, C. Sanderson, M. T. Harandi, B. C. Lovell, Spatio-temporal covariance descriptors for action and gesture recognition, in: WACV, 2013.
[64] L. Baraldi, F. Paci, G. Serra, L. Benini, R. Cucchiara, Gesture recognition in ego-centric videos using dense trajectories and hand segmentation, in: CVPRW, 2014.