Abstract
Human Action Recognition is a research hotspot in computer vision, yet the complexity of real-world environments and the diversity of actions still make it challenging. At the same time, traditional CNNs suffer from a single feature scale, accuracy degradation as networks deepen, and an excessive number of parameters. To address these problems, this paper proposes a novel residual network model based on Multi-scale Feature Fusion and Global Average Pooling. The model uses a Multi-scale Feature Fusion module to extract features at different scales, enriching the spatio-temporal information. At the end of the network, Global Average Pooling is used instead of a Fully Connected layer. Compared with a Fully Connected layer, Global Average Pooling weakens the dependence on the relative positions of different features, so the features learned by the convolutional layers are used more effectively. In addition, Global Average Pooling maps output channels directly to action categories, which greatly reduces the number of model parameters. The model is evaluated on the UT-Interaction, UCF11 (YouTube Action), UCF101 and CAVIAR datasets. The results show that, compared with state-of-the-art approaches, the proposed method achieves high recognition accuracy and strong robustness, and performs well on datasets with complex backgrounds and diverse action categories.
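To make the two architectural ideas in the abstract concrete, the sketch below shows a multi-scale residual block (parallel convolutions at different kernel sizes, fused and added to an identity shortcut) and a Global Average Pooling head that maps channels directly to action classes in place of a Fully Connected layer. It is a minimal PyTorch sketch under assumed settings: the kernel sizes, channel split, and class count are illustrative and do not reproduce the authors' exact network.

```python
# Minimal sketch: multi-scale feature fusion residual block + GAP classification head.
# Kernel sizes, channel counts and class count are illustrative assumptions,
# not the authors' exact configuration.
import torch
import torch.nn as nn


class MultiScaleResidualBlock(nn.Module):
    """Residual block whose branches see 1x1, 3x3 and 5x5 receptive fields."""

    def __init__(self, channels: int):
        super().__init__()
        branch_channels = channels // 4  # hypothetical split of the channel budget
        self.branch1 = nn.Conv2d(channels, branch_channels, kernel_size=1)
        self.branch3 = nn.Conv2d(channels, branch_channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, branch_channels, kernel_size=5, padding=2)
        # Fuse the concatenated multi-scale features back to the input width
        self.fuse = nn.Conv2d(3 * branch_channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x)], dim=1
        )
        # Identity shortcut keeps gradients flowing through deep stacks
        return self.relu(self.bn(self.fuse(multi_scale)) + x)


class GAPClassifier(nn.Module):
    """1x1 conv maps channels to classes; GAP then replaces the Fully Connected layer."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.to_classes = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.gap = nn.AdaptiveAvgPool2d(1)  # average over all spatial positions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gap(self.to_classes(x)).flatten(1)  # (batch, num_classes) logits


if __name__ == "__main__":
    frames = torch.randn(2, 64, 56, 56)                 # a batch of frame feature maps
    block = MultiScaleResidualBlock(channels=64)
    head = GAPClassifier(channels=64, num_classes=11)    # e.g. the 11 classes of UCF11
    print(head(block(frames)).shape)                     # torch.Size([2, 11])
```

Because the GAP head has no Fully Connected weight matrix, its parameter count depends only on the 1x1 convolution, which is one way the abstract's claim about reduced model parameters can be realized.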
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant Nos. 62066036 and 61663036).