Multi-scale residual network model combined with Global Average Pooling for action recognition

Li, Jianjun; Han, Yu; Zhang, Ming; Li, Gang; Zhang, Baohua

doi:10.1007/s11042-021-11435-5

Published: 01 October 2021

Volume 81, pages 1375–1393, (2022)

Multimedia Tools and Applications

Jianjun Li¹ (ORCID: 0000-0003-3003-8344), Yu Han¹, Ming Zhang¹, Gang Li¹ & Baohua Zhang¹

Abstract

Human Action Recognition is a research hotspot in computer vision. However, because of the complexity of real environments and the diversity of actions, it still faces many challenges. At the same time, traditional CNNs suffer from single-scale features, accuracy degradation as networks become deeper, and an excessive number of parameters. To address these problems, this paper proposes a novel residual network model based on Multi-scale Feature Fusion and Global Average Pooling. The model uses a Multi-scale Feature Fusion module to extract feature information at different scales, which enriches the spatio-temporal information. At the end of the network, Global Average Pooling replaces the Fully Connected layer. Compared with a Fully Connected layer, Global Average Pooling dilutes the dependence on the relative positions of different features, so the features learned by the convolutional layers are more effective. In addition, Global Average Pooling maps output channels directly to feature categories, which reduces the number of model parameters. The model is evaluated on the UT-Interaction dataset, UCF11 (YouTube Action dataset), the UCF101 dataset, and the CAVIAR dataset. The results show that, compared with state-of-the-art approaches, this approach achieves high recognition accuracy and excellent robustness, and performs well on datasets with complex backgrounds and diverse action categories.
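To make the Global Average Pooling argument concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the last convolution already emits one feature map per class, so that each pooled channel can serve directly as a class score. The 7 x 7 map size and the toy 2 x 2 inputs are hypothetical; the 101 classes correspond to UCF101.

```python
def global_average_pool(feature_maps):
    """Average each channel's H x W map down to a single number.

    feature_maps: list of C channels, each a list of H rows of W floats.
    Returns a list of C floats, one per channel.
    """
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def fc_head_params(channels, height, width, num_classes):
    """Weights plus biases of a fully connected layer on the flattened maps."""
    return channels * height * width * num_classes + num_classes


# Hypothetical final feature maps: 2 channels (one per class), each 2 x 2.
maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]
scores = global_average_pool(maps)

# Moving the only activation of the second channel to another position
# leaves its pooled value unchanged: GAP discards relative position.
shifted = [maps[0], [[4.0, 0.0], [0.0, 0.0]]]
shifted_scores = global_average_pool(shifted)

# Parameter cost of the classification head (assumed shapes: 101 maps of 7 x 7).
fc_params = fc_head_params(channels=101, height=7, width=7, num_classes=101)
gap_params = 0  # pooling itself has no trainable weights

print(scores, shifted_scores)  # [4.0, 1.0] [4.0, 1.0]
print(fc_params, gap_params)   # 499950 0
```

Under these assumed shapes, a fully connected head would add roughly half a million weights, while the pooling head adds none. If the number of channels differs from the number of classes, a small linear layer after pooling is still needed, and the saving is correspondingly smaller.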


Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant Nos. 62066036 and 61663036).

Author information

Authors and Affiliations

School of Electronic and Information Engineering, Inner Mongolia University of Science & Technology, No. 7 Arding Street, Baotou, China
Jianjun Li, Yu Han, Ming Zhang, Gang Li & Baohua Zhang

Corresponding author

Correspondence to Jianjun Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, J., Han, Y., Zhang, M. et al. Multi-scale residual network model combined with Global Average Pooling for action recognition. Multimed Tools Appl 81, 1375–1393 (2022). https://doi.org/10.1007/s11042-021-11435-5

Received: 18 October 2020
Revised: 11 August 2021
Accepted: 17 August 2021
Published: 01 October 2021
Issue Date: January 2022
DOI: https://doi.org/10.1007/s11042-021-11435-5
