Multi-scale residual network model combined with Global Average Pooling for action recognition

Multimedia Tools and Applications

Abstract

Human action recognition is an active research topic in computer vision, but complex environments and the diversity of actions still make it challenging. Traditional CNNs also suffer from single-scale features, accuracy that degrades as the network deepens, and an excessive number of parameters. To address these problems, this paper proposes a novel residual network model based on Multi-scale Feature Fusion and Global Average Pooling. The model uses a Multi-scale Feature Fusion module to extract features at different scales, enriching the spatio-temporal information. At the end of the network, Global Average Pooling replaces the Fully Connected layer. Compared with a Fully Connected layer, Global Average Pooling dilutes the dependence on the relative positions of different features, so the features learned by the convolutional layers are more effective. In addition, Global Average Pooling maps output channels directly to feature categories, which reduces the number of model parameters. The model is evaluated on the UT-Interaction dataset, UCF11 (YouTube Action dataset), UCF101 dataset and CAVIAR dataset. The results show that, compared with state-of-the-art approaches, the proposed approach achieves high recognition accuracy and robustness, and performs well on datasets with complex backgrounds and diverse action categories.
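
The role of Global Average Pooling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 2 x 2 toy maps, and the 7 x 7 x 101 head size are assumptions chosen only for illustration. The sketch shows each output channel being averaged into one class score, and compares the parameter count of a dense head on the same feature maps with the parameter-free pooling head.

```python
# Illustrative sketch only. Shapes, values, and function names are assumptions
# for this example and are not taken from the paper.

def global_average_pool(feature_maps):
    """Reduce each channel (an H x W grid) to its mean value.

    feature_maps: list of C channels, each a list of H rows of W floats.
    Returns a list of C floats, one score per channel.
    """
    pooled = []
    for channel in feature_maps:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append(total / count)
    return pooled


def fc_head_params(channels, height, width, num_classes):
    """Weights plus biases of a dense layer applied to the flattened maps."""
    return channels * height * width * num_classes + num_classes


# Three hypothetical 2 x 2 maps, one per class, as a final convolution
# would emit when its channel count equals the number of classes.
maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
scores = global_average_pool(maps)       # [2.5, 2.0, 5.0]
predicted = scores.index(max(scores))    # 2

# Pooling has no trainable weights, so the pooling head adds 0 parameters.
# A dense head on an assumed 101 channels of 7 x 7 with 101 classes needs:
dense_params = fc_head_params(101, 7, 7, 101)    # 499950
```

Because every spatial position contributes equally to the average, moving the peak value of the second map to a different cell leaves its score unchanged, which is one way to read the abstract's remark that pooling dilutes the relative positions of features.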

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant Nos. 62066036 and 61663036).

Author information

Corresponding author

Correspondence to Jianjun Li.

Cite this article

Li, J., Han, Y., Zhang, M. et al. Multi-scale residual network model combined with Global Average Pooling for action recognition. Multimed Tools Appl 81, 1375–1393 (2022). https://doi.org/10.1007/s11042-021-11435-5

  • DOI: https://doi.org/10.1007/s11042-021-11435-5
