Two-Stage Human Activity Recognition Using 2D-ConvNet

DOI: https://doi.org/10.9781/ijimai.2020.04.002

Keywords: Activity Recognition, Monitoring, Random Forest, Convolutional Neural Network (CNN)

Abstract
There is a growing need for continuous, intelligent monitoring systems that recognize human activity in domains such as public places, automated teller machines, and healthcare. The rising demand for automatic human activity recognition in these sectors, together with the need to reduce the cost of manual surveillance, has motivated the research community to apply deep learning techniques to the design and development of smart monitoring systems. Given the low cost, high resolution, and wide availability of surveillance cameras, the authors developed a new two-stage intelligent framework for detecting and recognizing types of human activity inside the premises. This paper introduces a novel framework that recognizes single-limb and multi-limb human activities using a Convolutional Neural Network. In the first stage, single-limb and multi-limb activities are separated; in the second stage, the separated activities are recognized by sequence classification. The framework was trained and validated on the UTKinect-Action dataset, which contains 199 action sequences performed by 10 users, and achieves an overall accuracy of 97.88% in real-time recognition of activity sequences.
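The following is a minimal sketch of the two-stage pipeline the abstract describes: a Random Forest that separates single-limb from multi-limb activities, followed by a 2D-ConvNet that classifies the activity sequence within each group. All shapes, feature layouts, class counts, and function names here are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of the two-stage framework (assumed details, not the paper's code).
# Stage 1: Random Forest separates single-limb vs. multi-limb activities.
# Stage 2: a 2D-ConvNet classifies the activity sequence within each group.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras import layers, models

N_JOINTS = 20   # UTKinect skeletons provide 20 joints
SEQ_LEN = 30    # assumed fixed-length (padded/truncated) sequence
N_CLASSES = 5   # assumed number of activity classes per limb group

def build_stage1():
    """Binary limb-group separator on flattened joint coordinates."""
    return RandomForestClassifier(n_estimators=100, random_state=42)

def build_stage2(n_classes=N_CLASSES):
    """2D-ConvNet over a (frames x joint-features) 'image' of the sequence."""
    model = models.Sequential([
        layers.Input(shape=(SEQ_LEN, N_JOINTS * 3, 1)),  # x, y, z per joint
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def recognize(seq, stage1, cnn_single, cnn_multi):
    """Route one sequence through the separator, then the matching ConvNet."""
    group = stage1.predict(seq.reshape(1, -1))[0]  # 0 = single-limb, 1 = multi-limb
    cnn = cnn_single if group == 0 else cnn_multi
    probs = cnn.predict(seq.reshape(1, SEQ_LEN, N_JOINTS * 3, 1), verbose=0)
    return group, int(np.argmax(probs))
```

In this reading, the first stage acts as a cheap gating classifier so that each ConvNet only has to discriminate among activities of a single limb group, which is one plausible way to realize the "separation then sequence-classification" design the abstract outlines.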