Improving human action recognition by jointly exploiting video and WiFi clues

Jun Guo, Mei Shi, Xingwu Zhu, Wei Huang, Yi He, Weiwei Zhang, Zhanyong Tang
DOI: https://doi.org/10.1016/j.neucom.2020.11.074
IF: 6
2021-10-01
Neurocomputing
Abstract:<p>Recent years have witnessed increasing attention to human action recognition (HAR). Traditional methods tend to seek the optimal spatiotemporal feature representation of human actions in video clips in order to achieve high recognition performance. However, optical limitations, such as an inappropriate view, dim illumination, and object occlusion, usually degrade video quality and substantially harm recognition performance. Because wireless signals are robust against optical limitations, we combine WiFi signals with video streams for HAR. Specifically, we use WiFi Channel State Information as a compensator for video streams. A major challenge is how to fuse the video and WiFi information effectively to achieve better prediction performance. To this end, we employ convolutional neural networks and statistical analysis algorithms to extract video and WiFi features, respectively, and propose a novel multi-modal learning approach for video and WiFi feature fusion, in which the video and WiFi features are projected into a common space by supervised learning. The experimental results indicate that the recognition precision of human actions in videos improves markedly with the aid of WiFi signals, and that the proposed multi-modal learning approach rivals state-of-the-art methods.</p>
computer science, artificial intelligence
What problem does this paper attempt to address?

This paper primarily addresses the performance degradation of human action recognition (HAR) when video data is affected by optical limitations such as insufficient lighting and occlusion. Specifically:

- **Introducing WiFi signals**: Because wireless signals such as WiFi are robust to the optical limitations that affect video, the researchers combine WiFi signals with video streams to improve HAR accuracy.
- **Multimodal fusion method**: A new supervised multimodal fusion method is proposed, which fuses video and WiFi features by projecting them into a common space to enhance discriminative information.
- **Experimental validation**: A series of experiments shows that WiFi signals improve recognition over video alone and that the proposed fusion method is competitive with state-of-the-art methods.

In summary, this paper aims to improve the accuracy of human action recognition by combining video and WiFi signals, especially when video quality is poor.
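To make the idea of supervised projection into a common space concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it assumes pre-extracted video and WiFi feature matrices, and it swaps in plain ridge regression onto one-hot labels as a stand-in for the learned projection. The function names (`fit_projection`, `fit_fusion`, `predict`) and the `reg` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def one_hot(y, num_classes):
    """Turn integer class labels into one-hot rows."""
    return np.eye(num_classes)[y]


def fit_projection(X, Y, reg=1e-2):
    """Ridge-regression map from one modality's features to the label space.

    Assumption: a linear map stands in for the paper's learned projection.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y)


def fit_fusion(X_video, X_wifi, y, num_classes, reg=1e-2):
    """Learn one projection per modality, both supervised by the same labels."""
    Y = one_hot(y, num_classes)
    W_video = fit_projection(X_video, Y, reg)
    W_wifi = fit_projection(X_wifi, Y, reg)
    return W_video, W_wifi


def predict(X_video, X_wifi, W_video, W_wifi):
    """Project both modalities into the common space, sum, and take the argmax."""
    common = X_video @ W_video + X_wifi @ W_wifi
    return common.argmax(axis=1)
```

Because both projections target the same label space, the WiFi scores can compensate when the video scores are noisy, which mirrors the compensator role the abstract assigns to WiFi. A real system would likely use a nonlinear projection and held-out evaluation.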