View-Robust Neural Networks for Unseen Human Action Recognition in Videos

Jiahui Yu, Tianyu Ma, Zhaojie Ju, Hang Chen, Yingke Xu
DOI: https://doi.org/10.1109/smc53654.2022.9945457
2022-01-01
Abstract: Data-driven deep learning has achieved excellent performance in human action recognition. However, unseen action recognition remains a challenge for most existing neural networks because the action categories, collection perspectives, and scenarios considered during data collection are limited. Compared with class-unseen action recognition, view-unseen action recognition in videos is under-explored. This paper proposes view-robust neural networks (VR-Net) to recognize unseen actions in videos. VR-Net consists of a 3D pose estimation module, skeleton adaptive transformation neural networks, and classification modules. We first extract 3D skeleton models from the video sequence using existing pose estimation methods. Next, we propose a skeleton representation transformation scheme and implement it with Convolutional Neural Networks (VR-CNN) and Graph Neural Networks (VR-GCN) to obtain the optimal skeleton representations. Furthermore, we explore an associate optimization scheme and a fused output method. We evaluate the proposed neural networks on three challenging benchmarks: the NTU RGB-D dataset (NTU), the Kinetics-400 dataset, and the Human3.6M dataset (H3.6M). The experimental results show that the view-robust neural networks achieve top performance compared with state-of-the-art RGB-based and skeleton-based works, for example 93.6% on NTU (CV) and 94.6% on Kinetics-400 (Top-5). The proposed neural networks also significantly improve performance on unseen action recognition, for example 86.8% on H3.6M (View 2).
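The abstract does not spell out the skeleton representation transformation, so the following is only a minimal sketch of the general idea of re-expressing a 3D skeleton sequence under a different viewpoint: a rigid rotation plus translation applied to every joint. The function names, the Euler-angle parameterization, and the (frames, joints, 3) array layout are assumptions made for illustration, not the authors' VR-CNN or VR-GCN design. In a learned model these parameters would presumably be predicted by a network; here they are supplied by hand.

```python
import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """Rotation about x (alpha), then y (beta), then z (gamma), in radians."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return rz @ ry @ rx

def transform_skeleton(seq, angles, translation):
    """Map each joint p to R (p - d).

    seq: array-like of shape (frames, joints, 3) -- an assumed layout.
    angles: (alpha, beta, gamma) in radians; translation: length-3 offset d.
    """
    seq = np.asarray(seq, dtype=float)
    r = rotation_matrix(*angles)
    # Row vectors times R^T is equivalent to R applied to column vectors.
    return (seq - np.asarray(translation, dtype=float)) @ r.T
```

Because the mapping is rigid, bone lengths are unchanged and only the observation viewpoint differs, which is why a transformation of this kind is a natural building block for view-robust skeleton models.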