Transformer-Based Multiview Deep Feature Learning for Action Recognition in Depth Videos

Hanbo Wu, Xin Ma, Yibin Li
DOI: https://doi.org/10.2139/ssrn.4299970
2022-01-01
Abstract: Transformer architectures have recently attracted increasing attention and achieved remarkable success in the field of video action recognition. However, almost all existing transformer-based methods are applied to RGB video data from a single view, which limits the generalization performance of the transformer model. Human action recognition in depth videos is an important research direction, since depth data is not only invariant to illumination and color variations but also provides reliable 3D geometric information about the body silhouette. In this paper, we extend the transformer architecture to depth video action recognition and propose a transformer-based multiview deep feature learning framework. It mainly consists of an intra-view self-attention encoder module (ISEM) and a cross-view feature fusion module (CFFM). Specifically, a depth video is first projected into three orthogonal views to construct multiview depth dynamic volumes that describe the 3D spatiotemporal evolution of human actions. We feed the multiview depth dynamic volumes into a 3D CNN for spatiotemporal feature modeling. Based on the deep convolutional feature maps of the three views, the ISEM learns long-range spatiotemporal dependencies within each view. The CFFM performs inter-view interactions and then integrates the cross-view features to generate a global feature representation, which is finally used for action recognition. Extensive experiments conducted on two large-scale datasets demonstrate the effectiveness of our method in improving recognition performance on depth videos.
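The projection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the projection is computed, so the following assumes a common convention: each depth frame is quantized into integer depth bins, the front view keeps the depth map itself, and the side and top views are binary occupancy maps. The function name `project_three_views` and the parameter `num_bins` are hypothetical and used for illustration only. The construction of the depth dynamic volumes, the 3D CNN, the ISEM, and the CFFM are not reproduced here.

```python
def project_three_views(depth, num_bins):
    """Project one quantized depth frame onto three orthogonal planes.

    Assumption (not taken from the paper): `depth` is a list of rows of
    integers in [0, num_bins), where 0 marks background pixels.
    Returns (front, side, top) as nested lists.
    """
    height = len(depth)
    width = len(depth[0])

    # Front view (H x W): the depth map itself.
    front = [row[:] for row in depth]
    # Side view (H x Z): 1 where some pixel in row y has depth bin z.
    side = [[0] * num_bins for _ in range(height)]
    # Top view (Z x W): 1 where some pixel in column x has depth bin z.
    top = [[0] * width for _ in range(num_bins)]

    for y in range(height):
        for x in range(width):
            z = depth[y][x]
            if z == 0:
                continue  # skip background pixels
            side[y][z] = 1
            top[z][x] = 1
    return front, side, top


# Toy 2 x 3 frame with depth bins 0..3 (0 = background).
frame = [
    [0, 2, 0],
    [1, 2, 0],
]
front, side, top = project_three_views(frame, num_bins=4)
```

Applying such a projection to every frame of a depth video would yield three per-view frame sequences; how the paper aggregates them into dynamic volumes is left unspecified by the abstract.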