A Bidirectional Separated Distillation-Based Cross-Modal Interactive Fusion Network for Skeleton-Based Action Recognition
Mingdao Wang, Xianlin Zhang, Siqi Chen, Xueming Li, Yue Zhang
DOI: https://doi.org/10.1109/jsen.2024.3491183
IF: 4.3
2024-01-01
IEEE Sensors Journal
Abstract: Human skeleton-based action recognition has drawn extensive attention recently, and research increasingly fuses multiple modalities. Existing methods mainly train a separate encoder for each skeleton modality (joint, bone, and motion) independently, either sequentially or synchronously, and fuse the learned features with empirical weights to obtain the classification logits. However, the lack of essential cross-modal interaction in these methods leads to underexploitation of the rich complementary information across modalities. In this paper, we propose a three-stream bidirectional cross-modal separated distillation fusion network (BCMSD-FN) that learns action variations from complementary modalities simultaneously and adaptively fuses the modalities’ features, formulating cross-modal interaction as a knowledge distillation problem. Specifically, we introduce the bidirectional cross-modal separated distillation objective (BCMSD) to boost the interaction by considering target-class and non-target-class distillation separately. With this objective, knowledge is exchanged and transferred bidirectionally between modalities. Then, instead of simply fusing learned features with empirical weights, we propose a channel-wise feature fusion module (CWFM) to improve the feature fusion of the three modalities. Finally, we instantiate the proposed method via a popular CLIP-like framework, i.e., using the language signal as extra supervision. Experimental results on the NTU RGB+D, NTU RGB+D 120, NW-UCLA, and UAV-Human datasets show that our approach outperforms other CLIP-like methods and achieves state-of-the-art performance.
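
The abstract does not give the exact form of the BCMSD objective, so the following is only a rough sketch of what "separated" and "bidirectional" distillation between modality streams could look like. It assumes a decoupled formulation in which a target-class term compares the binary (target vs. rest) distributions and a non-target term compares the distributions over the remaining classes. The function names, the weights `alpha` and `beta`, and the temperature `T` are all hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    # KL(p || q) per sample, with a small epsilon for stability.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def separated_kd(student_logits, teacher_logits, labels,
                 alpha=1.0, beta=1.0, T=1.0):
    # One-directional distillation split into a target-class term and a
    # non-target-class term (assumed form, not the paper's definition).
    n, c = student_logits.shape
    mask = np.zeros((n, c), dtype=bool)
    mask[np.arange(n), labels] = True

    ps = softmax(student_logits / T)
    pt = softmax(teacher_logits / T)

    # Target-class term: binary distribution (target, everything else).
    bs = np.stack([ps[mask], 1.0 - ps[mask]], axis=1)
    bt = np.stack([pt[mask], 1.0 - pt[mask]], axis=1)
    target_term = kl(bt, bs)

    # Non-target term: softmax restricted to the non-target classes.
    ns = softmax(student_logits[~mask].reshape(n, c - 1) / T)
    nt = softmax(teacher_logits[~mask].reshape(n, c - 1) / T)
    non_target_term = kl(nt, ns)

    return float(np.mean(alpha * target_term + beta * non_target_term) * T ** 2)

def bidirectional_loss(logits_by_modality, labels, **kw):
    # Sum the one-directional loss over every ordered pair of modalities,
    # so each stream acts as both teacher and student.
    names = list(logits_by_modality)
    total = 0.0
    for s in names:
        for t in names:
            if s != t:
                total += separated_kd(logits_by_modality[s],
                                      logits_by_modality[t], labels, **kw)
    return total

rng = np.random.default_rng(0)
labels = np.array([0, 2, 1, 3])
streams = {name: rng.normal(size=(4, 5)) for name in ("joint", "bone", "motion")}
loss = bidirectional_loss(streams, labels)
```

In a real training loop the teacher-side logits would typically be detached from the gradient graph for each direction, and this term would be added to the per-stream classification loss; both choices are assumptions here rather than details stated in the abstract.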