A Sign Language Recognition Framework Based on Cross-Modal Complementary Information Fusion

Jiangtao Zhang, Qingshan Wang, Qi Wang
DOI: https://doi.org/10.1109/tmm.2024.3377095
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Sign language recognition (SLR) can connect the hearing-impaired and able-bodied communities. SLR methods that combine multiple modalities have garnered attention. However, these methods become much less effective, or fail outright, when a modality is missing. This paper therefore proposes MMSLR, a multimodal SLR framework built on cross-modal complementary information. The framework comprises three key components: the cross-modal information complementation (CMIC) module, the fusion and prediction module (FPM), and the sign language recognition module (SLRM). The CMIC module uses multilayer, multi-view spatial-temporal detectors to observe the features of different modalities in both the temporal and spatial dimensions, and it uses co-training so that the modalities supply complementary information to one another. The FPM integrates cross-modal attention with Canberra distance to eliminate redundant inter-modal information while fusing multimodal features. The SLRM, built on the Transformer, fuses the partially available modalities from CMIC through bidirectional cross-channel attention. Teacher-student pairs are constructed to transfer full-modal features from the FPM to these fused modality features. Experimental results on the provided MM-Sentence dataset and the publicly available OH-Sentence, TH-Sentence, and USTCCSL datasets demonstrate that MMSLR achieves state-of-the-art performance.
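The abstract names Canberra distance but does not say how the FPM applies it. As a reference point only, the standard definition for two vectors x and y is the sum over i of |x_i - y_i| / (|x_i| + |y_i|). The sketch below computes that quantity for two feature vectors. The function name `canberra_distance` and the convention of treating a 0/0 term as zero are assumptions made for illustration, not details taken from the paper.

```python
from typing import Sequence


def canberra_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Standard Canberra distance between two equal-length vectors.

    Illustrative only: the abstract does not specify how MMSLR uses
    this distance inside the FPM.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")

    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom == 0.0:
            # Assumed convention: a 0/0 term contributes nothing.
            continue
        total += abs(a - b) / denom
    return total
```

Each term is bounded by 1, so the distance between two n-dimensional vectors lies between 0 (identical vectors) and n. Identical feature vectors from two modalities would therefore score 0 under this measure.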
computer science, information systems, telecommunications, software engineering