Dynamical semantic enhancement network for continuous sign language recognition

Suyang Wang, Leming Guo, Wanli Xue
DOI: https://doi.org/10.1007/s00530-024-01505-7
IF: 3.9
2024-10-12
Multimedia Systems
Abstract: In sign language recognition, semantic information is conveyed primarily through facial and hand gestures, and interpreting it effectively poses significant challenges. Previous methods often struggle to capture these semantic areas while also assessing the varying importance of different motion frames, which hampers recognition accuracy. We propose the Dynamical Semantic Enhancement (DSE) Network, which integrates Long-Short Dependence Attention (LSDA) and Global Interaction Conv2d (GIConv2d) to address these challenges. LSDA forms long- and short-range spatial dependencies by coupling large-kernel convolutions with small-kernel convolutions, which effectively captures synchronous facial and hand semantic content. Meanwhile, GIConv2d adaptively learns motion semantic content by dynamically generating calibrated weights, focusing on reasoning about the frame-level contributions of rapid-movement frames and static frames. DSE achieves competitive performance on three widely used datasets: PHOENIX14, PHOENIX14-T, and CSL-Daily. Additionally, visualization experiments confirm DSE's superior capability in reinforcing semantic extraction both spatially and temporally.
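The frame-level weighting that the abstract attributes to GIConv2d can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the paper's layer: the abstract does not say how the calibrated weights are generated, so the functions `frame_gates` and `recalibrate`, the `sharpness` parameter, and the fixed rule "gate each frame by a sigmoid of its motion relative to the clip average" are all hypothetical. In the actual network the weights would be learned rather than hand-set.

```python
import math


def frame_gates(frames, sharpness=1.0):
    """Return one weight in (0, 1) per frame, based on its motion magnitude.

    `frames` is a non-empty list of equal-length feature vectors.
    Motion is the mean absolute difference to the previous frame;
    the first frame is compared with itself, so its motion is 0.
    This hand-written rule is an assumption, not the paper's method.
    """
    motions = []
    prev = frames[0]
    for frame in frames:
        diff = sum(abs(a - b) for a, b in zip(frame, prev)) / len(frame)
        motions.append(diff)
        prev = frame

    mean_motion = sum(motions) / len(motions)

    # Sigmoid centred on the clip's mean motion: frames that move more
    # than average get a gate above 0.5, more static frames get less.
    return [
        1.0 / (1.0 + math.exp(-sharpness * (m - mean_motion)))
        for m in motions
    ]


def recalibrate(frames, gates):
    """Scale every feature of each frame by that frame's gate."""
    return [[g * x for x in frame] for frame, g in zip(frames, gates)]


# Toy clip: two static frames, one abrupt change, then static again.
clip = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
gates = frame_gates(clip)
weighted = recalibrate(clip, gates)
```

In this toy clip only the third frame differs from its predecessor, so it receives the largest gate. Whether a real model should emphasize rapid or static frames is exactly what the learned weights are meant to decide, so the direction chosen here is illustrative only.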
computer science, information systems, theory & methods