SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval

Longtao Jiang, Min Wang, Zecheng Li, Yao Fang, Wengang Zhou, Houqiang Li
2024-07-23
Abstract: Different from traditional video retrieval, sign language retrieval is more biased towards understanding the semantic information of human actions contained in video clips. Previous works typically only encode RGB videos to obtain high-level semantic features, resulting in local action details being drowned in a large amount of visual information redundancy. Furthermore, existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training, and adopt an offline RGB encoder instead, leading to suboptimal feature representation. To address these issues, we propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS), which integrates Pose and RGB modalities to represent the local and global information of sign language videos. Specifically, the Pose encoder embeds the coordinates of keypoints corresponding to human joints, effectively capturing detailed action features. For better context-aware fusion of the two video modalities, we propose a Cross Gloss Attention Fusion (CGAF) module to aggregate the adjacent clip features with similar semantic information from intra-modality and inter-modality. Moreover, a Pose-RGB Fine-grained Matching Objective is developed to enhance the aggregated fusion feature by contextual matching of fine-grained dual-stream features. Besides the offline RGB encoder, the whole framework only contains learnable lightweight networks, which can be trained end-to-end. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods on various datasets.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper addresses several key problems in sign language retrieval. Specifically, the authors identify the following deficiencies in current methods:

1. **Loss of local motion details**: Existing methods usually encode only RGB videos to obtain high-level semantic features, so local motion details are drowned out by a large amount of redundant visual information.
2. **High memory consumption**: Because densely embedding visual data is too memory-intensive for end-to-end training, existing RGB-based methods fall back on offline RGB encoders, which yields suboptimal feature representations.
3. **Imbalance between global and local information**: Sign language carries both global visual signals (such as body position and facial expressions) and local motion signals (such as hand movements and palm motions). Existing RGB encoders tend to extract the global signals and ignore local motion information, which makes models less robust across different backgrounds and signers.

To solve these problems, the authors propose a new framework, the **Semantically Enhanced Dual-Stream Encoder (SEDS)**. It introduces the pose modality to supply local motion information and combines it with the RGB modality to fuse global and local information. The specific improvements are as follows:

- **Dual-stream encoder**: SEDS integrates a pose encoder and an RGB encoder to capture local motion details and global visual information, respectively.
- **Cross Gloss Attention Fusion (CGAF) module**: To better fuse the two modalities, the authors propose the CGAF module, which aggregates adjacent clip features with similar semantic information, both within and across modalities, and strengthens the focus on local information.
- **Fine-grained matching objective**: To ensure the quality of the fused features, the authors design an explicitly supervised Pose-RGB fine-grained matching objective, which implicitly aligns the fine-grained Pose-Text and RGB-Text similarity matrices by contextual matching of the fine-grained dual-stream features.

With these improvements, SEDS significantly outperforms existing methods on multiple datasets, including How2Sign, PHOENIX-2014T, and CSL-Daily.

### Formula summary

1. **Cross-gloss attention**:

   \[ h_p^t=\sum_{i=1}^{N}\hat{v_r}_{t,i}\cdot\frac{\exp(q_p^t\cdot\hat{k_r}_{t,i})}{\sum_{j=1}^{N}\exp(q_p^t\cdot\hat{k_r}_{t,j})} \]

   where \(h_p\in\mathbb{R}^{T\times D}\) is the output of the cross-gloss attention.

2. **Feature fusion**:

   \[ f_v = \text{MLP}([\hat{f_p},\hat{f_r}])+\hat{f_p}+\hat{f_r} \]

   where \([\cdot,\cdot]\) denotes concatenation and \(f_v\) is the fused feature.

3. **InfoNCE loss**:

   \[ L_{t2k}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(\tau\cdot M_{ii}^{t2k})}{\sum_{j=1}^{B}\exp(\tau\cdot M_{ij}^{t2k})} \]

   \[ L_{k2t}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(\tau\cdot M_{ii}^{k2t})}{\sum_{j=1}^{B}\exp(\tau\cdot M_{ji}^{k2t})} \]
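
To make the cross-gloss attention and InfoNCE formulas concrete, here is a minimal pure-Python sketch that evaluates them on plain lists. It is an illustration under stated assumptions, not the authors' implementation: the function names (`attend`, `info_nce_rows`, `info_nce_cols`) are hypothetical, `M` is assumed to be a \(B\times B\) similarity matrix with matched pairs on the diagonal, and \(\tau\) is treated as a plain scaling factor. How the \(N\) neighbouring keys and values are selected for each time step is not covered here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Attention formula for one time step t (hypothetical helper).

    query  : q_p^t, a vector of length D
    keys   : N key vectors of length D
    values : N value vectors of length D
    Returns h_p^t, the softmax-weighted sum of the value vectors.
    """
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def info_nce_rows(sim, tau):
    """L_t2k as written: M_ii is normalized over row i of the matrix."""
    batch = len(sim)
    loss = 0.0
    for i in range(batch):
        probs = softmax([tau * s for s in sim[i]])
        loss -= math.log(probs[i])
    return loss / batch


def info_nce_cols(sim, tau):
    """L_k2t as written: M_ii is normalized over column i of the matrix."""
    transposed = [list(col) for col in zip(*sim)]
    return info_nce_rows(transposed, tau)
```

As a sanity check, an uninformative similarity matrix (all entries equal) gives a loss of \(\log B\) in both directions, while a matrix whose diagonal dominates drives both losses towards zero.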