Abstract: Different from traditional video retrieval, sign language retrieval is more biased towards understanding the semantic information of human actions contained in video clips. Previous works typically only encode RGB videos to obtain high-level semantic features, resulting in local action details drowned in a large amount of visual information redundancy. Furthermore, existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training, and adopt offline RGB encoder instead, leading to suboptimal feature representation. To address these issues, we propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS), which integrates Pose and RGB modalities to represent the local and global information of sign language videos. Specifically, the Pose encoder embeds the coordinates of keypoints corresponding to human joints, effectively capturing detailed action features. For better context-aware fusion of two video modalities, we propose a Cross Gloss Attention Fusion (CGAF) module to aggregate the adjacent clip features with similar semantic information from intra-modality and inter-modality. Moreover, a Pose-RGB Fine-grained Matching Objective is developed to enhance the aggregated fusion feature by contextual matching of fine-grained dual-stream features. Besides the offline RGB encoder, the whole framework only contains learnable lightweight networks, which can be trained end-to-end. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods on various datasets.

What problem does this paper attempt to address?

This paper addresses several key problems in sign language retrieval. Specifically, the authors point out the following deficiencies in current methods:

1. **Loss of local motion details**: Existing sign language retrieval methods usually encode only RGB videos to obtain high-level semantic features, so local motion details are drowned out by a large amount of redundant visual information.
2. **High memory consumption**: Because embedding dense visual data end-to-end is very memory-intensive, existing RGB-based methods fall back on offline RGB encoders, which yields suboptimal feature representations.
3. **Imbalance between global and local information**: Sign languages contain both global visual signals (such as body position and facial expressions) and local motion signals (such as hand movements and palm motions). Existing RGB encoders tend to extract the global signals and ignore the local motion information, which makes the model less robust across different backgrounds and signers.

To solve these problems, the authors propose a new framework, the **Semantically Enhanced Dual-Stream Encoder (SEDS)**. It supplements local motion information with a pose modality and combines it with the RGB modality to fuse global and local information. The specific improvements are as follows:

- **Dual-stream encoder**: SEDS integrates a pose encoder and an RGB encoder to capture local motion details and global visual information, respectively.
- **Cross Gloss Attention Fusion (CGAF) module**: To better fuse the two modalities, the authors propose an attention-based fusion module that aggregates adjacent clip features with similar semantic information, both within and across modalities, and strengthens the focus on local information.
- **Fine-grained matching objective**: To ensure the quality of the fused features, the authors design an explicitly supervised Pose-RGB fine-grained matching objective, which implicitly aligns the fine-grained similarity matrices of Pose-Text and RGB-Text by performing contextual matching on the fine-grained dual-stream features.

With these improvements, the SEDS framework significantly outperforms existing methods on multiple datasets, including How2Sign, PHOENIX-2014T, and CSL-Daily.

### Formula summary

1. **Cross-gloss attention**:

   \[ h_p^t=\sum_{i = 1}^{N}\hat{v_r}_{t,i}\cdot\frac{\exp(q_p^t\cdot\hat{k_r}_{t,i})}{\sum_{j = 1}^{N}\exp(q_p^t\cdot\hat{k_r}_{t,j})} \]

   where \(h_p\in\mathbb{R}^{T\times D}\) is the output of the cross-gloss attention and \(h_p^t\) is its \(t\)-th row.

2. **Feature fusion**:

   \[ f_v = \text{MLP}([\hat{f_p},\hat{f_r}])+\hat{f_p}+\hat{f_r} \]

   where \([\cdot,\cdot]\) denotes concatenation and \(f_v\) is the fused feature.

3. **InfoNCE loss**:

   \[ L_{t2k}=-\frac{1}{B}\sum_{i = 1}^{B}\log\frac{\exp(\tau\cdot M_{ii}^{t2k})}{\sum_{j = 1}^{B}\exp(\tau\cdot M_{ij}^{t2k})} \]

   \[ L_{k2t}=-\frac{1}{B}\sum_{i = 1}^{B}\log\frac{\exp(\tau\cdot M_{ii}^{k2t})}{\sum_{j = 1}^{B}\exp(\tau\cdot M_{ji}^{k2t})} \]
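
The three formulas can be sketched in a few lines of plain Python to make the index conventions concrete. This is only an illustrative sketch, not the authors' implementation: the function names are invented here, the MLP is passed in as a hypothetical callable, and the reading of the symbols (subscript \(p\) for Pose, \(r\) for RGB, \(B\) as the batch size, \(\tau\) as a scaling factor on the similarity matrix \(M\)) is an assumption inferred from the summary above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_gloss_attention(q_t, keys_t, values_t):
    """Compute h_p^t for one time step t.

    q_t      -- query vector for step t (assumed to come from the Pose stream)
    keys_t   -- the N key vectors attended to at step t
    values_t -- the N value vectors, aligned with keys_t
    """
    weights = softmax([dot(q_t, k) for k in keys_t])
    dim = len(values_t[0])
    return [sum(w * v[d] for w, v in zip(weights, values_t)) for d in range(dim)]


def fuse(f_p, f_r, mlp):
    """f_v = MLP([f_p, f_r]) + f_p + f_r for a single clip feature.

    `mlp` is a hypothetical callable mapping a 2D-long list to a D-long list.
    """
    projected = mlp(f_p + f_r)  # list concatenation stands in for [., .]
    return [m + p + r for m, p, r in zip(projected, f_p, f_r)]


def info_nce(sim, tau, normalize_over="row"):
    """InfoNCE over a B x B similarity matrix with matched pairs on the diagonal.

    normalize_over="row": denominator sums M_ij over j (the L_t2k form).
    normalize_over="col": denominator sums M_ji over j (the L_k2t form).
    """
    batch = len(sim)
    loss = 0.0
    for i in range(batch):
        if normalize_over == "row":
            logits = [tau * sim[i][j] for j in range(batch)]
        else:
            logits = [tau * sim[j][i] for j in range(batch)]
        loss -= math.log(softmax(logits)[i])
    return loss / batch


# Toy usage with made-up numbers.
h_t = cross_gloss_attention([1.0, 0.0], [[0.5, 0.1], [0.2, 0.9]], [[1.0, 2.0], [3.0, 4.0]])
f_v = fuse([1.0, 2.0], [3.0, 4.0], mlp=lambda x: [0.0] * (len(x) // 2))
total_loss = info_nce([[0.9, 0.1], [0.2, 0.8]], tau=10.0, normalize_over="row") \
    + info_nce([[0.9, 0.1], [0.2, 0.8]], tau=10.0, normalize_over="col")
```

The only difference between the two loss directions is whether the softmax normalizes along a row or a column of the similarity matrix, which is why a single function with a `normalize_over` switch is enough for this sketch.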

SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval
