Abstract: Sign language translation aims to translate a sign language video directly into a natural-language sentence. Most existing methods take video-sentence pairs labeled by multiple specific signers as both training and testing samples. However, such a setting does not match real-world applications: a practical sign language translation system should provide accurate translations for unseen signers. In this paper, we tackle the signer-independent setting and focus on improving the generalization ability of the translation model. To adapt to this challenging setting, we propose a novel framework called contrastive disentangled meta-learning (CDM), which improves both the deep architecture and the training mode. Specifically, based on the minimax entropy objective, we develop a disentangled module with adaptive gated units to decouple the signer-specific and task-specific representations in the encoder. In addition, we facilitate frame-word alignment by imposing contrastive constraints between the obtained task-specific representation and the decoding output. The disentangled and contrastive modules provide complementary information to each other. As for the training mode, we encourage the model to perform well in simulated signer-independent scenarios by finding generalized learning directions in the meta-learning process. Because vanilla meta-learning methods make insufficient use of the multiple specific signers, we adopt a fine-grained learning strategy that conducts meta-learning simultaneously across a variety of domain-shift scenarios in each iteration. Extensive experiments on the benchmark dataset RWTH-PHOENIX-Weather-2014T (PHOENIX14T) show that CDM achieves competitive results compared with state-of-the-art methods.
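
The idea of simulating several domain-shift scenarios per iteration can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `signer_splits`, the parameter `n_meta_test`, and the signer identifiers are hypothetical, and the sketch assumes that one scenario corresponds to one partition of the training signers into meta-train and held-out meta-test groups.

```python
from itertools import combinations


def signer_splits(signers, n_meta_test=1):
    """Enumerate (meta_train, meta_test) partitions of the training signers.

    Assumption for illustration: each partition simulates one
    signer-independent scenario, where the model is updated on the
    meta-train signers and evaluated on the held-out meta-test signers.
    """
    signers = sorted(signers)
    splits = []
    for held_out in combinations(signers, n_meta_test):
        meta_train = [s for s in signers if s not in held_out]
        splits.append((meta_train, list(held_out)))
    return splits


# Hypothetical signer identifiers; a real dataset defines its own.
scenarios = signer_splits(["signer_a", "signer_b", "signer_c"])
for meta_train, meta_test in scenarios:
    print("meta-train:", meta_train, "| meta-test:", meta_test)
```

Under this assumption, a training iteration would loop over all scenarios instead of sampling a single split, so that every training signer serves as the simulated unseen signer at least once.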

Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation.

Contrastive Disentangled Meta-Learning for Signer-Independent Sign Language Translation.

Contrastive Learning for Sign Language Recognition and Translation.

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

Contrastive Learning Based Visual Representation Enhancement for Multimodal Machine Translation

VL-Meta: Vision-Language Models for Multimodal Meta-Learning

MC-SLT: Towards Low-Resource Signer-Adaptive Sign Language Translation

LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

MAML is a Noisy Contrastive Learner in Classification

Towards Reliable Neural Machine Translation with Consistency-Aware Meta-Learning

Improving Speech Translation by Cross-Modal Multi-Grained Contrastive Learning

MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models

Multilingual Multimodal Learning with Machine Translated Text

MCML: A Novel Memory-based Contrastive Meta-Learning Method for Few Shot Slot Tagging

Modal Contrastive Learning based End-to-End Text Image Machine Translation

Contrastive Vision-Language Alignment Makes Efficient Instruction Learner

A Lightweight Task-Agreement Meta Learning for Low-Resource Speech Recognition

Unified Lexical Representation for Interpretable Visual-Language Alignment

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment