Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation.

Linjun Li, Tao Jin, Xize Cheng, Ye Wang, Wang Lin, Rongjie Huang, Zhou Zhao
DOI: https://doi.org/10.18653/v1/2023.findings-acl.699
2023-01-01
Abstract: Visual temporal-aligned translation aims to transform a visual sequence into natural words and covers important applications such as lipreading and fingerspelling recognition. However, different speakers or signers perform the same words with different habits, which can cause visual ambiguity and has become a major obstacle for current methods. Given this constraint, the generalization ability of a translation system should be further examined through evaluation on unseen performers. In this paper, we develop a novel generalizable framework named Contrastive Token-Wise Meta-learning (CtoML), which strives to transfer recognition skills to unseen performers. To the best of our knowledge, employing meta-learning methods directly in the image domain poses two main challenges, and we propose corresponding strategies. First, sequence prediction in visual temporal-aligned translation, which generates multiple words autoregressively, differs from vanilla classification. Thus, we devise token-wise diversity-aware weights for the meta-train stage, which encourage the model to focus on ambiguously recognized tokens. Second, considering the consistency of word-visual prototypes across different domains, we develop two complementary global and local contrastive losses to maintain inter-class relationships and promote domain independence. We conduct extensive experiments on the widely used lipreading dataset GRID and the fingerspelling dataset ChicagoFSWild, and the experimental results show the effectiveness of our proposed CtoML over existing state-of-the-art methods.
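The abstract does not give the formula for the token-wise diversity-aware weights, so the following is only an illustrative sketch of the general idea of weighting a sequence loss per token so that ambiguous tokens count more. It assumes, purely for illustration, that ambiguity is measured by the entropy of the predicted distribution at each position; the function name `token_weighted_nll` and this entropy-based weighting are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def token_weighted_nll(logits_seq, targets):
    """Per-token negative log-likelihood with entropy-based weights.

    ASSUMPTION: the weight of each token is its prediction entropy,
    rescaled so the weights average to 1. This is a stand-in for the
    paper's diversity-aware weights, whose definition is not in the abstract.
    """
    probs_seq = [softmax(logits) for logits in logits_seq]
    ents = [entropy(p) for p in probs_seq]
    n = len(targets)
    total_ent = sum(ents)
    if total_ent == 0.0:
        weights = [1.0] * n  # fall back to uniform weighting
    else:
        weights = [n * e / total_ent for e in ents]
    # Clamp probabilities to avoid log(0) on extreme logits.
    nll = [-math.log(max(p[t], 1e-12)) for p, t in zip(probs_seq, targets)]
    loss = sum(w * l for w, l in zip(weights, nll)) / n
    return loss, weights


# Usage: the second position is nearly uniform (ambiguous), so it gets
# a larger weight than the confidently predicted first position.
loss, weights = token_weighted_nll(
    [[5.0, 0.0, 0.0], [0.1, 0.0, 0.0]],
    targets=[0, 0],
)
print(loss, weights)
```

In a meta-learning setting, a loss of this shape would be evaluated on the meta-train tasks; how CtoML actually combines it with the global and local contrastive losses is described in the paper itself, not here.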