DualSign: Semi-Supervised Sign Language Production with Balanced Multi-Modal Multi-Task Dual Transformation

Wencan Huang, Zhou Zhao, Jinzheng He, Mingmin Zhang
DOI: https://doi.org/10.1145/3503161.3547957
2022-01-01
Abstract: Sign Language Production (SLP) aims to translate a spoken language description into its corresponding continuous sign language sequence. A prevailing solution to this problem is two-staged: it formulates SLP as two sub-tasks, i.e., Text to Gloss (T2G) translation and Gloss to Pose (G2P) animation, with gloss annotations as pivots. Although two-staged approaches achieve better performance than their direct translation counterparts, the requirement of gloss intermediaries causes a parallel data bottleneck. In this paper, to reduce reliance on gloss annotations in two-staged approaches, we propose DualSign, a semi-supervised two-staged SLP framework that can effectively utilize partially gloss-annotated text-pose pairs and monolingual gloss data. The key component of DualSign is a novel Balanced Multi-Modal Multi-Task Dual Transformation (BM3T-DT) method, in which two well-designed models, i.e., a Multi-Modal T2G model (MM-T2G) and a Multi-Task G2P model (MT-G2P), are jointly trained by leveraging their task duality and unlabeled data. After applying BM3T-DT, we derive the expected uni-modal T2G model from the well-trained MM-T2G with knowledge distillation. Considering that the MM-T2G may suffer from modality imbalance when decoding with multiple input modalities, we devise a cross-modal balancing loss, which further boosts the system's overall performance. Extensive experiments conducted on the PHOENIX14T dataset show the effectiveness of our approach in the semi-supervised setting. By training with additionally collected unlabeled data, DualSign substantially improves over previous state-of-the-art SLP methods.
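The knowledge distillation step mentioned in the abstract can be illustrated with a generic token-level sketch. This is not the paper's implementation: the abstract does not specify the distillation objective, so the temperature-softened KL divergence below, the function names `softmax` and `distillation_loss`, and the toy logits are all assumptions chosen for illustration. The teacher stands in for the multi-modal MM-T2G and the student for the uni-modal T2G model, each producing one row of logits over a gloss vocabulary per output position.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean per-position KL(teacher || student) on softened distributions.

    Both arguments are lists of rows, one row of vocabulary logits per
    output position. This generic objective is an assumption, not the
    loss used in the paper.
    """
    assert len(teacher_logits) == len(student_logits)
    loss = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row, temperature)  # teacher distribution
        q = softmax(s_row, temperature)  # student distribution
        loss += sum(
            pi * (math.log(pi) - math.log(qi))
            for pi, qi in zip(p, q)
            if pi > 0.0
        )
    return loss / len(teacher_logits)


if __name__ == "__main__":
    # Hypothetical logits for two gloss positions over a 3-word vocabulary.
    teacher = [[2.0, 0.5, -1.0], [0.1, 1.8, 0.3]]
    student = [[1.5, 0.7, -0.5], [0.0, 1.0, 0.9]]
    print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution at every position and grows as the two diverge, which is the sense in which the uni-modal student is trained to imitate the multi-modal teacher.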