Jointly Harnessing Prior Structures and Temporal Consistency for Sign Language Video Generation

Yucheng Suo, Zhedong Zheng, Xiaohan Wang, Bang Zhang, Yi Yang
DOI: https://doi.org/10.1145/3648368
2024-02-16
Abstract: Sign language provides a way for differently-abled individuals to express their feelings and emotions. However, learning sign language can be challenging and time-consuming. An alternative approach is to animate user photos using sign language videos of specific words, which can be achieved using existing image animation methods. However, the finger motions in the generated videos are often not ideal. To address this issue, we propose the Structure-aware Temporal Consistency Network (STCNet), which jointly optimizes the prior structure of humans with temporal consistency to produce sign language videos. We use a fine-grained skeleton detector to acquire knowledge of body structure and introduce short-term cycle loss and long-term cycle loss to ensure the continuity of the generated video. The two losses and keypoint detector network are optimized in an end-to-end manner. Quantitative and qualitative evaluations on three widely-used datasets, namely LSA64, Phoenix-2014T, and WLASL-2000, demonstrate the effectiveness of the proposed method. We hope this work can contribute to future studies on sign language production.
computer science, information systems, theory & methods, software engineering