Bridge to Non-Barrier Communication: Gloss-Prompted Fine-grained Cued Speech Gesture Generation with Diffusion Model

Wentao Lei, Li Liu, Jun Wang
2024-04-30
Abstract: Cued Speech (CS) is an advanced visual phonetic encoding system that integrates lip reading with hand codings, enabling people with hearing impairments to communicate efficiently. CS video generation aims to produce specific lip and gesture movements of CS from audio or text inputs. The main challenge is that, given limited CS data, we strive to simultaneously generate fine-grained hand and finger movements as well as lip movements, while the two kinds of movements must be asynchronously aligned. Existing CS generation methods are fragile and prone to poor performance because they rely on template-based statistical models and careful hand-crafted pre-processing to fit the models. Therefore, we propose a novel Gloss-prompted Diffusion-based CS Gesture generation framework (called GlossDiff). Specifically, to integrate additional linguistic rule knowledge into the model, we first introduce a bridging instruction called Gloss, which is an automatically generated descriptive text that establishes a direct and more delicate semantic connection between spoken language and CS gestures. Moreover, we are the first to suggest that rhythm is an important paralinguistic feature for improving the communication efficacy of CS. Therefore, we propose a novel Audio-driven Rhythmic Module (ARM) to learn rhythm that matches audio speech. In addition, in this work, we design, record, and publish the first Chinese CS dataset with four CS cuers. Extensive experiments demonstrate that our method quantitatively and qualitatively outperforms current state-of-the-art (SOTA) methods. We release the code and data at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses several key challenges in **Cued Speech (CS) video generation**:

1. **Fine-grained gesture generation**: Existing CS generation methods perform poorly at generating fine-grained hand and finger movements, especially when data is limited. This limits the usefulness of CS in practical applications.
2. **Asynchronous alignment of lips and gestures**: CS generation must produce hand and lip movements together, and the two kinds of movement need to be asynchronously aligned. Existing methods struggle here because they usually rely on template-based statistical models and careful hand-crafted pre-processing.
3. **Lack of rhythm information**: Existing CS generation methods ignore rhythm, a paralinguistic feature that the authors argue is important for the communication efficacy of CS.

### Solutions

To address these challenges, the authors propose a diffusion-based CS gesture generation framework, **GlossDiff**. Its main contributions are:

1. **Introduction of CS Gloss**: An automatically generated descriptive text (Gloss) establishes a direct semantic connection between spoken language and CS gestures, giving the generator more specific cues.
2. **Audio-driven Rhythmic Module (ARM)**: A new module learns rhythm that matches the audio speech, improving the naturalness and synchrony of the generated gestures.
3. **Large-scale CS dataset**: The first large-scale Chinese CS dataset (MCCS), recorded with four CS cuers, is constructed and released.
4. **Superior performance**: In extensive experiments, GlossDiff outperforms existing methods on multiple metrics, especially in fine-grained gesture generation and rhythm synchrony.

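
As a rough illustration of the Gloss idea described in the solutions above, the sketch below turns a sequence of syllables into descriptive instruction text. It assumes the general Cued Speech convention that hand shapes cue consonants and hand positions cue vowels. The lookup tables, the function names `gloss_for_syllable` and `gloss_for_utterance`, and the sentence template are all hypothetical placeholders; they are not the real Mandarin Chinese CS chart and not the authors' Gloss format.

```python
# HYPOTHETICAL lookup tables for illustration only. They do not reproduce
# the real Mandarin Chinese CS chart or the paper's Gloss vocabulary.
HAND_SHAPE = {"b": "hand shape 1", "m": "hand shape 2", "d": "hand shape 3"}
HAND_POSITION = {"a": "position 1", "i": "position 2", "u": "position 3"}


def gloss_for_syllable(consonant: str, vowel: str) -> str:
    """Describe one consonant-vowel syllable as a gesture instruction."""
    shape = HAND_SHAPE[consonant]      # the consonant selects the hand shape
    position = HAND_POSITION[vowel]    # the vowel selects the hand position
    return f"Form {shape} and move the hand to {position}."


def gloss_for_utterance(syllables) -> str:
    """Join the per-syllable descriptions into one gloss string."""
    return " ".join(gloss_for_syllable(c, v) for c, v in syllables)


if __name__ == "__main__":
    print(gloss_for_utterance([("m", "a"), ("d", "i")]))
```

A text string of this kind could then be passed to a text encoder as a conditioning prompt; how GlossDiff actually encodes and consumes its Gloss is not specified in this summary.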
### Method overview

The GlossDiff framework consists of three main components:

1. **Knowledge injection module**: Converts spoken text into direct instruction text (Gloss) that describes the corresponding CS gesture actions.
2. **Rhythm module**: Extracts rhythm information from the audio signal so that the generated gestures stay in rhythm with the speech.
3. **Diffusion-based generation module**: Uses a diffusion model to generate precise hand, finger, and lip movements.

### Experimental results

1. **Quantitative results**: On the MCCS dataset, GlossDiff outperforms existing methods on metrics including PCK, MAJE, MAD, and GAD, especially in fine-grained gesture generation and rhythm synchrony.
2. **Ablation experiments**: Ablations verify the contribution of each module, in particular the role of the Gloss cues and the Gloss-CLIP module in fine-grained generation.
3. **Qualitative results**: Visualizations of the generated gestures show that GlossDiff captures fine-grained detail, especially changes in hand position and finger shape.

### Conclusion

By introducing CS Gloss and the audio-driven rhythm module, GlossDiff addresses fine-grained generation and rhythm synchrony in CS gesture generation, supporting barrier-free communication.
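
For readers unfamiliar with PCK, one of the metrics named in the quantitative results: PCK (Percentage of Correct Keypoints) is commonly computed as the fraction of predicted keypoints that fall within a distance threshold of their ground-truth locations. The sketch below is a minimal version under that common definition. The function name `pck`, the absolute (unnormalized) threshold, and the sample coordinates are assumptions made for illustration; the paper's exact threshold and normalization are not given in this summary.

```python
import math


def pck(pred, gt, threshold):
    """Fraction of predicted keypoints within `threshold` of ground truth.

    `pred` and `gt` are equal-length sequences of (x, y) coordinates.
    The threshold is an absolute distance here (an assumption); published
    variants often normalize it by a body-part or bounding-box size.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of keypoints")
    if not pred:
        raise ValueError("at least one keypoint is required")
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return correct / len(pred)


if __name__ == "__main__":
    # Made-up coordinates, not data from the paper.
    predicted = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 0.0)]
    reference = [(0.0, 0.1), (1.0, 1.5), (4.0, 4.0), (2.0, 0.0)]
    print(pck(predicted, reference, threshold=0.6))
```

In this toy input, three of the four keypoints lie within 0.6 units of their references, so the score is 0.75. Higher is better, and the value depends strongly on the chosen threshold.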