Abstract: Cued Speech (CS) is an advanced visual phonetic encoding system that integrates lip reading with hand codings, enabling people with hearing impairments to communicate efficiently. CS video generation aims to produce specific lip and gesture movements of CS from audio or text inputs. The main challenge is that, given limited CS data, the model must simultaneously generate fine-grained hand and finger movements as well as lip movements, while the two kinds of movements need to be asynchronously aligned. Existing CS generation methods are fragile and prone to poor performance because they rely on template-based statistical models and careful hand-crafted pre-processing to fit the models. Therefore, we propose a novel Gloss-prompted Diffusion-based CS Gesture generation framework (called GlossDiff). Specifically, to integrate additional linguistic rule knowledge into the model, we first introduce a bridging instruction called **Gloss**, which is an automatically generated descriptive text that establishes a direct and more delicate semantic connection between spoken language and CS gestures. Moreover, we are the first to suggest that rhythm is an important paralinguistic feature for CS that improves communication efficacy. Therefore, we propose a novel Audio-driven Rhythmic Module (ARM) to learn rhythm that matches audio speech. In addition, we design, record, and publish the first Chinese CS dataset with four CS cuers. Extensive experiments demonstrate that our method quantitatively and qualitatively outperforms current state-of-the-art (SOTA) methods. We release the code and data at

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses several key challenges in **Cued Speech (CS) video generation**:

1. **Fine-grained gesture generation**: Existing CS generation methods perform poorly at generating fine-grained hand and finger movements, especially when training data is limited. This limits the usefulness of CS generation in practical applications.
2. **Asynchronous alignment of lips and gestures**: CS generation must produce hand and lip movements at the same time, and the two streams need to be asynchronously aligned. Existing methods struggle here because they usually rely on template-based statistical models and complex manual pre-processing.
3. **Lack of rhythm information**: Existing CS generation methods ignore rhythm, a paralinguistic feature that the authors argue is important for the communication efficacy of CS.

### Solutions

To address these challenges, the authors propose a new diffusion-based CS gesture generation framework, **GlossDiff**. Its main contributions are:

1. **Introduction of CS Gloss**: An automatically generated descriptive text (Gloss) establishes a direct semantic connection between spoken language and CS gestures, giving the generator more specific cues.
2. **Audio-driven Rhythmic Module (ARM)**: A new audio-driven module learns rhythm dynamics that match the speech signal, improving the naturalness and synchrony of the generated gestures.
3. **New CS dataset**: The first Chinese CS dataset (MCCS), recorded with four CS cuers, is constructed and released.
4. **Superior performance**: In extensive experiments, GlossDiff outperforms existing methods on multiple metrics, especially in fine-grained gesture generation and rhythm synchrony.
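To make the Gloss idea more concrete, here is a minimal sketch of how a descriptive instruction could be derived from a sequence of syllables. It relies only on the general CS principle that hand shapes encode consonants and hand positions encode vowels. The lookup tables, the function names `syllable_to_gloss` and `text_to_gloss`, and the wording of the output are all hypothetical; they are not the real Mandarin Chinese CS chart and not the authors' Gloss generator.

```python
# Hypothetical lookup tables for illustration only.
# They do NOT reproduce the actual Mandarin Chinese CS chart.
HAND_SHAPE_BY_CONSONANT = {"b": 1, "m": 2, "n": 3}
HAND_POSITION_BY_VOWEL = {"a": "side of the face", "i": "chin", "u": "throat"}


def syllable_to_gloss(consonant: str, vowel: str) -> str:
    """Describe one consonant-vowel syllable as a hand shape and a hand position."""
    shape = HAND_SHAPE_BY_CONSONANT[consonant]
    position = HAND_POSITION_BY_VOWEL[vowel]
    return f"hand shape {shape} at the {position}"


def text_to_gloss(syllables: list[tuple[str, str]]) -> str:
    """Join the per-syllable descriptions into one instruction string."""
    return "; ".join(syllable_to_gloss(c, v) for c, v in syllables)


if __name__ == "__main__":
    print(text_to_gloss([("m", "a"), ("n", "i")]))
```

A text of this kind could then be embedded by a text encoder and used as a conditioning signal for the generator; how GlossDiff actually encodes the Gloss is not specified in this summary beyond the mention of a Gloss-CLIP module.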
### Method overview

The GlossDiff framework consists of three main components:

1. **Knowledge injection module**: Converts spoken text into a direct instruction text (Gloss) that describes the corresponding CS gesture actions.
2. **Rhythm module**: Extracts rhythm information from the audio signal so that the generated gestures follow the rhythm of the speech.
3. **Diffusion-based generation module**: Uses a diffusion model to generate precise hand, finger, and lip movements.

### Experimental results

1. **Quantitative results**: On the MCCS dataset, GlossDiff outperforms existing methods on multiple metrics, including PCK, MAJE, MAD, and GAD, especially in fine-grained gesture generation and rhythm synchrony.
2. **Ablation experiments**: Ablations verify the contribution of each module, in particular the role of the Gloss prompts and the Gloss-CLIP module in fine-grained generation.
3. **Qualitative results**: Visualizations of the generated gestures show that GlossDiff captures fine-grained changes in hand position and finger shape.

### Conclusion

By introducing CS Gloss and the audio-driven rhythm module, GlossDiff addresses the problems of fine-grained generation and rhythm synchrony in CS gesture generation, supporting barrier-free communication.
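Two of the metrics named in the experimental results, PCK and MAJE, can be sketched in a few lines. The version below assumes the common definitions: PCK (Percentage of Correct Keypoints) as the fraction of predicted keypoints within a distance threshold of the ground truth, and MAJE (Mean Absolute Joint Error) as the mean absolute coordinate difference. The array layout, the unnormalized threshold, and the function names are assumptions for illustration; the paper's exact evaluation protocol is not given in this summary.

```python
import numpy as np


def pck(pred: np.ndarray, gt: np.ndarray, threshold: float) -> float:
    """Fraction of keypoints whose Euclidean error is within `threshold`.

    pred, gt: arrays of shape (frames, joints, dims). Assumed layout.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)  # per-frame, per-joint error
    return float(np.mean(dist <= threshold))


def maje(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute difference over all frames, joints, and coordinates."""
    return float(np.mean(np.abs(pred - gt)))


if __name__ == "__main__":
    gt = np.zeros((2, 3, 2))   # 2 frames, 3 joints, 2D coordinates
    pred = gt.copy()
    pred[0, 0] = [3.0, 4.0]    # one joint is off by a distance of 5
    print(pck(pred, gt, threshold=1.0), maje(pred, gt))
```

Higher PCK and lower MAJE indicate generated poses that are closer to the reference poses. In practice the threshold is often normalized, for example by a bounding-box size, which this sketch leaves out.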

Bridge to Non-Barrier Communication: Gloss-Prompted Fine-grained Cued Speech Gesture Generation with Diffusion Model
