Cross-Modal Quantization for Co-Speech Gesture Generation

Zheng Wang, Wei Zhang, Long Ye, Dan Zeng, Tao Mei
DOI: https://doi.org/10.1109/tmm.2024.3405743
IF: 7.3
2024-10-19
IEEE Transactions on Multimedia
Abstract: Learning proper representations for speech and gesture is essential for co-speech gesture generation. Existing approaches either use direct representations or encode speech and gesture independently, neglecting a joint representation that highlights the interplay between the two modalities. In this work, we propose a novel Cross-modal Quantization (CMQ) to jointly learn the quantized codes for speech and gesture. Such a representation highlights the speech-gesture interaction before the complex mapping is actually learned, and thus better suits the intricate mapping between speech and gesture. Specifically, the Cross-modal Quantizer jointly encodes speech and gesture as discrete codebooks, enabling better cross-modal interaction. The Cross-modal Predictor subsequently uses the learned codebooks to autoregressively predict the next-step gesture. With cross-modal quantization, our approach yields much higher codebook usage and generates more realistic and diverse gestures in practice. Extensive experiments are conducted on both 3D and 2D datasets, together with a subjective user study, demonstrating a clear performance gain over several baseline models in terms of audio-visual alignment and gesture diversity. In particular, our method demonstrates a three-fold improvement in diversity compared to baseline models, while simultaneously maintaining high motion fidelity.
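To make the terms "quantized codes" and "codebook usage" in the abstract concrete, the following is a minimal, generic vector-quantization sketch. It is not the authors' CMQ model: the joint representation is assumed here to be a plain concatenation of speech and gesture features, the codebook is random rather than learned, and all names and sizes (`quantize`, `codebook_usage`, `T`, `K`, the feature dimensions) are hypothetical.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every feature and every code: shape (T, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def codebook_usage(indices, num_codes):
    """Fraction of codebook entries that are selected at least once."""
    return np.unique(indices).size / num_codes

rng = np.random.default_rng(0)
T, d_speech, d_gesture, K = 200, 8, 12, 32  # hypothetical sizes

speech = rng.normal(size=(T, d_speech))     # stand-in for per-frame speech features
gesture = rng.normal(size=(T, d_gesture))   # stand-in for per-frame gesture features

# Assumption: the joint representation is a simple concatenation of both modalities.
joint = np.concatenate([speech, gesture], axis=1)
codebook = rng.normal(size=(K, d_speech + d_gesture))  # random, untrained codes

codes = quantize(joint, codebook)           # one discrete code index per frame
usage = codebook_usage(codes, K)            # value in (0, 1]
print(f"codebook usage: {usage:.2f}")
```

In a trained quantizer the codebook would be learned, and low usage (many codes never selected) is the failure mode that the abstract says cross-modal quantization reduces.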
computer science, information systems, telecommunications, software engineering