Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network

Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, Kazuhiko Sumi
DOI: https://doi.org/10.1145/3267851.3267878
2018-11-05
Abstract: We present a novel framework to automatically generate natural gesture motions accompanying speech from audio utterances. Based on a Bi-Directional LSTM Network, our deep network learns speech-gesture relationships with both backward and forward consistencies over a long period of time. Our network regresses a full 3D skeletal pose of a human from perceptual features extracted from the input audio in each time step. Then, we apply combined temporal filters to smooth out the generated pose sequences. We utilize a speech-gesture dataset recorded with a headset and marker-based motion capture to train our network. We validated our approach with a subjective evaluation and compared it against "original" human gestures and "mismatched" human gestures taken from a different utterance. The evaluation result shows that our generated gestures are significantly better than the "mismatched" gestures with respect to time consistency. The generated gesture also shows marginally significant improvement in terms of semantic consistency when compared to "mismatched" gestures.
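The smoothing step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which temporal filters are combined, so the two used here (exponential smoothing followed by a centered moving average) are assumptions chosen for illustration, as are the function names, the flattened pose layout, and the default parameters. This is not the authors' implementation.

```python
from typing import List

# Hypothetical layout: one frame is a flat list of joint coordinates.
Pose = List[float]


def exponential_smoothing(poses: List[Pose], alpha: float = 0.5) -> List[Pose]:
    """Blend each frame with the previous smoothed frame (assumed filter)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[Pose] = []
    for pose in poses:
        if not smoothed:
            # The first frame has no history, so it is copied unchanged.
            smoothed.append(list(pose))
        else:
            prev = smoothed[-1]
            smoothed.append(
                [alpha * x + (1.0 - alpha) * p for x, p in zip(pose, prev)]
            )
    return smoothed


def moving_average(poses: List[Pose], window: int = 5) -> List[Pose]:
    """Centered moving average per coordinate (assumed filter).

    The window shrinks near the sequence boundaries, so the output has
    the same number of frames as the input.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    half = window // 2
    smoothed: List[Pose] = []
    for t in range(len(poses)):
        lo = max(0, t - half)
        hi = min(len(poses), t + half + 1)
        frames = poses[lo:hi]
        dims = len(poses[t])
        smoothed.append(
            [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
        )
    return smoothed


def smooth_poses(poses: List[Pose], alpha: float = 0.5, window: int = 5) -> List[Pose]:
    """Apply both illustrative filters in sequence to a predicted pose sequence."""
    return moving_average(exponential_smoothing(poses, alpha), window)
```

In this sketch, `smooth_poses` would be applied to the per-frame poses predicted by the network. A smaller `alpha` or a larger `window` gives smoother motion at the cost of damping fast gestures.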