Low Level Descriptors Based DBLSTM Bottleneck Feature for Speech Driven Talking Avatar

Xinyu Lan, Xu Li, Yishuang Ning, Zhiyong Wu, Helen Meng, Jia, Lianhong Cai
DOI: https://doi.org/10.1109/icassp.2016.7472739
2016-01-01
Abstract: Speech is bimodal in nature: acoustic speech signals correlate closely with visual gestures such as lip movements, facial expressions, and head motions. For a speech driven talking avatar, how to derive more representative acoustic features from which to predict more accurate and realistic visual gestures remains an open research problem. Inspired by the promising performance of low level descriptors (LLD) in speech emotion recognition, this work investigates the use of LLD features for the task of speech driven talking avatar. Furthermore, visual gestures correlate not only with the context information of past or future acoustic features (e.g. anticipatory co-articulation phenomena) but also with textual information (e.g. textual hints for lip movement). To incorporate such information, we also propose to use a deep bidirectional long short-term memory (DBLSTM) network as the bottleneck feature extractor, which can combine LLD features with contextual information. Experimental results indicate that the proposed LLD based DBLSTM bottleneck feature outperforms conventional spectrum related features for the task of speech driven talking avatar, and that more sophisticated contextual information can further improve performance.
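The data flow that the abstract describes (per-frame LLD vectors passed through stacked bidirectional LSTM layers, then projected to a narrow bottleneck) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the abstract gives no layer sizes, training targets, or bottleneck position, so the dimensions (`n_lld`, `n_hid`, `n_bn`), the helper names, and the random untrained weights below are all assumptions chosen only to show how each frame's feature comes to depend on both past and future frames.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class LSTMCell:
    """Standard LSTM cell with untrained random weights (illustration only)."""
    def __init__(self, n_in, n_hid, rng):
        self.n_hid = n_hid
        # One weight matrix per gate over the concatenation [x; h]:
        # i = input gate, f = forget gate, o = output gate, c = candidate.
        self.W = {k: rand_matrix(n_hid, n_in + n_hid, rng) for k in "ifoc"}

    def step(self, x, h, c):
        z = x + h  # list concatenation of input and previous hidden state
        pre = {k: matvec(self.W[k], z) for k in "ifoc"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        cand = [math.tanh(v) for v in pre["c"]]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, cand)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new

    def run(self, seq):
        h, c = [0.0] * self.n_hid, [0.0] * self.n_hid
        outputs = []
        for x in seq:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs

def blstm_layer(fwd, bwd, seq):
    """Concatenate forward-in-time and backward-in-time hidden states per frame."""
    forward = fwd.run(seq)
    backward = bwd.run(seq[::-1])[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]

def bottleneck_features(frames, layers, W_bn):
    """Run stacked BLSTM layers, then project each frame to the bottleneck."""
    hidden = frames
    for fwd, bwd in layers:
        hidden = blstm_layer(fwd, bwd, hidden)
    return [[math.tanh(v) for v in matvec(W_bn, h)] for h in hidden]

rng = random.Random(0)
n_lld, n_hid, n_bn = 6, 4, 3  # hypothetical sizes, not taken from the paper
layers = [
    (LSTMCell(n_lld, n_hid, rng), LSTMCell(n_lld, n_hid, rng)),
    (LSTMCell(2 * n_hid, n_hid, rng), LSTMCell(2 * n_hid, n_hid, rng)),
]
W_bn = rand_matrix(n_bn, 2 * n_hid, rng)

# Five frames of stand-in "LLD" values; real LLDs would come from an acoustic front end.
frames = [[rng.uniform(-1.0, 1.0) for _ in range(n_lld)] for _ in range(5)]
feats = bottleneck_features(frames, layers, W_bn)
```

Here `feats` holds one `n_bn`-dimensional vector per input frame. In a trained system the weights would be learned and the bottleneck vectors would then serve as input to the visual-gesture predictor; the sketch only demonstrates the forward pass.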