Text to Avatar in Multi-modal Human Computer Interface
Yiqiang Chen, Wen Gao, Zhaoqi Wang, Changshui Yang, Dalong Jiang
2003-01-01
Abstract: In this paper, we present a new text-driven avatar system consisting of three major components: a text-to-speech (TTS) unit, a speech-driven facial animation (SDFA) unit, and a text-to-sign-language (TTSL) unit. We propose a new visual prosody time control model and an integrated learning framework to synchronize speech synthesis, face animation, and gesture animation, which is crucial for a multi-modal synthesis system. Given meaningful sentences, the text-to-sign-language unit, working together with the text-to-speech unit, produces visual prosody information, namely gesture animation parameters and timing information for the text-to-speech unit. The text-to-speech unit then produces speech according to that timing information and a set of prosody rules. Finally, the speech directly drives MPEG-4 based face animation, supplemented by rules for facial expressions. This paper highlights the synergies among the audio, visual, and gesture technology components. The performance of our system shows that the proposed approach is suitable and greatly improves the realism of multi-modal speech synthesis.
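The data flow described in the abstract (gesture timing first, speech fitted to that timing, face animation driven by the speech) can be illustrated with a toy sketch. This is not the system's implementation: the function names, the fixed sign duration, the per-character speech rate, and the 25 fps frame rate are all hypothetical placeholders chosen only to show how one shared timeline might keep the three units synchronized.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    word: str
    start: float  # seconds from the start of the sentence
    end: float


def plan_gestures(words, seconds_per_sign=0.6):
    """Hypothetical TTSL stand-in: one fixed-length sign per word."""
    segments, t = [], 0.0
    for word in words:
        segments.append(Segment(word, t, t + seconds_per_sign))
        t += seconds_per_sign
    return segments


def plan_speech(gesture_segments, seconds_per_char=0.08):
    """Hypothetical TTS stand-in: each word starts with its sign and is
    clipped so that it never runs past the end of that sign."""
    speech = []
    for g in gesture_segments:
        natural = len(g.word) * seconds_per_char  # assumed speaking rate
        duration = min(natural, g.end - g.start)
        speech.append(Segment(g.word, g.start, g.start + duration))
    return speech


def plan_face_frames(speech_segments, fps=25):
    """Hypothetical SDFA stand-in: number of face frames per spoken word."""
    return [(s.word, round((s.end - s.start) * fps)) for s in speech_segments]


if __name__ == "__main__":
    gestures = plan_gestures(["hello", "world"])
    speech = plan_speech(gestures)
    for g, s, (word, frames) in zip(gestures, speech, plan_face_frames(speech)):
        print(f"{word}: sign {g.start:.2f}-{g.end:.2f}s, "
              f"speech {s.start:.2f}-{s.end:.2f}s, {frames} face frames")
```

In this sketch the gesture plan owns the timeline and the other two stages only read from it, which mirrors the ordering stated in the abstract. A real system would replace each stand-in with a learned or rule-based model.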