ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou Zhao
2024-04-21
Abstract: Text-to-speech (TTS) has undergone remarkable improvements in performance, particularly with the advent of Denoising Diffusion Probabilistic Models (DDPMs). However, the perceived quality of audio depends not solely on its content, pitch, rhythm, and energy, but also on the physical environment. In this work, we propose ViT-TTS, the first visual TTS model with scalable diffusion transformers. ViT-TTS complements the phoneme sequence with visual information to generate audio of high perceived quality, opening up new avenues for practical applications of AR and VR to allow a more immersive and realistic audio experience. To mitigate the data scarcity in learning visual acoustic information, we 1) introduce a self-supervised learning framework to enhance both the visual-text encoder and the denoiser decoder; 2) leverage a diffusion transformer that scales in parameters and capacity to learn visual scene information. Experimental results demonstrate that ViT-TTS achieves new state-of-the-art results, outperforming cascaded systems and other baselines regardless of the visibility of the scene. With low-resource data (1h, 2h, 5h), ViT-TTS achieves results comparable to rich-resource baselines. Audio samples are available at https://ViT-TTS.github.io/.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
The paper aims to generate audio with echo effects from text and an image of the environment, so that the speech matches the room acoustics of the target scene. Specifically, it focuses on the following points:

1. **Improving perceived audio quality**: Traditional text-to-speech (TTS) systems mainly model semantics, intonation, rhythm, and energy. However, how the generated audio is perceived also depends on the surrounding physical environment. For example, hard surfaces (such as concrete or glass) reflect sound waves, while soft surfaces (such as carpets or curtains) absorb them. Generating high-quality audio therefore requires taking the acoustic characteristics of the room into account.
2. **Fusion of vision and audio**: In virtual reality (VR) and augmented reality (AR) applications, generating audio that matches the visual content is an important requirement. Existing TTS systems usually do not consider visual information, which limits their use in these fields.
3. **Data scarcity**: Training visual TTS models usually requires a large amount of parallel text, visual, and audio data, but such data is scarce because collecting and annotating it is very labor-intensive.

To address these problems, the paper proposes the ViT-TTS model, which relies on the following methods:

- **Vision-text fusion**: A vision-text fusion module combines visual information with text information to generate audio that better fits the target environment.
- **Self-supervised pre-training**: Large-scale unlabeled data is used for self-supervised pre-training to alleviate the data scarcity problem.
- **Diffusion transformer**: A scalable transformer architecture increases the capacity of the model, enabling it to capture visual scene information more effectively.

The experimental results show that ViT-TTS reaches a new state-of-the-art level in perceived audio quality and, under low-resource conditions, achieves results comparable to those of rich-resource baselines.
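
As a rough illustration of the vision-text fusion idea summarized above, the sketch below lets each phoneme embedding attend over a set of image-patch features using single-head cross-attention. This is only one common way to fuse two modalities; the summary does not specify the actual fusion module of ViT-TTS, so the function name `cross_attention_fuse`, the shapes, the residual connection, and the random weights are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(phonemes, patches, w_q, w_k, w_v):
    """Hypothetical fusion step (not the paper's confirmed module).

    phonemes: (T, d) text-side embeddings, used as queries.
    patches:  (P, d) visual features, used as keys and values.
    Returns the fused (T, d) sequence and the (T, P) attention weights.
    """
    q = phonemes @ w_q
    k = patches @ w_k
    v = patches @ w_v
    # Scaled dot-product scores between every phoneme and every patch.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    # Residual connection keeps the original phoneme content.
    fused = phonemes + weights @ v
    return fused, weights


# Toy inputs: 6 phonemes, 4 image patches, embedding size 8 (all assumed).
rng = np.random.default_rng(0)
T, P, d = 6, 4, 8
phonemes = rng.normal(size=(T, d))
patches = rng.normal(size=(P, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_attention_fuse(phonemes, patches, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to 1, so every phoneme receives a convex combination of the visual features. In a trained model the projection matrices would be learned rather than sampled at random.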