Abstract: Text-to-speech (TTS) has undergone remarkable improvements in performance, particularly with the advent of Denoising Diffusion Probabilistic Models (DDPMs). However, the perceived quality of audio depends not solely on its content, pitch, rhythm, and energy, but also on the physical environment. In this work, we propose ViT-TTS, the first visual TTS model with scalable diffusion transformers. ViT-TTS complements the phoneme sequence with visual information to generate audio of high perceptual quality, opening up new avenues for practical applications of AR and VR to allow a more immersive and realistic audio experience. To mitigate the data scarcity in learning visual acoustic information, we 1) introduce a self-supervised learning framework to enhance both the visual-text encoder and the denoiser decoder; 2) leverage the diffusion transformer, which is scalable in terms of parameters and capacity, to learn visual scene information. Experimental results demonstrate that ViT-TTS achieves new state-of-the-art results, outperforming cascaded systems and other baselines regardless of the visibility of the scene. With low-resource data (1h, 2h, 5h), ViT-TTS achieves results comparable to rich-resource baselines. Audio samples are available at https://ViT-TTS.github.io/.

What problem does this paper attempt to address?

The paper aims to generate audio with reverberation (echo) effects from text and an image of the environment, so that the speech matches the room acoustics of the target scene. Specifically, it focuses on the following points:

1. **Improving perceived audio quality**: Traditional text-to-speech (TTS) systems mainly model semantics, intonation, rhythm, and energy. However, the perceived quality of the generated audio also depends on the surrounding physical environment. For example, hard surfaces (such as concrete or glass) reflect sound waves, while soft surfaces (such as carpets or curtains) absorb them. Generating high-quality audio therefore requires taking the acoustic characteristics of the room into account.
2. **Fusion of vision and audio**: In virtual reality (VR) and augmented reality (AR) applications, generating audio that matches the visual content is an important requirement. Existing TTS systems usually do not consider visual information, which limits their use in these fields.
3. **Data scarcity**: Training visual TTS models usually requires a large amount of parallel text, visual, and audio data, but such data is very limited because collecting and annotating it is labor-intensive.

To address these problems, the paper proposes the ViT-TTS model, which contributes the following:

- **Vision-text fusion**: A vision-text fusion module combines visual information with text information to generate audio that better fits the target environment.
- **Self-supervised pre-training**: Large-scale unlabeled data is used for self-supervised pre-training to alleviate the data scarcity problem.
- **Diffusion transformer**: A scalable transformer architecture increases the capacity of the model, enabling it to capture visual scene information more effectively.

The experimental results show that ViT-TTS reaches a new state of the art in perceptual quality and, under low-resource conditions, achieves results comparable to those of rich-resource baselines.
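
The vision-text fusion step described above can be illustrated with a small sketch. The summary does not say how ViT-TTS implements its fusion module, so this example assumes one common choice: cross-attention in which phoneme features act as queries and image patch features act as keys and values. The function name `cross_attention_fuse`, the toy dimensions, and the omission of learned projection matrices are all assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(phonemes, patches):
    """Hypothetical fusion helper (not the paper's module).

    phonemes: list of T vectors of dimension d, used as queries.
    patches:  list of P vectors of dimension d, used as keys and values.
    Returns the T fused vectors and the T x P attention weights.
    Learned query/key/value projections are omitted to keep the sketch short.
    """
    d = len(phonemes[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for q in phonemes:
        # Scaled dot-product score between this phoneme and every image patch.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in patches]
        w = softmax(scores)
        # Attention-weighted average of the patch features.
        context = [sum(w[j] * patches[j][i] for j in range(len(patches)))
                   for i in range(d)]
        # Residual connection: text features enriched with visual context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(w)
    return fused, weights


random.seed(0)
T, P, d = 5, 4, 8  # toy sizes: 5 phonemes, 4 image patches, 8-dim features
phonemes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
patches = [[random.gauss(0, 1) for _ in range(d)] for _ in range(P)]
fused, weights = cross_attention_fuse(phonemes, patches)
print(len(fused), len(fused[0]))  # prints: 5 8
```

The output keeps one vector per phoneme, so a downstream decoder would still receive a sequence aligned with the text, now carrying information about the scene. A real system would add learned projections, multiple heads, and normalization around this core.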

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

LatentSpeech: Latent Diffusion for Text-To-Speech Generation

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

DiffiT: Diffusion Vision Transformers for Image Generation

AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech

DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Vision Transformer Segmentation for Visual Bird Sound Denoising

On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models