Abstract: Various applications of voice synthesis have been developed independently, even though they all generate "voice" as output. In addition, most voice synthesis models currently rely on annotated audio data, but it is crucial to scale them to self-supervised datasets in order to capture the wide range of acoustic variations present in the human voice, including speaker identity, emotion, and prosody. In this work, we propose Make-A-Voice, a unified framework for synthesizing and manipulating voice signals from discrete representations. Make-A-Voice uses a "coarse-to-fine" approach to model the human voice, which involves three stages: 1) semantic stage: model the high-level transformation between linguistic content and self-supervised semantic tokens, 2) acoustic stage: introduce varying control signals as acoustic conditions for semantic-to-acoustic modeling, and 3) generation stage: synthesize high-fidelity waveforms from acoustic tokens. Make-A-Voice offers notable benefits as a unified voice synthesis framework: 1) Data scalability: the major backbone (i.e., the acoustic and generation stages) does not require any annotations, and thus the training data can be scaled up. 2) Controllability and conditioning flexibility: we investigate different conditioning mechanisms and effectively handle three voice synthesis applications, namely text-to-speech (TTS), voice conversion (VC), and singing voice synthesis (SVS), by re-synthesizing the discrete voice representations with prompt guidance. Experimental results demonstrate that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models. Audio samples are available at https://Make-A-Voice.github.io
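The three-stage "coarse-to-fine" flow described in the abstract can be pictured as a simple composition of functions. The following Python sketch is only an illustration of that data flow for the text-to-speech path; the class `VoicePipeline`, the `toy_*` functions, and the token arithmetic are all hypothetical stand-ins invented here, not the models or interfaces used by Make-A-Voice.

```python
from dataclasses import dataclass
from typing import Callable, List

Tokens = List[int]


@dataclass
class VoicePipeline:
    """Hypothetical container chaining the three stages named in the abstract."""
    semantic_stage: Callable[[str], Tokens]              # linguistic content -> semantic tokens
    acoustic_stage: Callable[[Tokens, Tokens], Tokens]   # (semantic, prompt) -> acoustic tokens
    generation_stage: Callable[[Tokens], List[float]]    # acoustic tokens -> waveform samples

    def synthesize(self, text: str, prompt: Tokens) -> List[float]:
        semantic = self.semantic_stage(text)
        acoustic = self.acoustic_stage(semantic, prompt)
        return self.generation_stage(acoustic)


# Toy stand-ins (assumptions): real stages would be learned models.
def toy_semantic(text: str) -> Tokens:
    # Map each character to one of 16 "semantic" token ids.
    return [ord(ch) % 16 for ch in text]


def toy_acoustic(semantic: Tokens, prompt: Tokens) -> Tokens:
    # Condition each semantic token on a (non-empty) prompt, cycling through it.
    return [(s + prompt[i % len(prompt)]) % 32 for i, s in enumerate(semantic)]


def toy_generation(acoustic: Tokens) -> List[float]:
    # Map token ids in [0, 31] to "samples" in [-1.0, 1.0].
    return [a / 31 * 2 - 1 for a in acoustic]


pipeline = VoicePipeline(toy_semantic, toy_acoustic, toy_generation)
waveform = pipeline.synthesize("hi", prompt=[1, 2])
```

In this sketch the prompt enters only at the acoustic stage, mirroring the abstract's description of control signals as acoustic conditions. Voice conversion would presumably start from semantic tokens extracted from source speech rather than from text; that path is not shown here.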

Multi-source Based Acoustic Model for Speech Synthesis

Amplitude Spectrum Based Excitation Model for HMM-Based Speech Synthesis

Acoustic Statistical Modeling Based Speech Synthesis Technologies

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

Acoustic statistical modeling based new generation speech synthesis technology

Multi-speaker Prosodic Instance Selection for HMM-based Speech Synthesis

HMM Based TTS for Mixed Language Text

Multi-Layer F0 Modeling for HMM-Based Speech Synthesis

Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis

MASS: Multi-task anthropomorphic speech synthesis framework

Modulation Spectrum Compensation for HMM-Based Speech Synthesis Using Line Spectral Pairs

An Excitation Model Based On Inverse Filtering For Speech Analysis And Synthesis

Multi-Voice Singing Synthesis From Lyrics

An initial research: Towards accurate pitch extraction for speech synthesis based on BLSTM

SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model

Asynchronous F0 and Spectrum Modeling for HMM-based Speech Synthesis

Make-A-Voice: Unified Voice Synthesis With Discrete Representation

Msdtron: a high-capability multi-speaker speech synthesis system for diverse data using characteristic information

An Improved Sinusoidal Model Based Speech Analyzer and Synthesizer

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

Pitch-Scaled Spectrum Based Excitation Model for HMM-based Speech Synthesis