Abstract: Recent AIGC systems can generate digital multimedia content, such as text, images and video, from human language instructions. For speech, however, existing human instruction-to-speech generation methods have two limitations. First, they require the input to be split into a content prompt (transcript) and a description prompt (style and speaker) instead of directly supporting a human instruction. This split is less natural in form and does not align with other AIGC models. Second, modeling speech style with an independent description prompt, without considering the transcript content, restricts fine-grained control of speech. To address these limitations, we propose VoxInstruct, a novel unified multilingual codec language modeling framework that extends traditional text-to-speech tasks into a general human instruction-to-speech task. Our approach enhances the expressiveness of human instruction-guided speech generation and aligns the speech generation paradigm with other modalities. To enable the model to automatically extract the content of the synthesized speech from raw text instructions, we introduce speech semantic tokens as an intermediate representation for instruction-to-content guidance. We also incorporate multiple Classifier-Free Guidance (CFG) strategies into our codec language model, which strengthen how closely the generated speech follows human instructions. Furthermore, our model architecture and training strategies simultaneously support combining a speech prompt with a descriptive human instruction for expressive speech synthesis, which is a first-of-its-kind attempt. Code, models and demos are at: https://github.com/thuhcsi/VoxInstruct.

Bi-level Codebook Based Speech-driven Visual-speech Synthesis System

Text-To-Visual Speech in Chinese Based on Data-Driven Approach

Realistic Visual Speech Synthesis Based on Hybrid Concatenation Method

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling

Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2

ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders

Text-Independent Voice Conversion Based on State Mapped Codebook

A novel voice conversion system based on codebook mapping with phoneme-tied weighting

Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics

Codebook Sharing in Multi-Stage Vector Quantization

Codebook Enhancement of Vlad Representation for Visual Recognition

Visual-Aware Text-to-Speech

Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis

Emotional Speech Synthesis Based on Improved Codebook Mapping Voice Conversion

Dynamic Audio-Visual Mapping using Fused Hidden Markov Model Inversion Method

A Review of Text-to-Visual Speech Synthesis

LG-VQ: Language-Guided Codebook Learning

VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling