Abstract: Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity texts or images in various tasks. In contrast, the current speech generative models are still struggling regarding speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that resembles multiple speech generation tasks, generating expressive and high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain acoustic details contributing to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate for lower exposure bias and longer context coverage, improving the performance of LMs. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, performs better than other SOTA models. Code will be available at https://github.com/BakerBunker/VecTok.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses two weaknesses of current speech generation models: speech quality and task generalization. The authors propose a framework called Vec-Tok Speech, which combines continuous vectors (speech vectors) with discrete tokens (semantic tokens) to generate high-fidelity, expressive speech.

#### Main Objectives:

1. **High-Fidelity Speech Reconstruction**: Achieve high-quality speech reconstruction by extracting speech vectors that contain rich acoustic details.
2. **Language Modeling**: Capture the linguistic content of speech through semantic tokens, facilitating language modeling.
3. **Multi-Task Applicability**: Apply the framework to downstream tasks such as zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization.
4. **Flexible Expressiveness**: Enable diverse speech synthesis through different speech prompts that control speaking style and speaker timbre.

#### Core Contributions:

1. **Vec-Tok Codec**: A novel speech codec that combines high fidelity with a low bit rate.
2. **Vec-Tok Speech Framework**: Uses a language model to handle various speech generation tasks and introduces Byte-Pair Encoding (BPE) to reduce token length and improve model performance.
3. **Multi-Speaker, Multi-Style TTS**: Demonstrates the framework's capability to perform multi-speaker, multi-style TTS in zero-shot scenarios.

In the reported experiments, Vec-Tok Speech, trained on 50,000 hours of speech data, outperforms other state-of-the-art models.
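
To make the BPE point concrete, here is a minimal sketch of how byte-pair merging shortens a sequence of discrete semantic tokens. It is a generic, textbook-style BPE over integer IDs, not the authors' implementation; the token values, the helper names (`learn_bpe`, `merge_pair`, `decode`), and the choice of `first_new_id=100` are all illustrative assumptions.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair, or None if no pair occurs twice."""
    counts = Counter(zip(seq, seq[1:]))
    if not counts:
        return None
    pair, freq = counts.most_common(1)[0]
    return pair if freq > 1 else None


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seq, num_merges, first_new_id):
    """Greedily learn up to `num_merges` merges; return (compressed, merges)."""
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = first_new_id + step
        merges[pair] = new_id
        seq = merge_pair(seq, pair, new_id)
    return seq, merges


def decode(seq, merges):
    """Expand merged IDs back into the original token sequence."""
    inverse = {new_id: pair for pair, new_id in merges.items()}
    out = list(seq)
    while any(t in inverse for t in out):
        out = [x for t in out for x in (inverse[t] if t in inverse else (t,))]
    return out


# Hypothetical semantic-token IDs; real sequences would come from the codec.
tokens = [3, 3, 7, 7, 3, 3, 7, 7, 5, 3, 3, 7, 7]
compressed, merges = learn_bpe(tokens, num_merges=2, first_new_id=100)
print(len(tokens), "->", len(compressed))
```

Because every merge is reversible, `decode(compressed, merges)` recovers the original sequence, so the LM sees fewer tokens per utterance without losing content. Shorter sequences are what the abstract credits for lower exposure bias and longer context coverage.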

Vec-Tok Speech: speech vectorization and tokenization for neural speech generation
