Vec-Tok Speech: speech vectorization and tokenization for neural speech generation

Xinfa Zhu, Yuanjun Lv, Yi Lei, Tao Li, Wendi He, Hongbin Zhou, Heng Lu, Lei Xie
2023-10-12
Abstract: Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity text and images across various tasks. In contrast, current speech generative models still struggle with speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that supports multiple speech generation tasks and generates expressive, high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain acoustic details that contribute to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate for lower exposure bias and longer context coverage, improving the performance of LMs. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, performs better than other SOTA models. Code will be available at https://github.com/BakerBunker/VecTok.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the limited speech quality and task generalization of current speech generation models. Specifically, the authors propose a new framework called Vec-Tok Speech, which combines continuous vectors (speech vectors) and discrete tokens (semantic tokens) to achieve high-fidelity, expressive speech generation.

#### Main Objectives:

1. **High-Fidelity Speech Reconstruction**: Achieve high-quality speech reconstruction by extracting speech vectors that contain rich acoustic details.
2. **Language Modeling**: Capture the linguistic content of speech through semantic tokens, facilitating language modeling.
3. **Multi-Task Applicability**: Apply the framework to various downstream tasks such as zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization.
4. **Flexible Expressiveness**: Enable diverse speech synthesis through different speech prompts that control speaking style and speaker timbre.

#### Core Contributions:

1. **Vec-Tok Codec**: A novel speech codec that combines high fidelity with low bit rate.
2. **Vec-Tok Speech Framework**: Uses a language model to handle various speech generation tasks and introduces Byte-Pair Encoding (BPE) to reduce token length and improve model performance.
3. **Multi-Speaker, Multi-Style TTS**: Demonstrates the framework's capability to perform multi-speaker, multi-style TTS in zero-shot scenarios.

In the reported experiments, Vec-Tok Speech, trained on 50,000 hours of speech data, outperforms other state-of-the-art models.
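
The effect of BPE on token length mentioned above can be illustrated with a toy sketch. It is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `decode`), the vocabulary size, the number of merges, and the example sequences are all assumptions chosen for illustration, and the summary does not state the paper's actual BPE configuration. The sketch treats semantic tokens as integer IDs, repeatedly merges the most frequent adjacent pair into a new ID, and shows that the shortened sequences can still be expanded back to the originals.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Pair = Tuple[int, int]


def most_frequent_pair(seqs: List[List[int]]) -> Optional[Pair]:
    """Return the most frequent adjacent pair, or None if no pair repeats."""
    counts: Counter = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    # Highest count wins; ties are broken by the smallest pair so the
    # result is deterministic.
    best = min(counts, key=lambda p: (-counts[p], p))
    return best if counts[best] >= 2 else None


def merge_pair(seq: List[int], pair: Pair, new_id: int) -> List[int]:
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out: List[int] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seqs: List[List[int]], vocab_size: int, num_merges: int):
    """Learn up to `num_merges` merges; new IDs start at `vocab_size`."""
    merges: Dict[Pair, int] = {}
    seqs = [list(s) for s in seqs]
    next_id = vocab_size
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        merges[pair] = next_id
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges, seqs


def decode(seq: List[int], merges: Dict[Pair, int]) -> List[int]:
    """Expand merged IDs back into the original base tokens."""
    inverse = {new_id: pair for pair, new_id in merges.items()}
    out: List[int] = []
    stack = list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in inverse:
            left, right = inverse[tok]
            stack.append(right)
            stack.append(left)
        else:
            out.append(tok)
    return out


# Hypothetical semantic-token sequences over a base vocabulary of 5 IDs (0..4).
tokens = [[1, 1, 2, 3, 1, 1, 2, 3], [1, 1, 2, 4]]
merges, merged = learn_bpe(tokens, vocab_size=5, num_merges=3)
print(merged)  # [[7, 7], [6, 4]]
print([decode(s, merges) for s in merged] == tokens)  # True
```

In this toy case, 12 base tokens shrink to 4 merged tokens. Shorter sequences let a fixed-length LM context cover more speech and reduce the number of autoregressive steps, which is the motivation the abstract gives for introducing BPE.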