Abstract: Language model (LM) based audio generation frameworks, e.g., AudioLM, have recently achieved new state-of-the-art performance in zero-shot audio generation. In this paper, we explore the feasibility of LMs for zero-shot voice conversion. An intuitive approach is to follow AudioLM: tokenize speech into semantic and acoustic tokens with HuBERT and SoundStream, respectively, and convert source semantic tokens to target acoustic tokens conditioned on acoustic tokens of the target speaker. However, such an approach encounters several issues: 1) the linguistic content contained in semantic tokens may get dispersed during multi-layer modeling, while the lengthy speech input in the voice conversion task makes contextual learning even harder; 2) the semantic tokens still contain speaker-related information, which may be leaked to the target speech, lowering the target speaker similarity; 3) the generation diversity in the sampling of the LM can produce unexpected outcomes during inference, causing unnatural pronunciation and speech quality degradation. To mitigate these problems, we propose LM-VC, a two-stage language modeling approach that generates coarse acoustic tokens for recovering the source linguistic content and the target speaker's timbre, and then reconstructs the fine acoustic tokens, which carry the acoustic details, to produce the converted speech. Specifically, to enhance content preservation and facilitate better disentanglement, a masked prefix LM with a mask prediction strategy is used for coarse acoustic modeling. This model is encouraged to recover the masked content from the surrounding context and generate target speech based on the target speaker's utterance and corrupted semantic tokens. Besides, to further alleviate the sampling error in the generation, an external LM, which employs window attention to capture the local acoustic relations, is introduced to participate in the coarse acoustic modeling through shallow fusion. Finally, a prefix LM reconstructs fine acoustic tokens from the coarse ones, yielding the converted speech. Experiments demonstrate that LM-VC outperforms competitive systems in speech naturalness and speaker similarity.
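
The shallow fusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the fusion formula or its weight, so this assumes the common form in which the external LM's log-probabilities are added to the main LM's log-probabilities with an interpolation weight. The function names, the `weight` parameter, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def shallow_fusion(main_logits, ext_logits, weight=0.3):
    """Fuse next-token scores from a main LM and an external LM.

    Assumed form (not confirmed by the abstract):
        score(t) = log p_main(t) + weight * log p_ext(t)
    The result is renormalized so it is again a log-distribution.
    """
    main_lp = log_softmax(main_logits)
    ext_lp = log_softmax(ext_logits)
    fused = [a + weight * b for a, b in zip(main_lp, ext_lp)]
    return log_softmax(fused)


# Toy vocabulary of three coarse acoustic tokens (made-up numbers).
main_logits = [2.0, 1.9, -1.0]  # main LM slightly prefers token 0
ext_logits = [0.0, 3.0, -2.0]   # external LM strongly prefers token 1

fused = shallow_fusion(main_logits, ext_logits, weight=0.5)
best = max(range(len(fused)), key=fused.__getitem__)

unfused = shallow_fusion(main_logits, ext_logits, weight=0.0)
best_unfused = max(range(len(unfused)), key=unfused.__getitem__)
```

In this toy case the external LM tips a near-tie in the main LM toward token 1, which is the kind of correction shallow fusion is meant to provide when sampling would otherwise drift. Setting `weight=0.0` recovers the main LM's own distribution.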

Generative Spoken Language Modeling with Quantized Feature Enhancement

Multimodal Latent Language Modeling with Next-Token Diffusion

Text-Free Prosody-Aware Generative Spoken Language Modeling

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation

LM-VC: Zero-Shot Voice Conversion via Speech Generation Based on Language Models

Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model

Generative Spoken Dialogue Language Modeling

How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics

VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling

Generating More Audios for End-to-End Spoken Language Understanding

DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation

Autoregressive Speech Synthesis without Vector Quantization

Evaluating Quantized Large Language Models

QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning

Quantized Embedding Vectors for Controllable Diffusion Language Models

Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech

FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model