Abstract: Language model (LM) based audio generation frameworks, e.g., AudioLM, have recently achieved new state-of-the-art performance in zero-shot audio generation. In this paper, we explore the feasibility of LMs for zero-shot voice conversion. An intuitive approach is to follow AudioLM: tokenize speech into semantic and acoustic tokens with HuBERT and SoundStream, respectively, and convert source semantic tokens to target acoustic tokens conditioned on acoustic tokens of the target speaker. However, such an approach encounters several issues: 1) the linguistic content contained in semantic tokens may get dispersed during multi-layer modeling, while the lengthy speech input in the voice conversion task makes contextual learning even harder; 2) the semantic tokens still contain speaker-related information, which may leak into the target speech, lowering the target speaker similarity; 3) the generation diversity in the sampling of the LM can lead to unexpected outcomes during inference, causing unnatural pronunciation and speech quality degradation. To mitigate these problems, we propose LM-VC, a two-stage language modeling approach that generates coarse acoustic tokens for recovering the source linguistic content and the target speaker's timbre, and then reconstructs the fine acoustic tokens for acoustic details as converted speech. Specifically, to enhance content preservation and facilitate better disentanglement, a masked prefix LM with a mask prediction strategy is used for coarse acoustic modeling. This model is encouraged to recover the masked content from the surrounding context and generate target speech based on the target speaker's utterance and corrupted semantic tokens. Besides, to further alleviate the sampling error in the generation, an external LM, which employs window attention to capture the local acoustic relations, is introduced to participate in the coarse acoustic modeling through shallow fusion. Finally, a prefix LM reconstructs fine acoustic tokens from the coarse ones, yielding the converted speech. Experiments demonstrate that LM-VC outperforms competitive systems in speech naturalness and speaker similarity.
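
The abstract does not state how the external LM's scores are combined with those of the masked prefix LM. As a rough illustration only, the sketch below assumes the common log-linear form of shallow fusion, in which the external LM's log-probabilities are added to the main LM's with a scalar weight before the next coarse acoustic token is chosen. The function names, the toy logits, and the weight of 0.3 are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Convert raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def shallow_fusion(main_logits, ext_logits, weight=0.3):
    """Assumed fusion rule: log p_main + weight * log p_ext, per token.

    `weight` is a hypothetical interpolation coefficient; 0 disables
    the external LM entirely.
    """
    main_lp = log_softmax(main_logits)
    ext_lp = log_softmax(ext_logits)
    return [a + weight * b for a, b in zip(main_lp, ext_lp)]


def greedy_pick(scores):
    """Return the index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three coarse acoustic tokens (made-up numbers).
main_logits = [2.0, 1.9, 0.1]  # main LM slightly prefers token 0
ext_logits = [0.0, 3.0, 0.0]   # external LM strongly prefers token 1

main_choice = greedy_pick(log_softmax(main_logits))
fused_choice = greedy_pick(shallow_fusion(main_logits, ext_logits, weight=0.3))
```

In this toy case the main LM alone picks token 0, while the fused scores pick token 1, because the external LM's strong local preference outweighs the main LM's narrow margin. In a sampling setup the fused scores would be renormalized and sampled from rather than taken greedily.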

LIMI-VC: A Light Weight Voice Conversion Model with Mutual Information Disentanglement

MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

LM-VC: Zero-Shot Voice Conversion via Speech Generation Based on Language Models

EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning

LCM-SVC: Latent Diffusion Model Based Singing Voice Conversion with Inference Acceleration via Latent Consistency Distillation

Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval

Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling

MulliVC: Multi-lingual Voice Conversion With Cycle Consistency

Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion

CLESSR-VC: Contrastive learning enhanced self-supervised representations for one-shot voice conversion

Zero-shot voice conversion based on feature disentanglement

Avqvc: One-Shot Voice Conversion By Vector Quantization With Applying Contrastive Learning

Improving Model Stability and Training Efficiency in Fast, High Quality Expressive Voice Conversion System

Exemplar-Based Sparse Representation Of Timbre And Prosody For Voice Conversion

RefXVC: Cross-Lingual Voice Conversion with Enhanced Reference Leveraging

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy