Abstract: Language model (LM) based audio generation frameworks, e.g., AudioLM, have recently achieved new state-of-the-art performance in zero-shot audio generation. In this paper, we explore the feasibility of LMs for zero-shot voice conversion. An intuitive approach is to follow AudioLM: tokenize speech into semantic and acoustic tokens with HuBERT and SoundStream, respectively, and convert source semantic tokens to target acoustic tokens conditioned on acoustic tokens of the target speaker. However, such an approach encounters several issues: 1) the linguistic content contained in semantic tokens may get dispersed during multi-layer modeling, while the lengthy speech input in the voice conversion task makes contextual learning even harder; 2) the semantic tokens still contain speaker-related information, which may be leaked to the target speech, lowering the target speaker similarity; 3) the generation diversity in the sampling of the LM can lead to unexpected outcomes during inference, causing unnatural pronunciation and speech quality degradation. To mitigate these problems, we propose LM-VC, a two-stage language modeling approach that generates coarse acoustic tokens for recovering the source linguistic content and the target speaker's timbre, and then reconstructs the fine acoustic tokens for acoustic details as converted speech. Specifically, to enhance content preservation and facilitate better disentanglement, a masked prefix LM with a mask prediction strategy is used for coarse acoustic modeling. This model is encouraged to recover the masked content from the surrounding context and generate target speech based on the target speaker's utterance and corrupted semantic tokens. Besides, to further alleviate the sampling error in the generation, an external LM, which employs window attention to capture the local acoustic relations, is introduced to participate in the coarse acoustic modeling through shallow fusion. Finally, a prefix LM reconstructs fine acoustic tokens from the coarse ones, yielding the converted speech. Experiments demonstrate that LM-VC outperforms competitive systems in speech naturalness and speaker similarity.
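
The shallow fusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the fusion formula or weight, so the code below assumes the commonly used form, in which the main LM's log-probabilities are combined with a weighted copy of the external LM's log-probabilities at each decoding step. The function names, the weight value of 0.3, and the toy logits are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def shallow_fusion_scores(main_logits, ext_logits, weight=0.3):
    """Score each candidate token as log p_main + weight * log p_ext.

    `weight` is an assumed hyperparameter; the abstract does not state one.
    """
    main_lp = log_softmax(main_logits)
    ext_lp = log_softmax(ext_logits)
    return [a + weight * b for a, b in zip(main_lp, ext_lp)]


def pick_token(main_logits, ext_logits, weight=0.3):
    """Greedy choice over the fused scores (sampling could be used instead)."""
    scores = shallow_fusion_scores(main_logits, ext_logits, weight)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three coarse acoustic tokens (made-up numbers).
main_logits = [2.0, 1.9, 0.0]  # main LM slightly prefers token 0
ext_logits = [0.0, 3.0, 0.0]   # external LM strongly prefers token 1

alone = pick_token(main_logits, ext_logits, weight=0.0)  # main LM only
fused = pick_token(main_logits, ext_logits, weight=0.3)  # with external LM
```

In this toy case the main LM alone would pick token 0 by a narrow margin, while the fused score shifts the choice to token 1, which is how an external model can steer decoding away from a marginal and possibly erroneous candidate.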

Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

Multimodal Latent Language Modeling with Next-Token Diffusion

Investigating Decoder-only Large Language Models for Speech-to-text Translation

SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data

On decoder-only architecture for speech-to-text and large language model integration

Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing

VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

SpeechVerse: A Large-scale Generalizable Audio Language Model

Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning

LM-VC: Zero-Shot Voice Conversion via Speech Generation Based on Language Models

Efficient Streaming LLM for Speech Recognition

X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning