Abstract: Language model (LM) based audio generation frameworks, e.g., AudioLM, have recently achieved new state-of-the-art performance in zero-shot audio generation. In this paper, we explore the feasibility of LMs for zero-shot voice conversion. An intuitive approach is to follow AudioLM: tokenize speech into semantic and acoustic tokens with HuBERT and SoundStream, respectively, and convert source semantic tokens to target acoustic tokens conditioned on acoustic tokens of the target speaker. However, such an approach encounters several issues: 1) the linguistic content contained in semantic tokens may get dispersed during multi-layer modeling, while the lengthy speech input in the voice conversion task makes contextual learning even harder; 2) the semantic tokens still contain speaker-related information, which may leak into the target speech and lower the target speaker similarity; 3) the generation diversity in the sampling of the LM can lead to unexpected outcomes during inference, causing unnatural pronunciation and degraded speech quality. To mitigate these problems, we propose LM-VC, a two-stage language modeling approach that first generates coarse acoustic tokens to recover the source linguistic content and the target speaker's timbre, and then reconstructs the fine acoustic tokens, which carry the acoustic details, to produce the converted speech. Specifically, to enhance content preservation and facilitate better disentanglement, a masked prefix LM with a mask prediction strategy is used for coarse acoustic modeling. This model is encouraged to recover the masked content from the surrounding context and to generate target speech based on the target speaker's utterance and the corrupted semantic tokens. In addition, to further alleviate sampling errors in generation, an external LM, which employs window attention to capture local acoustic relations, is introduced to participate in the coarse acoustic modeling.
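
The two mechanisms named in the abstract, corrupting semantic tokens for mask prediction and restricting attention to a local window, can be illustrated with a minimal sketch. This is not the LM-VC implementation: the mask token id, the span-based masking scheme, the mask ratio, and the causal form of the window are all assumptions made for illustration, and the function names are hypothetical.

```python
import random

MASK_ID = -1  # hypothetical id for the mask token (assumption, not from the paper)


def corrupt_semantic_tokens(tokens, mask_ratio=0.25, span=3, seed=0):
    """Replace randomly chosen contiguous spans of semantic tokens with MASK_ID.

    Returns the corrupted sequence and the sorted list of masked positions.
    The span-based scheme and the default ratio are illustrative assumptions.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    target = int(len(corrupted) * mask_ratio)  # number of positions to mask
    masked = set()
    while len(masked) < target:
        start = rng.randrange(len(corrupted))
        for i in range(start, min(start + span, len(corrupted))):
            if len(masked) < target:
                masked.add(i)
                corrupted[i] = MASK_ID
    return corrupted, sorted(masked)


def local_causal_mask(length, window):
    """Build a boolean attention mask for window attention.

    mask[i][j] is True when position i may attend to position j, i.e. when
    j is one of the `window` most recent positions up to and including i.
    Making the window causal is an assumption of this sketch.
    """
    return [[0 <= i - j < window for j in range(length)] for i in range(length)]


if __name__ == "__main__":
    semantic = list(range(100, 120))  # stand-in for HuBERT token ids
    corrupted, positions = corrupt_semantic_tokens(semantic)
    print(corrupted)
    print(positions)
    for row in local_causal_mask(5, 2):
        print(["x" if allowed else "." for allowed in row])
```

In a training setup along these lines, the corrupted sequence would serve as the prefix input and the model would be asked to recover the content at the masked positions, while the boolean mask would limit each acoustic token to its nearest predecessors.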

Language Transfer of Audio Word2Vec: Learning Audio Segment Representations Without Target Language Data

Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder

Audio Word2vec: Sequence-to-Sequence Autoencoding for Unsupervised Learning of Audio Segmentation and Representation

Segmental Audio Word2Vec: Representing Utterances as Sequences of Vectors with Applications in Spoken Term Detection

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Improved acoustic word embeddings for zero-resource languages using multilingual transfer

Transfer Learning of wav2vec 2.0 for Automatic Lyric Transcription

Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection

Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens

Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

Multilingual acoustic word embedding models for processing zero-resource languages

Unified Video-Language Pre-training with Synchronized Audio

Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks

Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages

BrainTalker: Low-Resource Brain-to-Speech Synthesis with Transfer Learning using Wav2Vec 2.0

LM-VC: Zero-shot Voice Conversion via Speech Generation based on Language Models

Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition

Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval