Abstract: In voice conversion (VC), preserving the speaking style of the source speech matters as much as conveying its linguistic content, especially in scenarios with highly expressive source speech such as dubbing and data augmentation. Previous work generally modeled the source speaking style with explicit prosodic features or a fixed-length style embedding extracted from the source speech, which is insufficient for both comprehensive style modeling and preservation of the target speaker's timbre. Inspired by the multi-scale nature of style in human speech, this paper proposes a multi-scale style modeling method for the VC task, referred to as MSM-VC. MSM-VC models the speaking style of the source speech at different levels. To convey the speaking style effectively while preventing timbre leakage from the source speech to the converted speech, each level's style is modeled by a dedicated representation. Specifically, prosodic features, bottleneck features from a pre-trained ASR model, and features extracted by a model trained with a self-supervised strategy are adopted to model the frame-level, local-level, and global-level styles, respectively. In addition, to balance source style modeling against target speaker timbre preservation, an explicit constraint module consisting of a pre-trained speech emotion recognition model and a speaker classifier is introduced into MSM-VC. This explicit constraint module also makes it possible to simulate the style transfer inference process during training, which improves the disentanglement ability and alleviates the mismatch between training and inference. Experiments on a highly expressive speech corpus demonstrate that MSM-VC outperforms state-of-the-art VC methods in modeling source speech style while maintaining good speech quality and speaker similarity.
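As a rough illustration of what combining style features at several time scales can look like, the following Python sketch aligns frame-level, local-level, and global-level vectors on a common frame axis and concatenates them. It is not the MSM-VC implementation, which the abstract does not specify. The function name `fuse_multiscale`, the assumption that each local vector covers a fixed number of frames, and all toy values are hypothetical.

```python
from typing import List

Vector = List[float]


def fuse_multiscale(
    frame_feats: List[Vector],
    local_feats: List[Vector],
    global_feat: Vector,
    frames_per_segment: int,
) -> List[Vector]:
    """Concatenate frame-, local-, and global-level features per frame.

    Hypothetical illustration only: each local vector is assumed to cover
    a fixed-size segment of `frames_per_segment` frames, and the global
    feature is assumed to be a single utterance-level vector.
    """
    if frames_per_segment <= 0:
        raise ValueError("frames_per_segment must be positive")
    if not local_feats:
        raise ValueError("local_feats must contain at least one vector")

    fused = []
    for t, frame_vec in enumerate(frame_feats):
        # Segment covering frame t, clamped to the last available segment.
        seg = min(t // frames_per_segment, len(local_feats) - 1)
        # List concatenation builds a new per-frame vector.
        fused.append(frame_vec + local_feats[seg] + global_feat)
    return fused


# Toy input: 5 frames of (pitch, energy), 2 local segments, 1 global vector.
frames = [[100.0, 0.5], [110.0, 0.6], [120.0, 0.7], [115.0, 0.6], [105.0, 0.4]]
local = [[1.0], [2.0]]
glob = [9.0]
fused = fuse_multiscale(frames, local, glob, frames_per_segment=3)
```

In this toy setup every frame ends up with a four-dimensional vector: its own two prosodic values, the value of the segment it falls in, and the shared utterance-level value. A real system would learn each representation and would need extra constraints, such as the speaker classifier mentioned in the abstract, to keep source timbre out of the fused features.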

MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling
