Abstract: Background noise is usually treated as redundant or even harmful to voice conversion. Therefore, when converting noisy speech, a pretrained speech separation module is usually deployed to estimate clean speech prior to conversion. However, this can lead to speech distortion due to the mismatch between the separation module and the conversion module. In this paper, a noise-robust voice conversion model is proposed in which the user can freely choose to retain or remove the background sounds. First, a speech separation module with a dual-decoder structure is proposed, where the two decoders decode the denoised speech and the background sounds, respectively. A bridge module captures the interactions between the denoised speech and the background sounds in parallel layers through information exchange. Subsequently, a voice conversion module with multiple encoders converts the estimated clean speech produced by the speech separation module. Finally, the speech separation and voice conversion modules are jointly trained using a loss function that combines cycle loss and mutual information loss, aiming to improve the decoupling of speech content, pitch, and speaker identity. Experimental results show that the proposed model obtains significant improvements in both subjective and objective evaluation metrics compared with the existing baselines. The speech naturalness and speaker similarity of the converted speech are 3.47 and 3.43, respectively.
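
The joint training objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of either term or their weights, so everything specific here is an assumption: the cycle loss is taken to be a mean absolute difference between the source features and the features recovered after a conversion cycle, the mutual information term is taken to be a sum of externally estimated pairwise values (content/pitch, content/speaker, pitch/speaker), and the names `l1_distance`, `joint_loss`, `cycle_weight`, and `mi_weight` are hypothetical.

```python
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    if not a:
        raise ValueError("feature vectors must be non-empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def joint_loss(
    source_feats: Sequence[float],
    cycle_feats: Sequence[float],
    mi_estimates: Sequence[float],
    cycle_weight: float = 1.0,  # hypothetical weight, not from the paper
    mi_weight: float = 0.1,     # hypothetical weight, not from the paper
) -> float:
    """Weighted sum of a cycle term and a mutual information term.

    source_feats: features of the original utterance.
    cycle_feats:  features after converting to a target speaker and back
                  (assumed to be produced elsewhere by the model).
    mi_estimates: assumed pairwise mutual information estimates between the
                  content, pitch, and speaker representations.
    """
    cycle_term = l1_distance(source_feats, cycle_feats)
    mi_term = sum(mi_estimates)
    return cycle_weight * cycle_term + mi_weight * mi_term


if __name__ == "__main__":
    loss = joint_loss(
        source_feats=[1.0, 2.0],
        cycle_feats=[1.0, 4.0],
        mi_estimates=[0.5, 0.5, 1.0],
        cycle_weight=1.0,
        mi_weight=0.5,
    )
    print(loss)
```

Minimizing the cycle term encourages a round-trip conversion to reproduce the input, while penalizing the mutual information estimates pushes the content, pitch, and speaker representations to carry less shared information. In a real system both terms would be computed on tensors inside the training loop; the plain-list version is only meant to show how the two terms combine.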

A noise-robust voice conversion method with controllable background sounds
