Abstract: Background noise is usually treated as redundant or even harmful to voice conversion. Therefore, when converting noisy speech, a pretrained speech separation module is usually deployed to estimate clean speech prior to the conversion. However, this can lead to speech distortion due to the mismatch between the separation module and the conversion module. In this paper, a noise-robust voice conversion model is proposed, in which a user can freely choose to retain or remove the background sounds. First, a speech separation module with a dual-decoder structure is proposed, where the two decoders decode the denoised speech and the background sounds, respectively. A bridge module captures the interactions between the denoised speech and the background sounds in parallel layers through information exchange. Subsequently, a voice conversion module with multiple encoders is used to convert the estimated clean speech from the speech separation module. Finally, the speech separation and voice conversion modules are jointly trained using a loss function that combines cycle loss and mutual information loss, aiming to improve the decoupling of speech content, pitch, and speaker identity. Experimental results show that the proposed model obtains significant improvements in both subjective and objective evaluation metrics compared with the existing baselines. The speech naturalness and speaker similarity of the converted speech are 3.47 and 3.43, respectively.
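
The data flow described in the abstract (separate, convert, then optionally keep the background) can be illustrated with a minimal sketch. Everything named here is hypothetical: `convert_noisy_speech`, `toy_separate`, and `toy_convert` are illustrative stand-ins, not the paper's models, and the assumption that a retained background is re-added by sample-wise summation is not stated in the abstract.

```python
from typing import Callable, List, Tuple

Signal = List[float]


def convert_noisy_speech(
    noisy: Signal,
    separate: Callable[[Signal], Tuple[Signal, Signal]],
    convert: Callable[[Signal], Signal],
    keep_background: bool,
) -> Signal:
    """Separate, convert, and optionally re-add the background (hypothetical pipeline)."""
    # Stand-in for the dual-decoder separation: (denoised speech, background sounds).
    speech, background = separate(noisy)
    # Stand-in for voice conversion applied to the estimated clean speech.
    converted = convert(speech)
    if not keep_background:
        return converted
    # Assumption: the background is re-added by sample-wise summation.
    return [c + b for c, b in zip(converted, background)]


# Toy stand-ins so the sketch runs; they do not model real separation or conversion.
def toy_separate(noisy: Signal) -> Tuple[Signal, Signal]:
    background = [0.5 for _ in noisy]
    speech = [x - 0.5 for x in noisy]
    return speech, background


def toy_convert(speech: Signal) -> Signal:
    return [2.0 * s for s in speech]


if __name__ == "__main__":
    noisy_input = [1.5, 2.5]
    print(convert_noisy_speech(noisy_input, toy_separate, toy_convert, keep_background=True))
    print(convert_noisy_speech(noisy_input, toy_separate, toy_convert, keep_background=False))
```

Passing the separation and conversion steps as callables keeps the sketch independent of any particular model; in the proposed system the two modules are trained jointly rather than swapped in independently.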

A Study on Low-Latency Recognition-Synthesis-Based Any-to-One Voice Conversion

ALO-VC: Any-to-any Low-latency One-shot Voice Conversion

Low-latency Real-time Voice Conversion on CPU

Improving Recognition-Synthesis Based Any-to-one Voice Conversion with Cyclic Training

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning

Adversarial Post-Processing of Voice Conversion Against Spoofing Detection

RAVE for Speech: Efficient Voice Conversion at High Sampling Rates

A Compact Framework For Voice Conversion Using Wavenet Conditioned On Phonetic Posteriorgrams

Non-Parallel Voice Conversion with Autoregressive Conversion Model and Duration Adjustment

Residual Speaker Representation for One-Shot Voice Conversion

Spectro-Temporal Modelling with Time-Frequency Lstm and Structured Output Layer for Voice Conversion

NeuralVC: Any-to-Any Voice Conversion Using Neural Networks Decoder for Real-Time Voice Conversion

Voice Conversion With Just Nearest Neighbors

A noise-robust voice conversion method with controllable background sounds

Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme

Deep Neural Network Based Voice Conversion with A Large Synthesized Parallel Corpus

Voice Conversion towards Arbitrary Speakers With Limited Data

Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss

Disentangling Content and Fine-Grained Prosody Information Via Hybrid ASR Bottleneck Features for Voice Conversion

Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer