Abstract: Non-parallel voice conversion (VC) has made considerable progress in recent years thanks to the use of self-supervised pre-trained representations (SSPR). Features extracted by the pre-trained model are expected to contain more content information. However, common SSPR-based VC has no dedicated mechanism for removing speaker information during content representation extraction, which prevents further removal of speaker information from the SSPR representation. Moreover, conventional VC often selects the Mel-spectrogram as the reconstructed acoustic feature, which is inconsistent with the input of the content encoder and results in some information loss. Motivated by these issues, we propose W2VC. W2VC consists of three parts: (1) we reconstruct the feature from the WavLM representation (WLMR), which is more consistent with the input of the content encoder; (2) connectionist temporal classification (CTC) is used to align the content representation with the text at the phoneme level, and the content encoder together with a gradient reversal layer (GRL) based speaker classifier is used to remove speaker information during content representation extraction; (3) a WLMR-based HiFi-GAN is trained to convert WLMR to a speech waveform. VC experimental results show that GRL effectively purifies the content information of the self-supervised model, and that GRL purification and CTC supervision of the content encoder are complementary in improving VC performance. Moreover, speech synthesized with the vocoder retrained on WLMR achieves better results in both subjective and objective evaluation. The proposed method is evaluated on the VCTK and CMU databases. It achieves an objective MCD of 8.901 and subjective MOS scores of 4.45 for speech naturalness and 3.62 for speaker similarity, outperforming the baseline.
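As a rough illustration of the objective metric mentioned above, the following sketch computes mel-cepstral distortion (MCD) in its commonly used form, averaged over frames. It assumes the reference and converted frames are already time-aligned and that the 0th (energy) coefficient has been dropped. The function name `mel_cepstral_distortion` and the input layout are hypothetical, and the abstract does not state the exact MCD configuration (number of coefficients, alignment method), so this is not necessarily how the reported 8.901 was obtained.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB over time-aligned mel-cepstral frames.

    Assumption: each frame is a sequence of mel-cepstral coefficients
    with the 0th coefficient already excluded, and both inputs have
    the same number of frames (e.g. after alignment).
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("inputs must be non-empty and have equal frame counts")

    # Constant factor of the common MCD definition: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same number of coefficients")
        # Euclidean distance between the two coefficient vectors
        total += math.sqrt(sum((r - c) ** 2 for r, c in zip(ref, conv)))

    return scale * total / len(ref_frames)
```

Lower values indicate that the converted speech is spectrally closer to the reference; identical inputs give an MCD of 0.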

End-to-end Speech Topic Classification Based on Pre-Trained Model WavLM

Speech Topic Classification Based on Pre-trained and Graph Networks.

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network.

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM.

Topic Classification on Spoken Documents Using Deep Acoustic and Linguistic Features

End-To-End Topic Classification Without ASR

Speech Topic Classification Based on Multi-Scale and Graph Attention Networks

Cascaded CNN-resBiLSTM-CTC: an End-to-End Acoustic Model for Speech Recognition.

End-to-end spoken language understanding using joint CTC loss and self-supervised, pretrained acoustic encoders

W2VC: WavLM representation based one-shot voice conversion with gradient reversal distillation and CTC supervision

Pre-Trained Acoustic-and-Textual Modeling for End-To-End Speech-To-Text Translation.

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Exploring Model Units and Training Strategies for End-to-End Speech Recognition

Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation

End-to-end Monaural Multi-speaker ASR System Without Pretraining.

An End-to-End Speech Enhancement Framework Using Stacked Multi-scale Blocks.

Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition

End-to-End Speech Translation with Knowledge Distillation

Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-trained DNN-HMM-Based Acoustic-Phonetic Model

Pre-training for Speech Translation: CTC Meets Optimal Transport

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data