Abstract: Recognition-synthesis-based methods have become popular for voice conversion (VC). By introducing linguistic features with good disentangling characteristics, extracted from an automatic speech recognition (ASR) model, VC performance has improved considerably. Recently, self-supervised learning (SSL) methods trained on large-scale unannotated speech corpora have been applied to downstream tasks that focus on content information, which makes them suitable for VC. However, the large amount of speaker information in SSL representations significantly degrades the timbre similarity and quality of converted speech. To address this problem, we propose a high-similarity any-to-one voice conversion method that takes SSL representations as input. We incorporate adversarial training mechanisms into the synthesis module using external unannotated corpora. Two auxiliary discriminators are trained to distinguish, respectively, whether a sequence of mel-spectrograms has been converted by the acoustic model and whether a sequence of content embeddings contains speaker information from external corpora. Experimental results show that the proposed method achieves comparable similarity and higher naturalness relative to the supervised method, which needs a large amount of annotated corpora for training, and that it can also be applied to improve similarity for VC methods that use other SSL representations as input.

SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion

XWSB: A Blend System Utilizing XLS-R and WavLM with SLS Classifier detection system for SVDD 2024 Challenge

VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

Refining Self-Supervised Learnt Speech Representation using Brain Activations

Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models

Towards Robust Speaker Verification with Target Speaker Enhancement

LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models

Speech foundation models on intelligibility prediction for hearing-impaired listeners

SLM: Bridge the thin gap between speech and text foundation models

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion

VITS-based Singing Voice Conversion System with DSPGAN post-processing for SVCC2023

Large-Scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification

A Systematic Exploration of Joint-training for Singing Voice Synthesis

Utilizing Self-supervised Representations for MOS Prediction

VOT: Revolutionizing Speaker Verification with Memory and Attention Mechanisms

S2VC - A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations