Abstract: This paper proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is suitable for parallel computations using GPUs. It is also beneficial since it enables effective normalization techniques such as batch normalization to be used for all the hidden layers in the networks. Second, it achieves many-to-many conversion by simultaneously learning mappings among multiple speakers using only a single model instead of separately learning mappings between each speaker pair using a different model. This enables the model to fully utilize available training data collected from multiple speakers by capturing common latent features that can be shared across different speakers. Owing to this structure, our model works reasonably well even without source speaker information, thus making it able to handle any-to-many conversion tasks. Third, we introduce a mechanism called conditional batch normalization, which switches batch normalization layers in accordance with the target speaker. This mechanism has been found to be extremely effective for our many-to-many conversion model. We conducted speaker identity conversion experiments and found that ConvS2S-VC obtained higher sound quality and speaker similarity than baseline methods. We also found from audio examples that it could perform well in various tasks including emotional expression conversion, electrolaryngeal speech enhancement, and English accent conversion.
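
To make the third feature more concrete, the following is a minimal sketch of one common reading of conditional batch normalization: the batch statistics are shared, while the scale and shift parameters are selected by a target-speaker index. It is not the authors' implementation. The class name `ConditionalBatchNorm1d`, the arguments `num_features`, `num_speakers`, and `speaker_id`, and the parameter values are all assumptions made for illustration, and the sketch works on plain Python lists rather than convolutional feature maps and omits the running statistics used at inference time.

```python
import math


class ConditionalBatchNorm1d:
    """Batch normalization with per-speaker scale and shift (hypothetical sketch)."""

    def __init__(self, num_features, num_speakers, eps=1e-5):
        self.eps = eps
        # One (gamma, beta) pair per target speaker; identity transform by default.
        self.gamma = [[1.0] * num_features for _ in range(num_speakers)]
        self.beta = [[0.0] * num_features for _ in range(num_speakers)]

    def __call__(self, batch, speaker_id):
        # batch: list of N feature vectors, each of length num_features.
        n = len(batch)
        num_features = len(batch[0])
        out = [[0.0] * num_features for _ in range(n)]
        for j in range(num_features):
            column = [row[j] for row in batch]
            mean = sum(column) / n
            var = sum((v - mean) ** 2 for v in column) / n
            inv_std = 1.0 / math.sqrt(var + self.eps)
            # The normalization is shared; only the affine parameters
            # are switched according to the target speaker.
            g = self.gamma[speaker_id][j]
            b = self.beta[speaker_id][j]
            for i in range(n):
                out[i][j] = g * (column[i] - mean) * inv_std + b
        return out


# Usage: the same input normalized for two different target speakers.
cbn = ConditionalBatchNorm1d(num_features=2, num_speakers=3)
cbn.gamma[1] = [2.0, 2.0]  # illustrative values, not learned parameters
cbn.beta[1] = [0.5, 0.5]
x = [[1.0, 2.0], [3.0, 6.0]]
y0 = cbn(x, speaker_id=0)
y1 = cbn(x, speaker_id=1)
```

In this sketch the outputs for the two speakers differ only by the speaker-specific affine transform, which is the sense in which the layer is "switched" by the target speaker while the rest of the network is shared.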

CoMoSVC: Consistency Model-based Singing Voice Conversion

CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

LCM-SVC: Latent Diffusion Model Based Singing Voice Conversion with Inference Acceleration via Latent Consistency Distillation

FastSVC: Fast Cross-Domain Singing Voice Conversion with Feature-wise Linear Modulation

LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling

ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps

SaMoye: Zero-shot Singing Voice Conversion Model Based on Feature Disentanglement and Enhancement

Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion

SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion

Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

MulliVC: Multi-lingual Voice Conversion With Cycle Consistency

VITS-based Singing Voice Conversion System with DSPGAN post-processing for SVCC2023

A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

Robust One-Shot Singing Voice Conversion

PPG-based singing voice conversion with adversarial representation learning

SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion

FastS2S-VC: Streaming Non-Autoregressive Sequence-to-Sequence Voice Conversion

LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism