U2-VC: One-Shot Voice Conversion Using Two-Level Nested U-structure

Chinese Academy of Sciences, Wang Hui, University of Chinese Academy of Sciences
DOI: https://doi.org/10.1186/s13636-021-00226-3
IF: 2.114
2021-01-01
EURASIP Journal on Audio Speech and Music Processing
Abstract: Voice conversion transforms the voice of a source speaker into that of a target speaker while keeping the linguistic content unchanged. One-shot voice conversion has recently become a hot topic because of its potentially wide range of applications: it can convert the voice of any source speaker to that of any target speaker, even when both speakers are unseen during training. Although great progress has been made in one-shot voice conversion, the naturalness of the converted speech remains a challenging problem. To further improve naturalness, this paper proposes U2-VC, a voice conversion algorithm based on a two-level nested U-structure (U2-Net). The U2-Net can extract both local and multi-scale features of the log-mel spectrogram, which helps the model learn the time-frequency structures of the source speech and the target speech. Moreover, we adopt sandwich adaptive instance normalization (SaAdaIN) in the decoder for speaker identity transformation, to retain more content information of the source speech while maintaining the speaker similarity between the converted speech and the target speech. Experiments on the VCTK dataset show that U2-VC outperforms many state-of-the-art approaches, including AGAIN-VC and AdaIN-VC, in terms of both objective and subjective measurements.
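The abstract does not spell out how SaAdaIN works, so the sketch below illustrates only plain adaptive instance normalization (AdaIN), the operation that SaAdaIN builds on. It is not the paper's method. It assumes that features are stored as a list of channels, each a list of values over time, and the names `channel_stats` and `adain` are hypothetical. Each content channel is normalized to zero mean and unit variance, then rescaled with the mean and standard deviation of the corresponding target-speaker channel.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Return (mean, std) of one channel, i.e. one list of values over time."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return mean, math.sqrt(var + eps)  # eps avoids division by zero


def adain(content, style, eps=1e-5):
    """Plain AdaIN (illustrative only, not SaAdaIN).

    content, style: lists of channels; each channel is a list over time.
    The two inputs may have different lengths in time.
    """
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = channel_stats(c_chan, eps)
        s_mean, s_std = channel_stats(s_chan, eps)
        # Remove the source statistics, then impose the target statistics.
        out.append([s_std * (v - c_mean) / c_std + s_mean for v in c_chan])
    return out


# Toy features: 2 channels; the values are made up for illustration.
content = [[0.0, 1.0, 2.0, 3.0], [5.0, 5.5, 6.0, 6.5]]
style = [[10.0, 12.0, 14.0], [-1.0, 0.0, 1.0]]
converted = adain(content, style)
```

In this toy setting, each channel of `converted` keeps the temporal shape of the content features but takes on the per-channel mean and, approximately, the standard deviation of the style features. That is the sense in which AdaIN-style layers transfer speaker identity while preserving content.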