Mel-S3R: Combining Mel-spectrogram and self-supervised speech representation with VQ-VAE for any-to-any voice conversion

Jichen Yang, Yi Zhou, Hao Huang
DOI: https://doi.org/10.1016/j.specom.2023.05.004
IF: 2.723
2023-05-01
Speech Communication
Abstract: Self-supervised speech representations (S3R) have succeeded in many downstream tasks, such as speaker recognition and voice conversion, thanks to the high-level information they carry. Voice conversion (VC) is the task of converting source speech into a target speaker’s voice. Although S3R features effectively encode content and speaker information, spectral features contain low-level acoustic information that is complementary to S3R. As a result, relying solely on S3R features for VC may not be optimal. To obtain a speech representation that carries both high-level learned information and low-level spectral detail for VC, we propose a three-level attention mechanism that combines the Mel-spectrogram (Mel) and S3R; the combined representation is denoted Mel-S3R. In particular, S3R features are high-level learned representations extracted by a network pre-trained with self-supervised learning, whereas Mel is a spectral feature representing acoustic information. The proposed Mel-S3R is then used as the input to an any-to-any VQ-VAE-based VC system, with VC serving as the downstream task in the experiments. Objective metrics and subjective listening tests show that the proposed Mel-S3R representation enables the VC framework to achieve robust performance in terms of both speech quality and speaker similarity.
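As a rough illustration of the two ideas named in the abstract, the sketch below fuses one Mel frame and one S3R frame with softmax weights and then maps the result to its nearest codebook entry, as a VQ-VAE bottleneck does. It is a toy stand-in, not the paper's three-level attention: the function names (`fuse_frame`, `quantize`), the fixed attention scores, the equal feature dimensions, and the tiny codebook are all assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a short list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_frame(mel_vec, s3r_vec, scores):
    # Hypothetical single-level fusion: a softmax-weighted sum of two
    # equal-length frame vectors. In a real model the scores would be
    # learned; here they are passed in by hand.
    w_mel, w_s3r = softmax(scores)
    return [w_mel * m + w_s3r * s for m, s in zip(mel_vec, s3r_vec)]

def quantize(vec, codebook):
    # Vector quantization step: pick the codebook entry with the
    # smallest squared Euclidean distance to the input vector.
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))
    return idx, codebook[idx]

# Toy usage with made-up 2-dimensional features.
mel = [1.0, 0.0]
s3r = [0.0, 1.0]
fused = fuse_frame(mel, s3r, scores=[0.0, 0.0])  # equal scores give equal weights
codebook = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
index, code = quantize(fused, codebook)
```

With equal scores the fused frame is the average of the two inputs, and the quantizer returns the index of the closest codebook vector; a trained system would learn both the attention scores and the codebook.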
computer science, interdisciplinary applications; acoustics