Any-to-Any Voice Conversion With Multi-Layer Speaker Adaptation and Content Supervision

Xuexin Xu,Liang Shi,Xunquan Chen,Pingyuan Lin,Jie Lian,Jinhui Chen,Zhihong Zhang,Edwin R. Hancock
DOI: https://doi.org/10.1109/TASLP.2023.3306716
2023-01-01
Abstract:Any-to-any voice conversion can be performed among arbitrary speakers, even with a single reference utterance. Many related studies have demonstrated that it can be effectively implemented by speech representation disentanglement. However, most existing solutions fuse the speaker representations into the content features globally without considering their distribution difference. Additionally, in the any-to-any scenario, there is no effective method ensuring the consistency of linguistic content without text transcription or additional information extracted from additional modules (e.g., automatic speech recognition). Hence, to alleviate the above problems, this paper proposes SACS-VC, a novel any-to-any voice conversion method that combines two principal modules: Speaker Adaptation and Content Supervision. Specifically, we rearrange the timbre representations according to the content distribution using a temporal attention mechanism to obtain finer-grained speaker timbre information for each content feature. Meanwhile, we associate the converted outputs and source utterances directly to supervise the consistency of the semantic content in an unsupervised manner. This is achieved using contrastive learning based on the corresponding and non-corresponding locations of content features. It should be noted that SACS-VC can be implemented using a non-parallel speech corpus without any pertaining. The experimental results demonstrate that the proposed method outperforms current state-of-the-art any-to-any voice conversion systems in objective and subjective evaluation settings.
What problem does this paper attempt to address?