Pyramid Attention CycleGAN for Non-Parallel Voice Conversion

Xianchen Liang, Zhisong Bie, Shiwei Ma
DOI: https://doi.org/10.1109/iccc56324.2022.10065952
2022-01-01
Abstract: Non-parallel voice conversion (VC) is a voice mapping technology that uses a non-parallel corpus to convert source speech into target speech while keeping the semantic information unchanged. Cycle-consistent adversarial network-based VC with filling in frames (MaskCycleGAN-VC) has been proposed and is generally accepted as a current benchmark method. Although it solves the problem of time-frequency structure consistency, its conversion performance is not yet satisfactory: a large gap remains between target and converted voices in terms of naturalness and similarity. In addition, the performance of MaskCycleGAN-VC deteriorates seriously when the amount of training data is limited. To solve these problems, we propose Pyramid Attention CycleGAN (PACycleGAN) for voice conversion, which integrates a pyramid structure and an attention mechanism. We use a method named differentiable augmentation to improve the data efficiency of GANs and make training more stable. We evaluate the performance of PACycleGAN on inter-gender and intra-gender non-parallel VC. Subjective and objective evaluations of naturalness and speaker similarity show that PACycleGAN-VC outperforms MaskCycleGAN-VC for every VC pair.
Footnote 1: https://chenpaopao.github.io/chenpaopao/Cyclegan/index.html
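As a rough illustration of the differentiable augmentation idea mentioned in the abstract, the sketch below applies a random transformation to both real and generated spectrograms before they would reach a discriminator. The abstract does not say which augmentations the authors use, so the time shift and cutout here, the function names, the parameter values, and the 80 x 64 spectrogram shape are all assumptions made for illustration. The code is plain NumPy and only shows the transformations; in an actual training setup they would be written in an autodiff framework so that gradients can flow through them back to the generator.

```python
import numpy as np

def translate_time(spec, shift):
    """Shift a (n_mels, n_frames) spectrogram along time, zero-padding the gap."""
    out = np.zeros_like(spec)
    n_frames = spec.shape[1]
    if shift > 0:
        out[:, shift:] = spec[:, :n_frames - shift]
    elif shift < 0:
        out[:, :n_frames + shift] = spec[:, -shift:]
    else:
        out[:] = spec
    return out

def cutout(spec, top, left, height, width):
    """Zero out a rectangular time-frequency patch."""
    out = spec.copy()
    out[top:top + height, left:left + width] = 0.0
    return out

def augment(spec, rng, max_shift=8, patch=(16, 16)):
    """Hypothetical policy: random time shift followed by random cutout."""
    n_mels, n_frames = spec.shape
    shift = int(rng.integers(-max_shift, max_shift + 1))
    top = int(rng.integers(0, n_mels - patch[0] + 1))
    left = int(rng.integers(0, n_frames - patch[1] + 1))
    return cutout(translate_time(spec, shift), top, left, *patch)

rng = np.random.default_rng(0)
real = rng.standard_normal((80, 64))  # stand-in for a target-speaker mel-spectrogram
fake = rng.standard_normal((80, 64))  # stand-in for a generator output

# Both discriminator inputs pass through the augmentation, real and generated alike.
d_in_real = augment(real, rng)
d_in_fake = augment(fake, rng)
```

Augmenting both branches is the key design choice: if only real samples were augmented, the generator could learn to reproduce the augmentation artifacts themselves.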