Speech Intelligibility Enhancement By Non-Parallel Speech Style Conversion Using CWT and iMetricGAN Based CycleGAN

Jing Xiao, Jiaqi Liu, Dengshi Li, Lanxin Zhao, Qianrui Wang
DOI: https://doi.org/10.1007/978-3-030-98358-1_43
2022-01-01
Abstract: Speech intelligibility enhancement is a perceptual enhancement technique for clean speech reproduced in noisy environments. Many studies enhance speech intelligibility by speaking style conversion (SSC), which relies solely on the Lombard effect and does not work well under strong noise interference. They also model the conversion of fundamental frequency (F0) with a straightforward linear transform and map only a few dimensions of Mel-cepstral coefficients (MCEPs). As F0 and MCEPs are critical aspects of hierarchical intonation, we believe that adequate modeling of these features is essential. In this paper, we explore the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech at different time resolutions for effective F0 conversion, and we express MCEPs with 20 dimensions, rather than the baseline 10 dimensions, for MCEP conversion. We utilize an iMetricGAN network to optimize speech intelligibility metrics in strong noise. Experimental results show that the proposed Non-Parallel Speech Style Conversion using CWT and iMetricGAN based CycleGAN (NS-CiC) method outperforms the baselines and significantly increases speech intelligibility in strong noise environments in both objective and subjective evaluations.
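To make the F0 decomposition step concrete, the following is a minimal sketch of a ten-scale CWT applied to a log-F0 contour. The abstract states only that F0 is decomposed into ten temporal scales; the Mexican hat wavelet, the dyadic scale spacing, the base scale of two frames, the normalization, and the function names `mexican_hat` and `cwt_f0` are all assumptions made for illustration and may differ from the authors' implementation. The input is assumed to be a continuous contour, with unvoiced frames already interpolated.

```python
import numpy as np

def mexican_hat(t):
    # Mexican hat (Ricker) mother wavelet; the wavelet choice is an assumption.
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def cwt_f0(log_f0, num_scales=10, base_scale=2.0):
    """Decompose a continuous log-F0 contour into num_scales temporal scales.

    Scales are dyadic (base_scale * 2**i frames); this spacing is an
    assumption, not something the abstract specifies.
    """
    # Normalize to zero mean and unit variance before the transform.
    x = (log_f0 - log_f0.mean()) / (log_f0.std() + 1e-8)
    n = len(x)
    coeffs = np.zeros((num_scales, n))
    for i in range(num_scales):
        s = base_scale * 2.0 ** i
        # Truncate the kernel so it is never longer than the signal.
        half = int(min(5 * s, (n - 1) // 2))
        t = np.arange(-half, half + 1) / s
        kernel = mexican_hat(t) / np.sqrt(s)
        coeffs[i] = np.convolve(x, kernel, mode="same")
    return coeffs

# Synthetic contour: 200 frames around 120 Hz with a slow oscillation.
frames = np.arange(200)
log_f0 = np.log(120.0) + 0.1 * np.sin(2 * np.pi * frames / 50.0)
coeffs = cwt_f0(log_f0)  # one row per temporal scale, one column per frame
```

Each row of `coeffs` describes the contour at a different time resolution: small scales follow fast, local pitch movement, while large scales follow slower, phrase-level trends. A conversion model could then map these rows instead of a single linearly transformed F0 track.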