Speech Super-Resolution Using Parallel WaveNet

Mu Wang, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia, Dan Su, Dong Yu, Helen Meng
DOI: https://doi.org/10.1109/iscslp.2018.8706637
2018-01-01
Abstract: Audio super-resolution is the task of increasing the sampling rate of a given low-resolution (i.e. low sampling rate) audio signal. One of the most popular approaches to audio super-resolution is to minimize the squared Euclidean distance between the reconstructed signal and the high sampling rate signal in a point-wise manner. However, such an approach has intrinsic limitations, such as the regression-to-the-mean problem. In this work, we introduce a novel auto-regressive method for the speech super-resolution task, which uses WaveNet to model the distribution of the target high-resolution signal conditioned on the log-scale mel-spectrogram of the low-resolution signal. As an auto-regressive neural network, WaveNet uses the negative log-likelihood as its objective function, which is much better suited than the Euclidean distance to highly stochastic processes such as speech waveforms. We also train a parallel WaveNet to speed up generation to real time. In the experiments, we perform speech super-resolution by increasing the sampling rate from 4 kHz to 16 kHz on the VCTK corpus. The proposed method achieves an improvement of ∼2 dB over the baseline deep residual convolutional neural network (CNN) under the Log-Spectral Distance (LSD) metric.
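The abstract reports results under the Log-Spectral Distance (LSD) metric without stating how it is computed. The sketch below shows one common formulation: the root-mean-square difference between log power spectra per frame, averaged over frames. The frame length, hop size, Hann window, base-10 logarithm, and the function names `stft_power` and `log_spectral_distance` are all assumptions made for illustration; the paper's exact settings and scaling may differ.

```python
import numpy as np


def stft_power(signal, frame_len=2048, hop=512):
    """Power spectrogram (frames x bins) from Hann-windowed frames.

    frame_len and hop are illustrative defaults, not values from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def log_spectral_distance(reference, estimate, eps=1e-10):
    """Assumed LSD form: per-frame RMS of the log10 power difference, averaged over frames."""
    ref_log = np.log10(stft_power(reference) + eps)
    est_log = np.log10(stft_power(estimate) + eps)
    per_frame = np.sqrt(np.mean((ref_log - est_log) ** 2, axis=1))
    return float(np.mean(per_frame))


# Usage: compare a noise signal with a half-amplitude copy of itself.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
lsd_same = log_spectral_distance(x, x)
lsd_scaled = log_spectral_distance(x, 0.5 * x)
```

Under this formulation, identical signals give a distance of 0, and halving the amplitude shifts every log-power bin by 2·log10(2), so `lsd_scaled` comes out at about 0.602. Lower values indicate a reconstruction whose spectrum is closer to the reference.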