Speech Dereverberation with a Reverberation Time Shortening Target

Rui Zhou, Wenye Zhu, Xiaofei Li
2023-06-06
Abstract: This work proposes a new learning target based on reverberation time shortening (RTS) for speech dereverberation. The learning target for dereverberation is usually set as the direct-path speech, optionally with some early reflections. This type of target abruptly truncates the reverberation, and thus may not be suitable for network training. The proposed RTS target suppresses reverberation while maintaining the exponentially decaying property of reverberation, which eases network training and thus reduces the signal distortion caused by prediction error. Moreover, this work experimentally studies how to adapt our previously proposed FullSubNet speech denoising network to speech dereverberation. Experiments show that RTS is a more suitable learning target than direct-path speech and early reflections, in terms of better suppressing reverberation and reducing signal distortion. FullSubNet is able to achieve outstanding dereverberation performance.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
The paper addresses speech dereverberation, in particular the single-channel case. Severe late reverberation can significantly degrade the quality and intelligibility of speech and can reduce the performance of downstream tasks such as automatic speech recognition (ASR). Traditional dereverberation methods are based on statistical models and signal processing algorithms, while in recent years deep neural networks (DNNs) have made significant progress on this problem. The paper proposes a new learning target for speech dereverberation, the reverberation time shortening (RTS) target. Existing methods typically use the direct-path speech, optionally with some early reflections, as the learning target, but this abrupt truncation of the reverberation may not be suitable for network training and may lead to large prediction errors and signal distortion. In contrast, the proposed RTS target suppresses reverberation while preserving its exponential decay, which eases network training and reduces the signal distortion caused by prediction errors. Additionally, the authors experimentally adapt their previously proposed FullSubNet speech denoising network to the dereverberation task. Experimental results show that, compared with direct-path speech and early reflections, RTS is a more suitable learning target because it better suppresses reverberation and reduces signal distortion; the adapted FullSubNet also achieves excellent dereverberation performance.
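To make the idea of shortening reverberation while preserving exponential decay concrete, the following is a minimal sketch, not the authors' implementation. It assumes the room impulse response (RIR) envelope decays as exp(-delta * n), that the direct path sits at the peak magnitude of the RIR, and that the original T60 is known; the function names `decay_rate` and `shorten_rir` and the T60 values used below are hypothetical choices for illustration. The exact formulation in the paper may differ.

```python
import numpy as np

def decay_rate(t60, fs):
    """Per-sample decay rate of an amplitude envelope exp(-delta * n)
    that drops by 60 dB after t60 seconds at sample rate fs."""
    return 3.0 * np.log(10.0) / (t60 * fs)

def shorten_rir(rir, fs, t60_orig, t60_target):
    """Illustrative RTS-style target RIR (assumed form, not from the paper).

    Samples up to the direct path are kept unchanged; later samples are
    multiplied by an extra exponential decay so that the overall decay
    matches t60_target instead of t60_orig.
    """
    rir = np.asarray(rir, dtype=float)
    # Assumption: the direct path is the largest-magnitude sample.
    n_direct = int(np.argmax(np.abs(rir)))
    # Additional decay needed on top of the room's own decay.
    extra = max(decay_rate(t60_target, fs) - decay_rate(t60_orig, fs), 0.0)
    n = np.arange(len(rir))
    window = np.ones(len(rir))
    tail = n > n_direct
    window[tail] = np.exp(-extra * (n[tail] - n_direct))
    return rir * window

# Usage with a synthetic, purely exponential RIR envelope (illustrative values).
fs = 16000
n = np.arange(fs)
rir = np.exp(-decay_rate(0.8, fs) * n)   # original T60 = 0.8 s
rts_rir = shorten_rir(rir, fs, t60_orig=0.8, t60_target=0.15)
```

Under these assumptions, the training target would be the clean speech convolved with `rts_rir` rather than with a truncated RIR, so the target still decays smoothly instead of ending abruptly after the direct path or early reflections.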