ScaleFormer: Transformer-based Speech Enhancement in the Multi-Scale Time Domain

Tianci Wu, Shulin He, Hui Zhang, XueLiang Zhang
DOI: https://doi.org/10.1109/apsipaasc58517.2023.10317310
2023-01-01
Abstract: Processing speech at multiple temporal scales greatly improves the performance of automatic speech recognition, but its effect has not been fully exploited in speech enhancement tasks. In this study, we propose a novel Transformer-based neural network, termed ScaleFormer, which analyzes speech at multiple temporal resolutions. In ScaleFormer, an encoder employs multi-scale convolution to extract features at different temporal scales. An intra-scale transformer then extracts the representation within each scale, and an inter-scale transformer models the relationships among the scales using the intra-scale outputs. All transformer blocks in ScaleFormer are designed with a dual-path framework to learn both short-term and long-term dependencies. We conduct experiments on the WSJ0 SI-84 corpus. Experimental results show that our approach outperforms previous representative systems in terms of objective metrics.
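The multi-scale encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no kernel sizes, stride, or channel counts, so `kernel_sizes`, `stride`, and `n_filters` are assumed values, the function name `multi_scale_encode` is hypothetical, and random filters stand in for learned convolution weights. The intra-scale and inter-scale transformers are not modeled here.

```python
import math
import random


def multi_scale_encode(signal, kernel_sizes=(16, 32, 64), stride=8,
                       n_filters=4, seed=0):
    """Frame a waveform with several window lengths (one per temporal scale).

    All values here are illustrative assumptions, not taken from the paper.
    Every scale shares the same stride, so each scale yields the same
    number of frames and the scales stay aligned in time.
    """
    rng = random.Random(seed)
    n_frames = math.ceil(len(signal) / stride)
    features = {}
    for k in kernel_sizes:
        # Random filters stand in for learned 1-D convolution kernels.
        filters = [[rng.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(k)]
                   for _ in range(n_filters)]
        # Roughly center each window on its hop; zero-pad outside the signal.
        offset = (k - stride) // 2
        frames = []
        for t in range(n_frames):
            start = t * stride - offset
            window = [signal[i] if 0 <= i < len(signal) else 0.0
                      for i in range(start, start + k)]
            # Dot product with each filter, followed by a ReLU.
            frames.append([max(0.0, sum(w * x for w, x in zip(f, window)))
                           for f in filters])
        features[k] = frames
    return features


# Toy input: 800 samples of a 440 Hz tone at an assumed 16 kHz sampling rate.
signal = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(800)]
feats = multi_scale_encode(signal)
for k, frames in feats.items():
    print(k, len(frames), len(frames[0]))
```

With these assumed settings, each of the three scales produces 100 frames of 4 features. Keeping the frame count equal across scales is one simple way to let a later stage compare scales frame by frame; the paper may align its scales differently.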