Dual-path transformer-based network with equalization-generation components prediction for flexible vibrational sensor speech enhancement in the time domain

Changyan Zheng,Liguo Xu,Xiaohu Fan,Jibin Yang,Junyi Fan,Xian Huang
DOI: https://doi.org/10.1121/10.0010316
Abstract:The flexible vibrational sensor (FVS) has the potential to become a popular wearable communication device because of its natural noise shielding characteristics and soft materials. However, FVS speech faces a severe loss of frequency components. To improve speech quality, a time-domain neural network model based on the dual-path transformer combined with equalization-generation components prediction (DPT-EGNet) is proposed. More specifically, the DPT-EGNet consists of five modules, namely the pre-processing module, dual-path transformer module, equalization module, generation module, and post-processing module. The dual-path transformer module is leveraged to extract the local and global contextual relationship of long-term speech sequences, which is extremely beneficial for inferring the missing components. The equalization and generation modules are designed according to the characteristics of FVS speech, which further improve the speech quality by simulating the inversion process of the speech distortion. The experimental results demonstrate that the proposed model effectively improves the quality of FVS speech; the average perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and composite measure for overall speech quality (COVL) scores of three males and three females are relatively increased by 64.19%, 29.63%, and 101.37%, which is superior to other baseline models developed in different domains. The proposed model also has significantly lower complexity than the others.
What problem does this paper attempt to address?