Improving Monaural Speech Enhancement by Mapping to Fixed Simulation Space With Knowledge Distillation

Xinmeng Xu
DOI: https://doi.org/10.1109/lsp.2024.3355746
2024-02-02
IEEE Signal Processing Letters
Abstract: Monaural speech enhancement (SE) is a versatile and cost-effective approach that leverages recordings from a single microphone. However, it falls short of multi-channel SE due to the absence of spatial cues. These cues, present in multi-channel recordings, aid in distinguishing speech from noise more effectively. To bridge this gap, we introduce a method for mapping monaural speech into a fixed simulation space. Here, single-channel recordings are transformed into a predefined binaural format, enhancing the differentiation between target speech and noise components. This is achieved through knowledge distillation, enabling the monaural SE model to learn simulated binaural speech features from a pre-trained binaural SE model. It is important to note that we use a single type of binaural room impulse response and the monaural input of the student to simulate binaural speech. This way, our approach bypasses the paradox of generating virtual spatial information from monaural speech, while still benefiting from the spatial cues of binaural speech. Rigorous experiments demonstrate the effectiveness of our proposed method, showcasing its superior performance compared to recent monaural SE techniques in terms of PESQ and STOI scores.
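The two ideas in the abstract, simulating binaural speech from a monaural signal with one fixed binaural room impulse response (BRIR) and training a student to match a teacher's features, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the function names `simulate_binaural` and `distillation_loss` are hypothetical, the BRIR is random placeholder data rather than a measured response, and the mean-squared-error feature loss is a common distillation choice that the abstract does not confirm as the authors' loss.

```python
import numpy as np


def simulate_binaural(mono, brir):
    """Convolve a mono signal with a fixed two-channel BRIR.

    mono: array of shape (T,)
    brir: array of shape (2, K), one impulse response per ear
    Returns an array of shape (2, T + K - 1).
    """
    return np.stack([np.convolve(mono, brir[ch]) for ch in range(2)])


def distillation_loss(student_feat, teacher_feat):
    """Mean squared error between student and teacher features.

    Assumed loss for illustration; the abstract does not state the actual one.
    """
    return float(np.mean((student_feat - teacher_feat) ** 2))


rng = np.random.default_rng(0)
mono = rng.standard_normal(16000)            # placeholder audio, 1 s at 16 kHz
brir = rng.standard_normal((2, 256)) * 0.1   # placeholder BRIR, not measured data
binaural = simulate_binaural(mono, brir)     # simulated left and right channels
```

Because the same BRIR is applied to every utterance, the simulated binaural signal is a deterministic function of the monaural input, which is what allows a monaural student to target features computed by a binaural teacher on that signal.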
engineering, electrical & electronic