Abstract: In this work, we further develop the conformer-based metric generative adversarial network (CMGAN) model for speech enhancement (SE) in the time-frequency (TF) domain. This paper builds on our previous work but takes a more in-depth look by conducting extensive ablation studies on model inputs and architectural design choices. We rigorously tested the generalization ability of the model to unseen noise types and distortions. We have fortified our claims through DNS-MOS measurements and listening tests. Rather than focusing exclusively on the speech denoising task, we extend this work to address the dereverberation and super-resolution tasks. This necessitated exploring various architectural changes, specifically metric discriminator scores and masking techniques. It is essential to highlight that this is among the earliest works that attempted complex TF-domain super-resolution. Our findings show that CMGAN outperforms existing state-of-the-art methods in the three major speech enhancement tasks: denoising, dereverberation, and super-resolution. For example, in the denoising task using the Voice Bank+DEMAND dataset, CMGAN notably exceeded the performance of prior models, attaining a PESQ score of 3.41 and an SSNR of 11.10 dB. Audio samples and CMGAN implementations are available online.
What problem does this paper attempt to address?
This paper aims to improve speech quality in monaural speech enhancement. Specifically, the authors propose a conformer-based metric generative adversarial network (CMGAN) for speech enhancement in the time-frequency (TF) domain. The paper focuses on three speech enhancement tasks: denoising, dereverberation, and super-resolution. These tasks matter in practice because they directly affect the performance of automatic speech recognition (ASR), telecommunication systems, and hearing aids.
### Specific problems solved by the paper:
1. **Denoising**:
- The goal is to suppress the background noise \( n(t) \) and predict the desired speech \( \hat{x}(t) \) with the highest possible quality and intelligibility.
- The challenge lies in the nature of the background noise, especially non-stationary noise, which may occupy frequency bands similar to those of the desired speech.
2. **Dereverberation**:
- The goal is to suppress unwanted reflections and maintain the direct path representing the desired speech.
- The challenge is that the reflections depend on factors such as room size, surface characteristics, and the distance between the microphone and the speaker.
3. **Super-resolution**:
- The goal is to reconstruct missing samples from an input signal with a low sampling frequency.
- The challenge lies in recovering high-frequency components that are absent from the band-limited input, which is difficult for traditional methods.
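The three degradations above can be sketched with a few lines of NumPy. This is only an illustrative simulation, not the paper's data pipeline: the helper names (`add_noise`, `add_reverb`, `band_limit`), the random stand-in signals, and the toy room impulse response are all assumptions made for the example.

```python
import numpy as np

def add_noise(x, n, snr_db):
    """Denoising input: y = x + g * n, with g chosen to hit the target SNR."""
    n = n[: len(x)]
    gain = np.sqrt(np.sum(x**2) / (np.sum(n**2) * 10 ** (snr_db / 10)))
    return x + gain * n

def add_reverb(x, rir):
    """Dereverberation input: clean speech convolved with a room impulse response."""
    return np.convolve(x, rir)[: len(x)]

def band_limit(x, factor):
    """Super-resolution input: keep only the lowest 1/factor of the spectrum.

    The output keeps the original length, so it mimics a low-rate signal
    that was upsampled back to the target rate without new high-band content.
    """
    spec = np.fft.rfft(x)
    cutoff = len(spec) // factor
    spec[cutoff:] = 0.0
    return np.fft.irfft(spec, n=len(x))

rng = np.random.default_rng(0)
fs = 16000                                   # assumed sampling rate
x = rng.standard_normal(fs)                  # stand-in for one second of clean speech
n = rng.standard_normal(fs)                  # stand-in for background noise
rir = np.exp(-np.arange(2000) / 300.0) * rng.standard_normal(2000)  # toy decaying RIR
rir[0] = 1.0                                 # direct path

noisy = add_noise(x, n, snr_db=5.0)
reverberant = add_reverb(x, rir)
low_band = band_limit(x, factor=2)
```

In each case the enhancement model receives the degraded signal (`noisy`, `reverberant`, or `low_band`) and is asked to estimate `x`.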
### Main contributions of the paper:
1. **Proposing a new generator architecture**:
- The generator uses a shared encoder to process the concatenated magnitude and complex (real and imaginary) components of the spectrogram.
- The generator then splits into a dedicated mask decoder and a complex decoder, which refine the learned representations of the magnitude and the complex components, respectively.
2. **Adopting a two-stage Conformer block**:
- Conformer blocks capture dependencies along both the time and frequency axes, combining the strengths of Transformers and convolutional neural networks (CNNs).
3. **Integrating a metric discriminator**:
- The metric discriminator supplements the point-wise loss functions with a perceptually motivated training signal, thereby improving the final speech quality.
4. **Extensive experimental verification**:
- DNS-MOS measurements and listening tests verify the model's ability to generalize to unseen noise types and distortions.
- Experiments on the denoising, dereverberation, and super-resolution tasks across multiple datasets show that CMGAN outperforms existing state-of-the-art methods.
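As a rough illustration of the generator input described in the first contribution, the sketch below stacks the magnitude, real, and imaginary parts of an STFT as three channels. The STFT parameters, the function names, and in particular the way `combine_outputs` merges a magnitude mask with a complex refinement are assumptions chosen for the example; the summary above does not specify them.

```python
import numpy as np

def stft(x, n_fft=400, hop=100):
    """Plain Hann-windowed STFT; returns a complex array of shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

def generator_input(spec):
    """Stack magnitude, real and imaginary parts as three input channels."""
    return np.stack([np.abs(spec), spec.real, spec.imag], axis=0)

def combine_outputs(noisy_spec, mask, complex_residual):
    """Assumed merge: masked magnitude with the noisy phase, plus a complex refinement."""
    masked = mask * np.abs(noisy_spec) * np.exp(1j * np.angle(noisy_spec))
    return masked + complex_residual

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)          # stand-in for a noisy utterance
spec = stft(signal)
features = generator_input(spec)             # shape: (3, frames, bins)

# With an all-ones mask and a zero residual the merge returns the input unchanged.
identity = combine_outputs(spec, np.ones(spec.shape), np.zeros(spec.shape))
```

A real generator would replace the all-ones mask and zero residual with the outputs of its two decoder branches.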
### Experimental results:
- **Denoising task**: On the Voice Bank+DEMAND dataset, CMGAN attains a PESQ score of 3.41 and an SSNR of 11.10 dB, notably exceeding prior models.
- **Dereverberation task**: The effectiveness of CMGAN is verified through multiple objective scoring metrics and comparative analysis.
- **Super-resolution task**: Complex TF-domain super-resolution is explored. Introducing a reconstruction mask lets the network focus on estimating the missing high-frequency bands.
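The SSNR reported for the denoising task is a segmental signal-to-noise ratio: the SNR is computed per frame, clipped to a fixed range, and averaged. The sketch below shows one common formulation. The non-overlapping 512-sample frames and the [-10, 35] dB clipping range are assumptions for illustration and may differ from the evaluation script behind the reported numbers.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=512, lo=-10.0, hi=35.0, eps=1e-10):
    """Mean of per-frame SNRs in dB, each clipped to the range [lo, hi]."""
    n_frames = len(clean) // frame_len
    c = clean[: n_frames * frame_len].reshape(n_frames, frame_len)
    e = enhanced[: n_frames * frame_len].reshape(n_frames, frame_len)
    signal_energy = np.sum(c**2, axis=1)
    error_energy = np.sum((c - e) ** 2, axis=1)
    snr_db = 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))
    return float(np.mean(np.clip(snr_db, lo, hi)))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)           # stand-in for clean speech
scaled = 1.1 * clean                         # error is 10% of the amplitude

ssnr_scaled = segmental_snr(clean, scaled)   # close to 20 dB in every frame
ssnr_perfect = segmental_snr(clean, clean)   # saturates at the upper clip value
```

Clipping keeps silent or perfectly reconstructed frames from dominating the average.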
In conclusion, the paper proposes the CMGAN model to address key problems in monaural speech enhancement and reports performance improvements over prior methods across the denoising, dereverberation, and super-resolution tasks.