MaskSR: Masked Language Model for Full-band Speech Restoration

Xu Li, Qirui Wang, Xiaoyu Liu
2024-06-04
Abstract: Speech restoration aims at restoring high quality speech in the presence of a diverse set of distortions. Although several deep learning paradigms have been studied for this task, the power of the recently emerging language models has not been fully explored. In this paper, we propose MaskSR, a masked language model capable of restoring full-band 44.1 kHz speech jointly considering noise, reverb, clipping, and low bandwidth. MaskSR works with discrete acoustic tokens extracted using a pre-trained neural codec. During training, MaskSR is optimized to predict randomly masked tokens extracted from the high quality target speech, conditioned on the corrupted speech with various distortions. During inference, MaskSR reconstructs the target speech tokens with efficient iterative sampling. Extensive experiments show that MaskSR obtains competitive results on both the full-band speech restoration task and also on sub-tasks compared with a wide range of models.
Sound, Artificial Intelligence, Machine Learning, Audio and Speech Processing, Signal Processing
What problem does this paper attempt to address?
The paper addresses the restoration of high-quality full-band (44.1 kHz) speech from recordings degraded by multiple distortions. It proposes MaskSR, a masked language model that jointly handles noise, reverberation, clipping, and low bandwidth. Unlike traditional denoising and dereverberation methods, MaskSR also handles generative tasks such as bandwidth extension and packet-loss concealment.

### Main contributions:

1. **Propose MaskSR model**: MaskSR is a masked language model that restores high-quality full-band speech in the presence of multiple distortions.
2. **Jointly handle multiple distortions**: MaskSR handles noise, reverberation, clipping, and low bandwidth simultaneously, rather than a single type of distortion.
3. **Efficient inference mechanism**: Through iterative sampling, MaskSR efficiently reconstructs the target speech at inference time.
4. **Experimental verification**: Extensive experiments compare MaskSR with a variety of existing models and show that it is competitive on the full-band speech restoration task and on its sub-tasks.

### Technical details:

- **Neural audio encoder**: The pre-trained Descript Audio Codec (DAC) converts the high-quality target speech signal into discrete acoustic tokens.
- **Speech encoder**: Encodes the corrupted speech signal and extracts its features, which condition the model.
- **Masked language model**: Some of the acoustic tokens are randomly masked, and the model is trained to predict the masked tokens.
- **Inference process**: At inference time, the target speech tokens are generated step by step through iterative sampling.
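
The masking and iterative sampling steps listed under the technical details can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a single token stream (a neural codec typically produces several codebooks), a cosine unmasking schedule with confidence-based commits in the style of MaskGIT (the summary does not specify the schedule), and a hypothetical `make_toy_predictor` that stands in for the trained model conditioned on corrupted speech. The names `MASK`, `random_mask`, and `iterative_decode` are also illustrative.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a masked position


def random_mask(tokens, mask_ratio, rng):
    """Training side: hide a random subset of the target tokens."""
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in masked_idx else t for i, t in enumerate(tokens)]
    return masked, sorted(masked_idx)


def iterative_decode(predict, length, steps):
    """Inference side: start fully masked and commit tokens over `steps` rounds."""
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked_positions = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked_positions:
            break
        # Assumed cosine schedule: how many positions stay masked after this step.
        keep_masked = math.floor(length * math.cos(math.pi / 2 * (step / steps)))
        n_commit = max(1, len(masked_positions) - keep_masked)
        guesses = predict(tokens)  # {position: (token, confidence)}
        # Commit the most confident predictions first; the rest stay masked.
        ranked = sorted(masked_positions, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


def make_toy_predictor(target, rng):
    """Hypothetical stand-in for the trained model.

    A real model would predict tokens from the corrupted speech features;
    this toy returns the true token with a random confidence score.
    """
    def predict(tokens):
        return {i: (target[i], rng.random())
                for i, t in enumerate(tokens) if t == MASK}
    return predict


rng = random.Random(0)
target = [3, 1, 4, 1, 5, 9, 2, 6]  # toy acoustic token ids
masked, masked_idx = random_mask(target, mask_ratio=0.5, rng=rng)
restored = iterative_decode(make_toy_predictor(target, rng), len(target), steps=4)
```

With the toy predictor every committed token is correct by construction, so `restored` equals `target`. The point of the sketch is the control flow: only a few high-confidence tokens are committed in early steps, and the final step fills whatever remains.
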
### Experimental results:

- **Full-band speech restoration**: MaskSR performs strongly on the full-band speech restoration task, especially on bandwidth extension.
- **Multi-task performance**: On the test set containing multiple distortions, MaskSR also achieves competitive results.
- **Subjective evaluation**: In an expert listening evaluation, MaskSR is rated significantly higher than the other systems on overall speech quality.

### Conclusion:

MaskSR provides a framework for restoring high-quality full-band speech in the presence of multiple distortions. The model is not only technically innovative but also performs strongly in the reported experiments. Future work will further improve the quality and intelligibility of the generated speech.