Abstract: Speech restoration aims at restoring high quality speech in the presence of a diverse set of distortions. Although several deep learning paradigms have been studied for this task, the power of the recently emerging language models has not been fully explored. In this paper, we propose MaskSR, a masked language model capable of restoring full-band 44.1 kHz speech jointly considering noise, reverb, clipping, and low bandwidth. MaskSR works with discrete acoustic tokens extracted using a pre-trained neural codec. During training, MaskSR is optimized to predict randomly masked tokens extracted from the high quality target speech, conditioned on the corrupted speech with various distortions. During inference, MaskSR reconstructs the target speech tokens with efficient iterative sampling. Extensive experiments show that MaskSR obtains competitive results on both the full-band speech restoration task and also on sub-tasks compared with a wide range of models.

What problem does this paper attempt to address?

The problem this paper addresses is restoring high-quality full-band (44.1 kHz) speech in the presence of multiple distortions. Specifically, it proposes a masked language model named MaskSR that jointly handles noise, reverberation, clipping, and low bandwidth. Compared with traditional denoising and dereverberation methods, MaskSR not only handles these distortions but also covers generative tasks such as bandwidth extension and packet-loss concealment.

### Main contributions:

1. **MaskSR model**: MaskSR is a masked language model that restores high-quality full-band speech in the presence of multiple distortions.
2. **Joint handling of multiple distortions**: MaskSR handles noise, reverberation, clipping, and low bandwidth simultaneously, rather than a single type of distortion.
3. **Efficient inference mechanism**: MaskSR reconstructs the target speech at inference time through iterative sampling.
4. **Experimental verification**: Extensive experiments compare MaskSR with a variety of existing models on the full-band speech restoration task and on its sub-tasks, where it obtains competitive results.

### Technical details:

- **Neural audio codec**: A pre-trained Descript Audio Codec (DAC) converts the high-quality target speech signal into discrete acoustic tokens.
- **Speech encoder**: Encodes the corrupted speech signal and extracts the features on which the model is conditioned.
- **Masked language model**: Some of the target acoustic tokens are randomly masked, and the model is trained to predict the masked tokens.
- **Inference process**: At inference time, the target speech tokens are generated gradually through iterative sampling.
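
The masking and iterative-sampling steps above can be illustrated with a minimal sketch. It is not the paper's implementation: the summary only states that tokens are randomly masked during training and sampled iteratively at inference, so the confidence-based re-masking rule, the cosine schedule, the single token stream, the sentinel `MASK_ID`, and the names `random_mask`, `iterative_decode`, and `predict_fn` are all assumptions made for illustration.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel marking a masked position


def random_mask(tokens, mask_ratio, rng):
    """Training-side step: hide a random subset of the target tokens.

    Returns the masked copy and a boolean array marking the hidden positions,
    which are the positions a model would be trained to predict.
    """
    n = tokens.size
    num_mask = max(1, int(round(mask_ratio * n)))
    idx = rng.choice(n, size=num_mask, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    masked = tokens.copy()
    masked[mask] = MASK_ID
    return masked, mask


def iterative_decode(predict_fn, length, steps, rng):
    """Inference-side step: start fully masked and fill tokens in over `steps` rounds.

    `predict_fn(tokens)` is a stand-in for the model (assumed to already be
    conditioned on the corrupted speech); it must return an array of shape
    (length, vocab_size) holding per-position token probabilities.
    """
    tokens = np.full(length, MASK_ID, dtype=np.int64)
    for step in range(steps):
        masked = tokens == MASK_ID
        if not masked.any():
            break
        probs = predict_fn(tokens)
        # Sample one candidate token per position and record its probability.
        sampled = np.array([rng.choice(probs.shape[1], p=p) for p in probs])
        conf = probs[np.arange(length), sampled]
        # Assumed cosine schedule: number of positions left masked after this round.
        num_remask = int(np.floor(np.cos(np.pi / 2 * (step + 1) / steps) * length))
        # Fill every masked position; already-decoded positions are kept as they are.
        tokens = np.where(masked, sampled, tokens)
        # Only positions filled in this round may be re-masked.
        conf = np.where(masked, conf, np.inf)
        if num_remask > 0:
            tokens[np.argsort(conf)[:num_remask]] = MASK_ID
    return tokens
```

With a toy `predict_fn` that always assigns probability 1 to a fixed target sequence, `iterative_decode` recovers that sequence after the final round, since the schedule leaves no position masked at the last step. A real codec may emit several codebooks per frame; the single token stream here is a simplification.
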
### Experimental results:

- **Full-band speech restoration**: MaskSR performs well on the full-band speech restoration task, especially on bandwidth extension.
- **Multi-task performance**: On the test set containing multiple distortions, MaskSR also achieves competitive results.
- **Subjective evaluation**: In expert listening evaluations, MaskSR is rated significantly higher than the other systems in overall speech quality.

### Conclusion:

MaskSR provides a framework that restores high-quality full-band speech in the presence of multiple distortions. The approach is novel and performs strongly in the reported experiments. Future work will further improve the quality and intelligibility of the generated speech.

MaskSR: Masked Language Model for Full-band Speech Restoration

Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration

Masking and Inpainting: A Two-Stage Speech Enhancement Approach for Low SNR and Non-Stationary Noise

Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training

Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

Masking-based Neural Beamformer for Multichannel Speech Enhancement

MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization

LM-VC: Zero-Shot Voice Conversion via Speech Generation Based on Language Models

A Masked Segmental Language Model for Unsupervised Natural Language Segmentation

Masks Fusion with Multi-Target Learning For Speech Enhancement

CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction

Bridging the Gap Between Monaural Speech Enhancement and Recognition With Distortion-Independent Acoustic Modeling

Progressive Multi-Target Network Based Speech Enhancement with Snr-Preselection for Robust Speaker Diarization

DM: Dual-path Magnitude Network for General Speech Restoration

A Mask Free Neural Network for Monaural Speech Enhancement

Spectral oversubtraction? An approach for speech enhancement after robot ego speech filtering in semi-real-time

SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling