Abstract: Existing deep learning based speech enhancement methods mainly employ a data-driven approach, which leverages large amounts of data with a variety of noise types to remove noise from the noisy signal. However, the high dependence on data limits generalization to unseen, complex noises in real-life environments. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal, where we generate clean speech instead of identifying and removing noises. Specifically, we propose a conditional generative framework for speech enhancement, which models clean speech by the acoustic codes of a neural speech codec and generates the speech codes conditioned on past noisy frames in an auto-regressive way. Moreover, we propose an explicit-alignment approach to align noisy frames with the generated speech tokens to improve the robustness and scalability to different input lengths. Different from other methods that leverage multiple stages to generate speech codes, we leverage a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality with low latency. Extensive results on both synthetic and real-recorded test sets show its superiority over data-driven approaches in terms of noise robustness and temporal speech coherence.

What problem does this paper attempt to address?

This paper addresses speech enhancement in low-latency scenarios. Existing deep-learning-based speech enhancement methods are mainly data-driven: they learn to denoise from large amounts of audio covering many noise types. This strong dependence on training data leads to poor generalization to the complex, unseen noises found in real-life environments. To address this, the authors propose a conditional generative framework that treats speech enhancement as the task of generating clean speech conditioned on the noisy signal. Rather than identifying and removing noise, the method generates clean speech directly. Its main features are:

1. **Acoustic encoding based on a neural speech codec**: A pre-trained TF-Codec encodes clean speech into discrete acoustic codes.
2. **Autoregressive generation model**: An autoregressive Transformer decoder generates the acoustic code of the current frame conditioned on the noisy frames seen so far and the previously generated codes.
3. **Explicit alignment scheme**: Explicitly aligning the noisy features with the clean speech codes to be generated improves the robustness of the model and its adaptability to different input lengths.
4. **Single-stage causal speech generation**: A single-stage generation method reduces latency while preserving high speech quality.

The experimental results show that this method outperforms traditional data-driven methods on both synthetic and real-recorded test sets, especially in terms of noise robustness and temporal speech coherence. Ablation experiments also verify the effectiveness of the explicit alignment scheme and its scalability to long sequences.
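
The explicit alignment scheme in item 3 can be illustrated with a minimal Python sketch. It assumes (the summary above does not confirm this) that alignment means pairing each noisy frame with the embedding of the previous clean code at the same time step, so the decoder input has exactly one entry per frame whatever the utterance length. The names `align_step_inputs`, `code_embedding`, and `start_vector` are hypothetical, and the small vectors are illustrative values only.

```python
def align_step_inputs(noisy_frames, clean_codes, code_embedding, start_vector):
    """Build one decoder input per frame (teacher-forcing layout).

    Assumption for illustration: step t sees the noisy feature x_t
    concatenated with the embedding of the previous clean code c_{t-1}.
    The first step has no previous code, so it uses start_vector.
    """
    if len(noisy_frames) != len(clean_codes):
        raise ValueError("explicit alignment needs one clean code per noisy frame")
    steps = []
    for t, frame in enumerate(noisy_frames):
        if t == 0:
            prev = start_vector
        else:
            prev = code_embedding[clean_codes[t - 1]]
        # Concatenate the frame-synchronous noisy feature and previous-code embedding
        steps.append(list(frame) + list(prev))
    return steps


# Toy data: 3 frames of 2-dim noisy features and 3 clean codes (hypothetical values)
noisy_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
clean_codes = [2, 0, 1]
code_embedding = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
start_vector = [0.0, 0.0]

aligned = align_step_inputs(noisy_frames, clean_codes, code_embedding, start_vector)
```

Under this assumed layout the input sequence has the same length as the number of frames, and the code of the final frame is only ever a prediction target, never an input.
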
### Formula summary

- Conditional generation modeling formula for low-latency speech enhancement tasks:

  \[ P(Y|X)=\prod_{t = 1}^{T}p(y_t|y_{<t},x_{\leq t}) \]

- Formula for generation problems based on acoustic encoding:

  \[ P(C|N)=\prod_{t = 1}^{T}p(C_t|C_{<t},N_{\leq t}) \]

These formulas show how to transform the speech enhancement task into a conditional generation problem of generating clean speech based on noisy signals and gradually generate the acoustic code for each time step in an autoregressive manner.
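
The factorization over acoustic codes can be made concrete with a small Python sketch. It is not the paper's model: `toy_conditional` is a hypothetical stand-in for the Transformer decoder's distribution p(C_t | C_{<t}, N_{≤t}), and each noisy frame is reduced to a single number purely for illustration. The sketch shows two things that follow from the formula: the sequence log-probability is a sum of per-step log-probabilities, and step t reads only the noisy frames up to t.

```python
import math


def toy_conditional(prev_codes, noisy_so_far, vocab_size):
    """Hypothetical stand-in for p(C_t | C_<t, N_<=t).

    Scores each code by its closeness to the latest noisy value and to the
    previous code, then normalizes with a softmax.
    """
    last_noisy = noisy_so_far[-1]
    prev = prev_codes[-1] if prev_codes else 0
    logits = [-abs(k - last_noisy) - 0.1 * abs(k - prev) for k in range(vocab_size)]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def sequence_log_prob(codes, noisy, vocab_size):
    """log P(C|N) = sum over t of log p(C_t | C_<t, N_<=t)."""
    total = 0.0
    for t in range(len(codes)):
        # Only codes before t and noisy frames up to and including t are visible
        probs = toy_conditional(codes[:t], noisy[: t + 1], vocab_size)
        total += math.log(probs[codes[t]])
    return total


def generate_greedy(noisy, vocab_size):
    """Emit one code per incoming noisy frame, picking the most likely code."""
    codes = []
    for t in range(len(noisy)):
        probs = toy_conditional(codes, noisy[: t + 1], vocab_size)
        codes.append(max(range(vocab_size), key=probs.__getitem__))
    return codes


noisy = [1.2, 2.9, 0.1]  # illustrative scalar "frames"
codes = generate_greedy(noisy, vocab_size=4)
log_p = sequence_log_prob(codes, noisy, vocab_size=4)
```

Because each step slices the noisy input at `t + 1`, appending future frames to `noisy` cannot change codes that were already emitted, which is the property a low-latency, causal system needs.
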

Low-latency Speech Enhancement via Speech Token Generation
