Low-latency Speech Enhancement via Speech Token Generation

Huaying Xue, Xiulian Peng, Yan Lu
2024-01-23
Abstract: Existing deep learning based speech enhancement methods mainly employ a data-driven approach, which leverages large amounts of data with a variety of noise types to remove noise from the noisy signal. However, the high dependence on data limits their generalization to unseen complex noises in real-life environments. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal, where we generate clean speech instead of identifying and removing noises. Specifically, we propose a conditional generative framework for speech enhancement, which models clean speech by acoustic codes of a neural speech codec and generates the speech codes conditioned on past noisy frames in an auto-regressive way. Moreover, we propose an explicit-alignment approach to align noisy frames with the generated speech tokens to improve the robustness and scalability to different input lengths. Different from other methods that leverage multiple stages to generate speech codes, we leverage a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality with low latency. Extensive results on both synthetic and real-recorded test sets show its superiority over data-driven approaches in terms of noise robustness and temporal speech coherence.
Sound, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses speech enhancement in low-latency scenarios. Existing deep-learning-based speech enhancement methods mainly rely on a data-driven approach, learning to denoise from large amounts of audio data covering different noise types. This heavy dependence on data leads to poor generalization when the model meets complex, unseen noises in real-life environments.

To solve this problem, the authors propose a conditional generative framework that treats speech enhancement as the task of generating clean speech conditioned on the noisy signal. Unlike traditional methods, it does not identify and remove noise but generates clean speech directly. The main features of the method are:

1. **Acoustic encoding based on a neural speech codec**: The pre-trained TF-Codec encodes clean speech into discrete acoustic codes.
2. **Autoregressive generation model**: An autoregressive Transformer decoder generates the acoustic code of the current frame from the past noisy frames.
3. **Explicit alignment scheme**: Noisy features are explicitly aligned with the clean speech codes to be generated, which improves the robustness of the model and its adaptability to different input lengths.
4. **Single-stage causal speech generation**: A single-stage generation method reduces latency while maintaining high speech quality.

The experimental results show that this method outperforms data-driven methods on both synthetic and real-recorded test sets, especially in terms of noise robustness and temporal speech coherence. Ablation experiments also verify the effectiveness of the explicit alignment scheme and its scalability to long sequences.
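The frame-by-frame generation described above can be sketched as a streaming loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_step` is a hypothetical stand-in for the Transformer decoder, the codebook size of 8 is arbitrary, and explicit alignment is read here simply as pairing noisy frame t with generated code t.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # one noisy feature frame (hypothetical representation)
StepFn = Callable[[List[int], List[Frame]], int]


def toy_step(past_codes: List[int], noisy_so_far: List[Frame]) -> int:
    """Hypothetical stand-in for the autoregressive decoder.

    Maps the previous codes and the noisy frames seen so far to one code
    index in [0, 8). A real model would instead output a distribution
    over the codec's codebook entries.
    """
    prev = past_codes[-1] if past_codes else 0
    energy = sum(abs(v) for v in noisy_so_far[-1])
    return (prev + int(energy * 10)) % 8


def generate_streaming(noisy_frames: Sequence[Frame],
                       step: StepFn = toy_step) -> List[int]:
    """Emit one code per incoming noisy frame, never looking ahead."""
    codes: List[int] = []
    for t in range(len(noisy_frames)):
        # Assumed alignment: step t sees noisy frames 0..t and codes 0..t-1,
        # so noisy frame t is paired with generated code t.
        visible = list(noisy_frames[: t + 1])
        codes.append(step(codes, visible))
    return codes
```

Because each step only receives frames up to the current one, changing a later noisy frame cannot alter codes that were already emitted, which is the property a low-latency system needs.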
### Formula summary

- Conditional generation model for the low-latency speech enhancement task, where Y is the clean speech to be generated and X is the noisy input:

  \[ P(Y|X)=\prod_{t = 1}^{T}p(y_t|y_{<t},x_{\leq t}) \]

- The same model expressed over acoustic codes, where C_t is the acoustic code at frame t and N is the noisy input:

  \[ P(C|N)=\prod_{t = 1}^{T}p(C_t|C_{<t},N_{\leq t}) \]

These formulas recast speech enhancement as a conditional generation problem: clean speech is generated from the noisy signal, and the acoustic code for each time step is produced autoregressively from earlier codes and the noisy frames up to that step.
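The factorization over acoustic codes can be checked numerically on a toy example. The sketch below is an illustration only: `step_probs` is a hypothetical conditional distribution (a softmax over made-up logits), scalar values stand in for noisy frames, and the codebook size of 3 is chosen so that every code sequence can be enumerated.

```python
import math
from itertools import product
from typing import List, Sequence

VOCAB = 3  # hypothetical codebook size, tiny so all sequences can be enumerated


def step_probs(past_codes: Sequence[int],
               noisy_so_far: Sequence[float]) -> List[float]:
    """Hypothetical p(C_t | C_<t, N_<=t): softmax over toy logits."""
    context = sum(past_codes) + sum(noisy_so_far)
    logits = [math.sin(context + k) for k in range(VOCAB)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_log_prob(codes: Sequence[int], noisy: Sequence[float]) -> float:
    """log P(C|N) as the sum over t of log p(C_t | C_<t, N_<=t)."""
    log_p = 0.0
    for t, code in enumerate(codes):
        # Step t conditions on codes before t and noisy frames up to t.
        probs = step_probs(codes[:t], noisy[: t + 1])
        log_p += math.log(probs[code])
    return log_p


def total_probability(noisy: Sequence[float]) -> float:
    """Sum P(C|N) over every possible code sequence of length len(noisy)."""
    return sum(
        math.exp(sequence_log_prob(codes, noisy))
        for codes in product(range(VOCAB), repeat=len(noisy))
    )
```

Calling `total_probability([0.2, -0.5, 1.3])` sums over all 27 code sequences and should come out at 1 up to floating-point error, confirming that the product of per-step conditionals defines a valid distribution over sequences.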