A Novel Feature-Fusion-Based Sparse Masked Attention Network for Acoustic Echo Cancellation Using Wavelet and STFT Synergies
V. Soni Ishwarya, Mohanaprasad Kothandaraman
DOI: https://doi.org/10.1007/s00034-024-02955-0
IF: 2.311
2024-12-18
Circuits Systems and Signal Processing
Abstract: Deep learning-based acoustic echo cancellation (AEC) systems have advanced significantly, yet previous methods often rely on a single transform, such as the short-time Fourier transform (STFT) or the constant-Q transform, which limits feature richness and leads to heavy models. In contrast, this paper introduces a novel feature-fusion-based encoder-decoder with a sparse masked attention network, specifically designed to enhance echo and background noise suppression. Our model combines discrete wavelet transform (DWT) and STFT features, leveraging both transforms to achieve a richer feature representation. By employing the Daubechies wavelet ("db4"), the model attains high time resolution for high frequencies and improved frequency resolution for low frequencies, which is crucial for effective noise cancellation. The STFT component captures temporal spectral content, complementing DWT's strengths. To handle double-talk scenarios, a sparse masked attention mechanism selectively focuses on relevant signal windows, reducing computational load while enhancing accuracy. This masked network enables a causal model suitable for real-time applications. Additionally, Smooth L1 loss promotes stable convergence during training. Experimental results on the AEC challenge dataset demonstrate that our model outperforms traditional methods, achieving higher echo return loss enhancement (ERLE), perceptual evaluation of speech quality (PESQ) scores, and correlation coefficient, validating its effectiveness in robust echo cancellation and speech quality improvement.
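Two ingredients named in the abstract, the Smooth L1 loss and a causal attention mask restricted to a window of past frames, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the `beta` threshold, and the fixed-length `window` rule are assumptions made for illustration, and the paper's actual sparsity pattern may differ.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss between two equal-length sequences.

    Quadratic for absolute errors below `beta`, linear above it.
    `beta` is an assumed hyperparameter; the abstract does not state one.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal in length")
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        if d < beta:
            total += 0.5 * d * d / beta   # quadratic region, small errors
        else:
            total += d - 0.5 * beta       # linear region, large errors
    return total / len(pred)


def causal_window_mask(n_frames, window):
    """Boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Frame i sees only itself and the previous `window - 1` frames, so the
    mask is both causal (no future frames) and sparse (a narrow band).
    The fixed window is a hypothetical choice, not the paper's exact rule.
    """
    return [[0 <= i - j < window for j in range(n_frames)]
            for i in range(n_frames)]


if __name__ == "__main__":
    print(smooth_l1([0.5, 3.0], [0.0, 0.0]))
    for row in causal_window_mask(4, 2):
        print(row)
```

In an attention layer, positions where the mask is False would typically have their scores set to negative infinity before the softmax, so that each output frame depends only on the current and recent past frames.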
engineering, electrical & electronic