Abstract: Recent speech enhancement (SE) methods have incorporated attention mechanisms to learn long-term dependencies in speech signals. Deep convolutional networks (DCNs) equipped with self-attention (SA) and transformers have shown promising results in SE. Although self-attention networks excel at extracting significant long-sequence contextual information in mining tasks, they may not concentrate effectively on subtle aspects of speech signals, such as temporal or spectral continuity, spectral structure, and timbre. To address this problem, we propose a novel speech enhancement model based on adaptive attention. The proposed model places local and global attention modules between a convolutional encoder and a convolutional decoder. The local attention module (LAM) integrates channel and spatial attention, which makes the model attend more closely to local details within a speech block, specifically the frame-level features. Utterance-level features are explored through a self-attention mechanism in the global attention module (GAM). Unlike existing transformers, the feed-forward network of the GAM is augmented with a 1D convolutional (1D-Conv) layer and a bidirectional long short-term memory (Bi-LSTM) layer for extracting global features, so that the network can model long-sequence context more effectively. A CNN module is also added to the GAM so that short-term noise can be reduced more effectively, exploiting the ability of CNNs to extract local information. Unlike current speech enhancement techniques that rely solely on self-attention networks, our approach models the speech signal with two different attention networks simultaneously, so that both the local detail and the global context of speech are considered and useful information is extracted from the signal more effectively. The effectiveness of the proposed model is assessed using both objective (PESQ and STOI) and subjective tests (signal distortion (CSIG), background distortion (CBAK), and overall quality (COVL)) on two distinct datasets: the Voice Bank-Demand dataset and the LibriSpeech dataset. The experimental findings demonstrate that our model outperforms the competing baselines on both datasets.
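
As an illustration of how channel and spatial attention can be combined on frame-level features, the following is a minimal sketch and not the implementation used in this work. It assumes a feature map of shape [channels][frames][frequency bins], applies channel attention before spatial attention (the order is an assumption), and replaces the learned layers that such blocks normally contain with a fixed sigmoid over average- and max-pooled statistics. All function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(x):
    """Scale each channel by a weight derived from its pooled statistics.

    x is a nested list of shape [C][T][F]. Assumption: the learned layers a
    real channel-attention block applies to the pooled values are omitted.
    """
    out = []
    for channel in x:
        values = [v for frame in channel for v in frame]
        avg, mx = sum(values) / len(values), max(values)
        w = sigmoid(avg + mx)  # one scalar weight per channel, in (0, 1)
        out.append([[w * v for v in frame] for frame in channel])
    return out


def spatial_attention(x):
    """Scale each time-frequency position by a weight pooled over channels."""
    n_ch, n_t, n_f = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * n_f for _ in range(n_t)] for _ in range(n_ch)]
    for t in range(n_t):
        for f in range(n_f):
            column = [x[c][t][f] for c in range(n_ch)]
            w = sigmoid(sum(column) / n_ch + max(column))
            for c in range(n_ch):
                out[c][t][f] = w * x[c][t][f]
    return out


def local_attention(x):
    """Channel attention followed by spatial attention (assumed order)."""
    return spatial_attention(channel_attention(x))


# Toy input: 2 channels, 2 frames, 2 frequency bins.
features = [
    [[0.5, 1.0], [0.0, 2.0]],
    [[-1.0, 0.2], [0.3, -0.4]],
]
refined = local_attention(features)
```

Because both gates lie strictly between 0 and 1, the output keeps the input shape and sign pattern and only re-weights magnitudes. In a trained model the weights would come from learned layers, so channels and positions carrying speech detail could be emphasized over noisy ones.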

MSAF: A Multiple Self-Attention Field Method for Speech Enhancement

MAMGAN: Multiscale attention metric GAN for monaural speech enhancement in the time domain

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Time Domain Speech Enhancement Using Self-Attention-Based Subspace Projection

Multiple Generator Gan with Self-Attention for Speech Enhancement

SASEGAN-TCN: Speech enhancement algorithm based on self-attention generative adversarial network and temporal convolutional network

Self-Attention Generative Adversarial Network for Speech Enhancement

CGA-MGAN: Metric GAN Based on Convolution-Augmented Gated Attention for Speech Enhancement

BSS-CFFMA: Cross-Domain Feature Fusion and Multi-Attention Speech Enhancement Network based on Self-Supervised Embedding

Adaptive attention mechanism for single channel speech enhancement

A Multiscale Autoencoder (MSAE) Framework for End-to-End Neural Network Speech Enhancement

SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition

Multi-CMGAN+/+: Leveraging Multi-Objective Speech Quality Metric Prediction for Speech Enhancement

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition.

CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement

A Multiobjective Learning and Ensembling Approach to High-Performance Speech Enhancement with Compact Neural Network Architectures

Attention does not guarantee best performance in speech enhancement

SE-MelGAN -- Speaker Agnostic Rapid Speech Enhancement

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

Efficient Monaural Speech Enhancement using Spectrum Attention Fusion