Abstract: Airborne speech enhancement has long been a major challenge for the security of airborne systems. Recently, multi-objective learning has become one of the mainstream approaches to monaural speech enhancement. In this paper, we propose a novel multi-objective method for airborne speech enhancement, called the stacked multiscale densely connected temporal convolutional attention network (SMDTANet). The core of SMDTANet comprises three parts: a stacked multiscale feature extractor, a triple-attention-based temporal convolutional neural network (TA-TCNN), and a densely connected prediction module. The stacked multiscale feature extractor captures comprehensive feature information from noisy log-power spectra (LPS) inputs. The TA-TCNN then takes a combination of these multiscale features and noisy amplitude modulation spectrogram (AMS) features as inputs to strengthen its temporal modeling capability. Within the TA-TCNN, we combine the advantages of channel attention, spatial attention, and time-frequency (T-F) attention to design a novel triple-attention module, which guides the network to suppress irrelevant information and emphasize informative features from different views. The densely connected prediction module controls the flow of information to provide an accurate estimate of the clean LPS and the ideal ratio mask (IRM). Moreover, a new joint-weighted (JW) loss function is constructed to further improve performance without adding to the model complexity. Extensive experiments under real-world airborne conditions show that SMDTANet achieves on-par or better performance than the reference methods on all objective metrics of speech quality and intelligibility.
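The two learning targets named in the abstract, the clean LPS and the IRM, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses the commonly cited textbook definitions (LPS as the log of the squared STFT magnitude, IRM as the speech-to-mixture power ratio raised to an exponent), and the function names, the exponent `beta = 0.5`, the `eps` floor, and the toy magnitudes are all assumptions made for illustration.

```python
import numpy as np

def log_power_spectrum(mag, eps=1e-12):
    """LPS: natural log of the squared STFT magnitude (eps avoids log(0))."""
    return np.log(mag ** 2 + eps)

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5, eps=1e-12):
    """IRM per time-frequency bin: (S^2 / (S^2 + N^2)) ** beta.

    beta = 0.5 is a common choice in the literature; it is assumed here,
    not taken from the paper.
    """
    clean_pow = clean_mag ** 2
    noise_pow = noise_mag ** 2
    return (clean_pow / (clean_pow + noise_pow + eps)) ** beta

# Hypothetical toy magnitudes with shape (frames, frequency bins).
clean_mag = np.array([[3.0, 0.0], [1.0, 2.0]])
noise_mag = np.array([[4.0, 1.0], [1.0, 0.0]])

irm = ideal_ratio_mask(clean_mag, noise_mag)   # values lie in [0, 1]
clean_lps = log_power_spectrum(clean_mag)      # regression target alongside the IRM
```

In a multi-objective setup of the kind the abstract describes, a network would be trained to predict both `clean_lps` and `irm` from noisy input features; how the two losses are weighted in the JW loss is not specified in the abstract and is not modeled here.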

U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

An Attention-Based Neural Network Approach For Single Channel Speech Enhancement

Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

Efficient Monaural Speech Enhancement using Spectrum Attention Fusion

Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses

Two-stage unet with channel and temporal-frequency attention for multi-channel speech enhancement

Parallel Gated Neural Network With Attention Mechanism For Speech Enhancement

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

SE Territory: Monaural Speech Enhancement Meets the Fixed Virtual Perceptual Space Mapping

Dual-Branch Attention-In-Attention Transformer for Single-Channel Speech Enhancement

Uformer: A Unet Based Dilated Complex & Real Dual-Path Conformer Network for Simultaneous Speech Enhancement and Dereverberation

A Nested U-Net with Efficient Channel Attention and D3Net for Speech Enhancement

Stacked Multiscale Densely Connected Temporal Convolutional Attention Network for Multi-Objective Speech Enhancement in an Airborne Environment

Convolutional Recurrent Neural Network with Attention for 3D Speech Enhancement

MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions

A Feature Integration Network for Multi-Channel Speech Enhancement

PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement

Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention