Abstract: Airborne speech enhancement has long been a major challenge for the security of airborne systems. Recently, multi-objective learning has become one of the mainstream approaches to monaural speech enhancement. In this paper, we propose a novel multi-objective method for airborne speech enhancement, called the stacked multiscale densely connected temporal convolutional attention network (SMDTANet). The core of SMDTANet consists of three parts: a stacked multiscale feature extractor, a triple-attention-based temporal convolutional neural network (TA-TCNN), and a densely connected prediction module. The stacked multiscale feature extractor captures comprehensive feature information from noisy log-power spectra (LPS) inputs. The TA-TCNN then takes a combination of these multiscale features and noisy amplitude modulation spectrogram (AMS) features as input to strengthen its temporal modeling capability. Within the TA-TCNN, we combine the advantages of channel attention, spatial attention, and time-frequency (T-F) attention in a novel triple-attention module, which guides the network to suppress irrelevant information and emphasize informative features from different views. The densely connected prediction module controls the flow of information to provide an accurate estimate of the clean LPS and the ideal ratio mask (IRM). Moreover, a new joint-weighted (JW) loss function is constructed to further improve performance without adding to the model complexity. Extensive experiments under real-world airborne conditions show that SMDTANet performs on par with or better than the reference methods on all objective metrics of speech quality and intelligibility.
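For readers unfamiliar with the two learning targets named in the abstract, the following is a minimal sketch of how an LPS and an IRM are commonly computed from magnitude spectrograms. It is not taken from the paper: the square-root form of the IRM, the `eps` stabilizer, the function names, and the toy values are all assumptions made for illustration.

```python
import numpy as np

def log_power_spectrum(magnitude, eps=1e-10):
    """Log-power spectrum (LPS) of a magnitude spectrogram.

    eps is an assumed small constant that avoids log(0).
    """
    return np.log(magnitude ** 2 + eps)

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-10):
    """Ideal ratio mask (IRM) per time-frequency bin.

    Uses the common square-root definition
    sqrt(|S|^2 / (|S|^2 + |N|^2)); the paper may use a variant.
    """
    clean_pow = clean_mag ** 2
    noise_pow = noise_mag ** 2
    return np.sqrt(clean_pow / (clean_pow + noise_pow + eps))

# Toy magnitude spectrograms: 4 frequency bins x 3 frames (made-up values).
rng = np.random.default_rng(0)
clean_mag = rng.uniform(0.1, 1.0, size=(4, 3))
noise_mag = rng.uniform(0.1, 1.0, size=(4, 3))

clean_lps = log_power_spectrum(clean_mag)      # regression target 1
irm = ideal_ratio_mask(clean_mag, noise_mag)   # regression target 2, in [0, 1]
```

In a multi-objective setup of this kind, a network would be trained to predict both `clean_lps` and `irm` from noisy input features; how the two estimates are weighted in the JW loss is not specified in the abstract and is not sketched here.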

A Nested U-Net with Efficient Channel Attention and D3Net for Speech Enhancement

CAT-DUnet: Enhancing Speech Dereverberation via Feature Fusion and Structural Similarity Loss

Convolutional Recurrent Neural Network with Attention for 3D Speech Enhancement

Speech Enhancement Using U-Net with Compressed Sensing

A Multi-scale Subconvolutional U-Net with Time-Frequency Attention Mechanism for Single Channel Speech Enhancement

U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention

Multichannel Speech Enhancement without Beamforming

Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

Stacked Multiscale Densely Connected Temporal Convolutional Attention Network for Multi-Objective Speech Enhancement in an Airborne Environment

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

A Subconvolutional U-net with Gated Recurrent Unit and Efficient Channel Attention Mechanism for Real-Time Speech Enhancement

MB-DECTNet: A Model-Based Unrolled Network for Accurate 3D DECT Reconstruction

Deep Complex U-Net with Conformer for Audio-Visual Speech Enhancement

Explore Relative and Context Information with Transformer for Joint Acoustic Echo Cancellation and Speech Enhancement

RelUNet: Relative Channel Fusion U-Net for Multichannel Speech Enhancement

3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

Multi-Channel Automatic Speech Recognition Using Deep Complex Unet

Single-Channel Speech Enhancement with Deep Complex U-Networks and Probabilistic Latent Space Models

Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses

End-to-End Complex-Valued Multidilated Convolutional Neural Network for Joint Acoustic Echo Cancellation and Noise Suppression