Abstract: Airborne speech enhancement has long been a major challenge for the security of airborne systems. Recently, multi-objective learning has become one of the mainstream approaches to monaural speech enhancement. In this paper, we propose a novel multi-objective method for airborne speech enhancement, called the stacked multiscale densely connected temporal convolutional attention network (SMDTANet). SMDTANet consists of three core parts: a stacked multiscale feature extractor, a triple-attention-based temporal convolutional neural network (TA-TCNN), and a densely connected prediction module. The stacked multiscale feature extractor captures comprehensive feature information from noisy log-power spectra (LPS) inputs. The TA-TCNN then takes a combination of these multiscale features and noisy amplitude modulation spectrogram (AMS) features as inputs to strengthen its temporal modeling capability. Within the TA-TCNN, we combine the advantages of channel attention, spatial attention, and time-frequency (T-F) attention to design a novel triple-attention module, which guides the network to suppress irrelevant information and emphasize informative features from different views. The densely connected prediction module controls the flow of information to provide an accurate estimation of the clean LPS and the ideal ratio mask (IRM). Moreover, a new joint-weighted (JW) loss function is constructed to further improve performance without increasing model complexity. Extensive experiments under real-world airborne conditions show that SMDTANet achieves on-par or better performance compared with the reference methods on all objective metrics of speech quality and intelligibility.
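To make the two learning targets named in the abstract concrete, the following is a minimal NumPy sketch of how LPS features and an IRM target are commonly computed, together with a generic weighted two-target loss. It is an illustration under stated assumptions, not the authors' implementation: the frame length, hop size, the square-root form of the IRM, and the helper names (`stft_mag`, `log_power_spectrum`, `ideal_ratio_mask`, `weighted_two_target_loss`) are all hypothetical, and the weighted loss is a plain convex combination of two mean-squared errors rather than the paper's JW loss, whose exact form the abstract does not give.

```python
import numpy as np


def stft_mag(x, frame_len=512, hop=256):
    """Magnitude spectrogram (frames x bins) from a Hann-windowed framed FFT.

    frame_len and hop are illustrative choices, not values from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def log_power_spectrum(mag, eps=1e-10):
    """Log-power spectrum: natural log of squared magnitude (eps avoids log(0))."""
    return np.log(mag ** 2 + eps)


def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-10):
    """Common square-root IRM, sqrt(S^2 / (S^2 + N^2)); assumed form."""
    s2 = clean_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))


def weighted_two_target_loss(lps_pred, lps_true, irm_pred, irm_true, alpha=0.5):
    """Generic weighted sum of two MSE terms (NOT the paper's JW loss)."""
    mse_lps = np.mean((lps_pred - lps_true) ** 2)
    mse_irm = np.mean((irm_pred - irm_true) ** 2)
    return alpha * mse_lps + (1.0 - alpha) * mse_irm


# Synthetic one-second example at 16 kHz: a 440 Hz tone plus white noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noise = 0.3 * rng.standard_normal(16000)
noisy = clean + noise

noisy_lps = log_power_spectrum(stft_mag(noisy))   # network input features
clean_lps = log_power_spectrum(stft_mag(clean))   # first regression target
irm = ideal_ratio_mask(stft_mag(clean), stft_mag(noise))  # second target
```

In a multi-objective setup of this kind, a network would map `noisy_lps` (possibly with further features such as AMS) to estimates of both `clean_lps` and `irm`, and a loss like `weighted_two_target_loss` would trade the two errors off through `alpha`.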

Adversarial Multi-Task Learning with Inverse Mapping for Speech Enhancement

Multi-Task Joint Learning for Embedding Aware Audio-Visual Speech Enhancement

Self-Supervised Adversarial Multi-Task Learning for Vocoder-Based Monaural Speech Enhancement

Speech enhancement aided end-to-end multi-task learning for voice activity detection

Rethinking and Improving Multi-task Learning for End-to-end Speech Translation

Training Multi-Task Adversarial Network for Extracting Noise-Robust Speaker Embedding

Adaptive multi-task learning for speech to text translation

Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Adversarially Learning Disentangled Speech Representations for Robust Multi-Factor Voice Conversion

Adversarial Meta Sampling for Multilingual Low-Resource Speech Recognition

Multi-Task Learning Improves Synthetic Speech Detection

SE Territory: Monaural Speech Enhancement Meets the Fixed Virtual Perceptual Space Mapping

An Adversarial Learning based Multi-Step Spoken Language Understanding System through Human-Computer Interaction

Stacked Multiscale Densely Connected Temporal Convolutional Attention Network for Multi-Objective Speech Enhancement in an Airborne Environment

Multi-Stage Progressive Speech Enhancement Network

An Iterative Post-processing Approach for Speech Enhancement

Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning

Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech

Multi-Objective Learning and Mask-Based Post-Processing for Deep Neural Network Based Speech Enhancement

Speaker-Invariant Training Via Adversarial Learning