Abstract: This paper proposes a multi-task learning U-shaped neural network (MTU-Net) and applies it to single-channel speech enhancement (SE). The MTU-Net-based SE method estimates an ideal binary mask (IBM) or an ideal ratio mask (IRM) by extending the decoding network of a conventional U-Net to model the speech and noise spectra simultaneously as targets. Its effectiveness was evaluated under both matched and mismatched noise conditions between training and testing, using the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI). With the IRM, the proposed SE method achieved average PESQ scores 0.17, 0.52, and 0.40 higher than those of three state-of-the-art deep-learning-based methods: the deep recurrent neural network (DRNN), the SE generative adversarial network (SEGAN), and the conventional U-Net, respectively. Its STOI scores were 0.07, 0.05, and 0.05 higher than those of the DRNN, SEGAN, and U-Net, respectively. In addition, a voice activity detection (VAD) method is proposed that uses the IRM estimated by the MTU-Net-based SE method; this VAD method is fundamentally unsupervised and requires no model training. Its performance was compared with that of supervised learning-based methods using a deep neural network (DNN), a boosted DNN, and a long short-term memory (LSTM) network. The proposed VAD method performed slightly better than these three neural network-based methods under mismatched noise conditions.
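
To make the mask terminology in the abstract concrete, the following is a minimal NumPy sketch of how an IBM and an IRM can be computed from known speech and noise magnitude spectra, and how a frame-level VAD decision could be derived from a mask. It is not the authors' implementation: the square-root IRM form, the 0 dB local criterion for the IBM, the frame-averaged thresholding rule, and all function names and parameters (`ideal_ratio_mask`, `ideal_binary_mask`, `mask_based_vad`, `lc_db`, `threshold`) are assumptions chosen for illustration. In the proposed method the masks are estimated by the network; here they are computed from ground-truth spectra only to show the definitions.

```python
import numpy as np

EPS = 1e-8  # small constant to avoid division by zero and log(0)


def ideal_ratio_mask(speech_mag, noise_mag):
    """IRM per time-frequency bin (a commonly used square-root form, assumed here).

    Inputs are magnitude spectrograms of shape (freq_bins, frames).
    """
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + EPS))


def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """IBM per bin: 1 where the local SNR exceeds lc_db, else 0.

    lc_db (local criterion) is a hypothetical parameter name; 0 dB is assumed.
    """
    snr_db = 20.0 * np.log10((speech_mag + EPS) / (noise_mag + EPS))
    return (snr_db > lc_db).astype(np.float32)


def mask_based_vad(mask, threshold=0.5):
    """Illustrative training-free VAD: average the mask over frequency in each
    frame and mark the frame as speech when the average exceeds a threshold.

    This decision rule is an assumption, not the rule used in the paper.
    """
    frame_score = mask.mean(axis=0)  # shape: (frames,)
    return frame_score > threshold


if __name__ == "__main__":
    # Toy example: 2 frequency bins, 2 frames. Frame 0 has speech, frame 1 does not.
    speech = np.array([[4.0, 0.0], [4.0, 0.0]])
    noise = np.array([[3.0, 1.0], [3.0, 1.0]])

    irm = ideal_ratio_mask(speech, noise)
    ibm = ideal_binary_mask(speech, noise)
    print("IRM:\n", irm)
    print("IBM:\n", ibm)
    print("VAD decisions per frame:", mask_based_vad(irm))
```

Because the VAD step only thresholds a mask that the SE network already produces, it needs no additional labelled data, which is consistent with the abstract's description of it as training-free.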

SMMA-Net: An Audio Clue-Based Target Speaker Extraction Network with Spectrogram Matching and Mutual Attention

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Target Sound Extraction with Variable Cross-modality Clues

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network

TSELM: Target Speaker Extraction using Discrete Tokens and Language Models

Atss-Net: Target Speaker Separation via Attention-based Neural Network

MSFNet: Multi-Scale Fusion Network for Brain-Controlled Speaker Extraction

Multi-Level Speaker Representation for Target Speaker Extraction

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications

MTI-Net: A Multi-Target Speech Intelligibility Prediction Model

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

Multi-Task Learning U-Net for Single-Channel Speech Enhancement and Mask-Based Voice Activity Detection

BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions

Binaural Selective Attention Model for Target Speaker Extraction

MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues