Abstract: Multi-channel speech enhancement (MCSE) systems apply speech enhancement algorithms at multiple levels to improve the quality of speech signals in noisy environments. Numerous existing algorithms filter noise in speech enhancement systems, which are typically employed as a pre-processor to reduce noise and improve speech quality. However, they may perform poorly at low signal-to-noise ratios (SNRs). Speech devices are exposed to all kinds of environmental noise, which may reach high levels. The objective of this research is to conduct a noise reduction experiment for an MCSE system under stationary and non-stationary environmental noise at varying speech SNR levels. The experiments examined how well the existing and the proposed MCSE systems filter environmental noise from low to high SNRs (−10 dB to 20 dB). The experiments were conducted using the AURORA and LibriSpeech datasets, which include different types of environmental noise. The existing MCSE system (BAV-MCSE) uses beamforming, adaptive noise reduction, and voice activity detection (BAV) algorithms to filter noise from speech signals. The proposed MCSE system (DWT-CNN-MCSE) is based on discrete wavelet transform (DWT) preprocessing and a convolutional neural network (CNN) that denoises the noisy input speech signals to improve accuracy. The performance of the existing BAV-MCSE and the proposed DWT-CNN-MCSE was measured using spectrogram analysis and word recognition rate (WRR). The existing BAV-MCSE achieved its highest WRR of 93.77% at a high SNR (20 dB) but only 5.64% on average at a low SNR (−10 dB) across different noises. The proposed DWT-CNN-MCSE system performed well at low SNR, reaching a WRR of 70.55% at −10 dB SNR, where it also showed its largest improvement over BAV-MCSE (64.91 percentage points of WRR).
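
The two quantities the evaluation relies on, the SNR of a noisy test signal and the WRR of a recognizer's output, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `mix_at_snr`, `measured_snr_db`, and `word_recognition_rate` are hypothetical, the SNR grid in the demo is illustrative, and WRR is assumed here to be 100 × (1 − word error rate), a definition the abstract does not state.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`, then add it."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # From SNR_dB = 10 * log10(P_speech / P_noise), solve for the target noise power.
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_p_noise / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """Recover the SNR in dB from a clean signal and its noisy version."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_resid = sum((y - s) ** 2 for s, y in zip(speech, noisy)) / len(speech)
    return 10 * math.log10(p_speech / p_resid)


def word_recognition_rate(reference, hypothesis):
    """WRR in percent, assumed here to be 100 * (1 - word-level edit distance / N)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 100.0 * (1 - prev[-1] / len(ref))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins for a speech frame and an environmental noise frame.
    speech = [math.sin(2 * math.pi * 0.01 * t) for t in range(2000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    for snr_db in (-10, 0, 10, 20):  # illustrative grid within the -10 dB to 20 dB range
        noisy = mix_at_snr(speech, noise, snr_db)
        print(snr_db, round(measured_snr_db(speech, noisy), 3))
    print(word_recognition_rate("turn on the light", "turn off the light"))
```

Under this assumed definition, one substituted word in a four-word reference gives a WRR of 75%. The improvement reported at −10 dB is the difference of the two WRR figures in the abstract, 70.55% − 5.64% = 64.91 percentage points.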
