Abstract: A multi-level distortion measure (MLDM) is proposed as an objective for optimizing deep neural network-based speech enhancement (SE) in both audio-only and audio-visual scenarios. The aim is to simultaneously improve speech quality and intelligibility and reduce recognition errors. A comprehensive correlation analysis shows that these three evaluation metrics exhibit high Pearson correlation coefficient (PCC) values with three commonly used optimization objectives: the mean squared error between the ideal ratio mask and the estimated magnitude mask, the scale-invariant signal-to-noise ratio, and a cross-entropy-guided measure. To further improve performance, we leverage the complementarity of the three objectives and propose a correlated multi-level distortion measure (C-MLDM), defined as a weighted combination of MLDM and an average correlation measure based on the three PCCs. Experimental results on the TCD-TIMIT corpus corrupted by additive noise demonstrate that MLDM outperforms systems optimized with each individual objective in both audio-visual and audio-only scenarios, improving all three metrics: speech quality, intelligibility, and recognition performance. C-MLDM also consistently outperforms MLDM in all test cases. Finally, the generalizability of both MLDM and C-MLDM is confirmed through extensive testing across diverse datasets, SE model architectures, and linguistic conditions.
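
For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of their standard textbook definitions: the mean squared error between two masks, the scale-invariant signal-to-noise ratio, and the Pearson correlation coefficient. The function names (`mask_mse`, `si_snr`, `pearson`, `average_pcc`) are illustrative assumptions, not taken from the paper. The abstract does not specify how MLDM or C-MLDM weight and combine these terms, so no combination is shown.

```python
import math


def mask_mse(ideal_mask, estimated_mask):
    """Mean squared error between two equal-length flattened masks."""
    n = len(ideal_mask)
    return sum((a - b) ** 2 for a, b in zip(ideal_mask, estimated_mask)) / n


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length waveforms."""
    n = len(target)
    # Remove the means so a DC offset does not affect the measure.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to isolate the target component.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    s_target = [dot / t_energy * b for b in t]
    # Whatever is left after the projection is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_pcc(pccs):
    """Plain arithmetic mean of several PCC values (an assumed averaging rule)."""
    return sum(pccs) / len(pccs)
```

In training, SI-SNR is usually negated so that a lower loss is better. Whether the paper does so, and how it averages the three PCCs, is not stated in the abstract.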

Compensation of Speech Enhancement Distortion for Robust Speech Recognition

Research on Bandwidth Mismatch Compensation in Speech Recognition

Simplified Deformation Compensation for Emotional Speaker Recognition

Robust speaker recognition using glottal information‐based cepstral mean subtraction

Statistical Thresholding for Robust ASR

Channel Compensation for Robust Telephone Speech Recognition

An Efficient Robust ASR System Based on the Combination of Speech Enhancement and HMM Adaptation

Cepstral Shape Normalization (CSN) for Robust Speech Recognition

Modified MFCCs for Robust Speaker Recognition

Speech and noise dual-stream spectrogram refine network with speech distortion loss for robust speech recognition

Robust Speech Recognition Method Based on Discriminative Environment Feature Extraction

An Algorithm of Model Compensation Based on the Estimation of Additive Noise and Channel Function for Speech Recognition

Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition

Robust Speech Recognition Method Based on Discriminative Learning of Environmental Features

Robust telephone speech recognition based on channel compensation

The predictive differential amplitude spectrum for robust speaker recognition in stationary noises

Joint compensation of noise and channel in speech recognition

Robust Speaker Recognition in Cross-Channel Condition

Correlated Multi-Level Speech Enhancement for Robust Real-World ASR Applications Using Mask-Waveform-Feature Optimization

Channel Compensation Technique HNSSM for Speaker Recognition

A Feature Compensation Approach Using Piecewise Linear Approximation of an Explicit Distortion Model for Noisy Speech Recognition