Abstract: Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination for real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used at the front-end of the target audio applications. In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we make use of the sample-wise performance measure as an indication of sample importance. In experiments, we consider four representative applications to evaluate our training paradigm, i.e., ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. These applications are associated with speech and non-speech tasks concerning semantic and non-semantic features, transient and global information, and the experimental results indicate that our proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios (SNRs), for a wide range of computer audition tasks in everyday-life noisy environments.

What problem does this paper attempt to address?

The paper aims to **improve the noise robustness of Computer Audition (CA) tasks in practical application scenarios**. Neural network models for audio tasks such as Automatic Speech Recognition (ASR) and Acoustic Scene Classification (ASC) are easily degraded by noise contamination in real-world applications. To improve audio quality, an independently developed enhancement module is usually placed at the front end of the target audio application. However, this approach may introduce unwanted distortion and artifacts, which limits the performance gain.

To address this, the paper proposes an end-to-end learning solution that jointly optimizes the Audio Enhancement (AE) model and the subsequent application model. Sample-level performance measures serve as an indication of sample importance, so that the AE module is guided towards the target application and pays particular attention to difficult samples. Experimental results show that the proposed training paradigm considerably improves the noise robustness of a range of computer audition tasks in noisy everyday environments, especially at low Signal-to-Noise Ratios (SNRs).

### Main contributions

1. **Joint optimization framework**: An iterative training paradigm is proposed in which the AE module and the CA task model reinforce each other and are jointly optimized.
2. **Sample importance guidance**: Sample difficulty is measured by the sample-level loss, which guides the AE module to focus on the samples that matter most for the CA task.
3. **Wide applicability**: The method is evaluated on four representative tasks (ASR, SCR, SER, and ASC), covering speech and non-speech tasks with different characteristics.
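The idea of using the sample-level loss as a measure of sample importance can be illustrated with a small NumPy sketch. The summary does not state how the paper turns losses into weights, so the mean-normalisation below and the names `importance_weights` and `weighted_ae_loss` are assumptions made purely for illustration, not the authors' method.

```python
import numpy as np


def importance_weights(ca_losses, eps=1e-8):
    """Map per-sample CA losses to weights with mean ~1.

    Assumption for illustration: a sample's weight is its CA loss divided
    by the batch mean, so harder samples (larger loss) get larger weights.
    """
    ca_losses = np.asarray(ca_losses, dtype=float)
    return ca_losses / (ca_losses.mean() + eps)


def weighted_ae_loss(ae_losses, ca_losses):
    """Average the per-sample AE losses, weighted by sample importance."""
    weights = importance_weights(ca_losses)
    return float(np.mean(weights * np.asarray(ae_losses, dtype=float)))


# Hypothetical batch: the third sample is much harder for the CA model.
ca_losses = [0.1, 0.1, 1.0]
ae_losses = [0.2, 0.3, 0.9]
print(importance_weights(ca_losses))
print(weighted_ae_loss(ae_losses, ca_losses))
```

In an iterative scheme of the kind described, such weights would be recomputed from the current CA model before each AE update, so the notion of a "difficult sample" follows the CA model as it improves.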
### Mathematical formulas

The formulas in the paper mainly describe the loss function and the optimization process of the model.

Weighted SDR loss function:

\[ L_{\text{AE}}(x, \hat{x}) = \alpha L_{\text{SDR}}(x, \hat{x}) + (1 - \alpha) L_{\text{SDR}}(n, \hat{n}) \]

Here \(x\) is the clean signal, \(\hat{x}\) is the enhanced signal, and \(y\) is the noisy input. The true and estimated noise signals are

\[ n = y - x \quad\text{and}\quad \hat{n} = y - \hat{x} \]

The Signal-to-Distortion Ratio (SDR) loss term is the negative cosine similarity between a signal and its estimate:

\[ L_{\text{SDR}}(x, \hat{x}) = -\frac{\langle x, \hat{x}\rangle}{\|x\| \cdot \|\hat{x}\|} \]

The weight \(\alpha\) is the fraction of the total energy that belongs to the clean signal:

\[ \alpha = \frac{\|x\|^{2}}{\|x\|^{2} + \|n\|^{2}} \]

Through these improvements, the paper provides an effective method to improve the robustness and performance of computer audition systems in complex noise environments.
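The weighted SDR loss defined above translates directly into a few lines of NumPy. This is a minimal sketch for single-channel 1-D signals; the function names and the small `eps` added for numerical stability are assumptions and are not taken from the paper.

```python
import numpy as np


def sdr_loss(target, estimate, eps=1e-8):
    """Negative cosine similarity between a signal and its estimate."""
    num = np.dot(target, estimate)
    den = np.linalg.norm(target) * np.linalg.norm(estimate) + eps
    return -num / den


def weighted_sdr_loss(noisy, clean, enhanced, eps=1e-8):
    """L_AE = alpha * L_SDR(x, x_hat) + (1 - alpha) * L_SDR(n, n_hat)."""
    noise = noisy - clean          # n = y - x
    noise_est = noisy - enhanced   # n_hat = y - x_hat
    energy_clean = np.sum(clean ** 2)
    energy_noise = np.sum(noise ** 2)
    # alpha is the share of the total energy carried by the clean signal
    alpha = energy_clean / (energy_clean + energy_noise + eps)
    return (alpha * sdr_loss(clean, enhanced, eps)
            + (1.0 - alpha) * sdr_loss(noise, noise_est, eps))


# Synthetic example: one second of random "speech" plus weaker random noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.5 * rng.standard_normal(16000)
print(weighted_sdr_loss(noisy, clean, enhanced=clean))  # perfect enhancement
print(weighted_sdr_loss(noisy, clean, enhanced=noisy))  # no enhancement
```

The loss is bounded between -1 and 1 and approaches -1 as the enhanced signal approaches the clean one. At low SNR, \(\alpha\) is small, so the noise-estimation term carries most of the weight.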

Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance

CochleaNet: A Robust Language-independent Audio-Visual Model for Speech Enhancement

Deep Neural Network Based Noised Asian Speech Enhancement and Its Implementation on a Hearing Aid App.

Non-Intrusive Speech Quality Assessment Based on Deep Neural Networks for Speech Communication

A Hybrid Speech Enhancement Algorithm for Voice Assistance Application

Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training

Dynamic noise aware training for speech enhancement based on deep neural networks.

A Refining Underlying Information Framework for Monaural Speech Enhancement

A Speech Enhancement Algorithm Based on Computational Auditory Scene Analysis

Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition

Characterization of Deep Learning-Based Speech-Enhancement Techniques in Online Audio Processing Applications

Speech Intelligibility Based Enhancement System Using Modified Deep Neural Network and Adaptive Multi-band Spectral Subtraction

A Multiscale Autoencoder (MSAE) Framework for End-to-End Neural Network Speech Enhancement

An objective evaluation of Hearing Aids and DNN-based speech enhancement in complex acoustic scenes

Computer Audition: From Task-Specific Machine Learning to Foundation Models

Hybrid Noise Reduction And Enhancement of Audio Quality using Deep Learning

High-Fidelity Noise Reduction with Differentiable Signal Processing

Improving Robustness and Clinical Applicability of Automatic Respiratory Sound Classification Using Deep Learning-Based Audio Enhancement: Algorithm Development and Validation Study

Real-time multichannel deep speech enhancement in hearing aids: Comparing monaural and binaural processing in complex acoustic scenarios

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments