Abstract: Automatic emotion recognition (AER) systems are burgeoning, and systems based on audio, video, text, or physiological signals have emerged. Multimodal systems, in turn, have been shown to improve overall AER accuracy and to provide some robustness against artifacts and missing data. Collecting multiple signal modalities, however, can be intrusive, time consuming, and expensive. Recent advances in deep learning based speech-to-text and natural language processing systems have enabled the development of reliable multimodal systems based on speech and text that require only the collection of audio data. Audio data, however, is extremely sensitive to environmental disturbances, such as additive noise, and thus faces challenges when deployed “in the wild.” To overcome this issue, speech enhancement algorithms have been deployed at the input signal level to improve testing accuracy in noisy conditions. Speech enhancement algorithms come in different flavors and can be optimized for different tasks (e.g., for human perception vs. machine performance). Data augmentation, in turn, has been deployed at the model level during training to improve accuracy in noisy testing conditions. In this paper, we explore the combination of task-specific speech enhancement and data augmentation as a strategy to improve overall multimodal emotion recognition in noisy conditions. We show that AER accuracy under noisy conditions can be improved to levels close to those seen in clean conditions. When compared against a system without speech enhancement or data augmentation, an increase in AER accuracy of 40% was seen in a cross-corpus test, showing promising results for “in the wild” AER.
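
As a rough illustration of the additive-noise data augmentation mentioned in the abstract, the sketch below mixes a noise recording into a clean utterance at a chosen signal-to-noise ratio (SNR). It is a minimal example under stated assumptions, not the pipeline used in the paper: the function names `mix_at_snr` and `measured_snr_db`, the use of plain Python lists as waveforms, and the choice to tile short noise clips are all hypothetical.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so the mixture has the requested SNR (dB).

    Hypothetical helper for illustration; waveforms are plain lists of floats.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    noise = list(noise)
    if len(noise) < len(speech):
        # Assumption: repeat a short noise clip until it covers the utterance.
        noise = noise * math.ceil(len(speech) / len(noise))
    noise = noise[:len(speech)]

    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0.0 or noise_power == 0.0:
        # Nothing meaningful to scale; return the clean signal unchanged.
        return list(speech)

    # Choose gain g so that speech_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """Compute the SNR (dB) of a mixture given the clean reference signal."""
    residual = [y - s for s, y in zip(speech, noisy)]
    speech_power = sum(s * s for s in speech) / len(speech)
    residual_power = sum(r * r for r in residual) / len(residual)
    return 10 * math.log10(speech_power / residual_power)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [math.sin(0.01 * i) for i in range(1000)]  # stand-in for speech
    babble = [rng.gauss(0.0, 1.0) for _ in range(300)]  # stand-in for noise
    for snr in (0, 10, 20):
        noisy = mix_at_snr(clean, babble, snr)
        print(snr, round(measured_snr_db(clean, noisy), 3))
```

During training, one would typically draw the SNR and the noise clip at random for each utterance so the model sees a range of conditions; which noise types and SNR ranges to use is a design choice this sketch does not settle.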

Task-specific speech enhancement and data augmentation for improved multimodal emotion recognition under noisy conditions
