Abstract: Multimodal emotion recognition is a challenging task in emotion computing, because it is difficult to extract discriminative features that capture the subtle differences among human emotions, which are abstract concepts with multiple forms of expression. Moreover, how to fully utilize both audio and visual information is still an open problem. In this paper, we propose a novel multimodal fusion attention network for audio-visual emotion recognition based on adaptive and multi-level factorized bilinear pooling (FBP). First, for the audio stream, a fully convolutional network (FCN) equipped with a 1-D attention mechanism and local response normalization is designed for speech emotion recognition. Next, a global FBP (G-FBP) approach is presented to perform audio-visual information fusion by integrating a self-attention based video stream with the proposed audio stream. To improve G-FBP, an adaptive strategy (AG-FBP) is devised that dynamically calculates the fusion weights of the two modalities from the emotion-related representation vectors produced by the attention mechanism of each modality. Finally, to fully utilize the local emotion information, adaptive and multi-level FBP (AM-FBP) is introduced on top of AG-FBP by combining both global-trunk and intra-trunk data in one recording. Tested on the IEMOCAP corpus for speech emotion recognition with the audio stream only, the new FCN method outperforms the state-of-the-art results with an accuracy of 71.40%. Moreover, validated on the AFEW database of the EmotiW2019 sub-challenge and on the IEMOCAP corpus for audio-visual emotion recognition, the proposed AM-FBP approach achieves the best test-set accuracies of 63.09% and 75.49%, respectively.
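The factorized bilinear pooling step underlying the fusion methods named above can be illustrated with a minimal sketch. It assumes the commonly used FBP formulation (project each modality, multiply element-wise, sum-pool over groups of size k, then apply power and L2 normalization); the abstract does not spell out these details. The function names, dimensions, and random projection matrices are hypothetical, and the adaptive weighting (AG-FBP) and multi-level combination (AM-FBP) are not modeled here.

```python
import math
import random


def project(matrix, vector):
    """Return matrix^T @ vector, where matrix has shape (len(vector), out_dim)."""
    out_dim = len(matrix[0])
    return [
        sum(matrix[i][j] * vector[i] for i in range(len(vector)))
        for j in range(out_dim)
    ]


def fbp(audio, video, U, V, k):
    """Fuse two modality vectors with factorized bilinear pooling (assumed form).

    U and V project audio and video into a shared space of size k * o.
    The element-wise product is sum-pooled in non-overlapping groups of k,
    giving an o-dimensional fused vector.
    """
    proj_audio = project(U, audio)
    proj_video = project(V, video)
    joint = [a * v for a, v in zip(proj_audio, proj_video)]
    pooled = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    # Power normalization: signed square root of each pooled value.
    powered = [math.copysign(math.sqrt(abs(x)), x) for x in pooled]
    # L2 normalization; guard against an all-zero vector.
    norm = math.sqrt(sum(x * x for x in powered))
    if norm == 0.0:
        return powered
    return [x / norm for x in powered]


# Hypothetical sizes: 6-dim audio vector, 5-dim video vector, k = 3, o = 4.
random.seed(0)
k, o = 3, 4
audio_vec = [random.gauss(0, 1) for _ in range(6)]
video_vec = [random.gauss(0, 1) for _ in range(5)]
U = [[random.gauss(0, 1) for _ in range(k * o)] for _ in range(6)]
V = [[random.gauss(0, 1) for _ in range(k * o)] for _ in range(5)]
z = fbp(audio_vec, video_vec, U, V, k)
```

In a trained network U and V would be learned parameters, and the inputs would be the attention-weighted representation vectors of the audio and video streams rather than random values.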
