Abstract: Exploring a proper way to fuse multiple speech features for cross-corpus speech emotion recognition is crucial, as different speech features can provide complementary cues that reflect a speaker's emotional state. While most previous approaches extract only a single speech feature for emotion recognition, existing fusion methods such as concatenation, parallel connection, and splicing ignore the heterogeneous patterns in the interactions between features, which limits the performance of existing systems. In this paper, we propose a novel graph-based fusion method that explicitly models the relationship between every pair of speech features. Specifically, we propose a multi-dimensional edge feature learning strategy called the Graph-based multi-Feature fusion method for speech emotion recognition. It represents each speech feature as a node and learns multi-dimensional edge features to explicitly describe the relationship between each feature-feature pair in the context of emotion recognition. In this way, the learned multi-dimensional edge features encode speech feature-level information from both the vertex and edge dimensions. Our approach consists of three modules: an Audio Feature Generation (AFG) module, an Audio-Feature Multi-dimensional Edge Feature (AMEF) module, and a Speech Emotion Recognition (SER) module. The proposed method yields satisfactory results on the SEWA dataset and outperforms the baseline of the AVEC 2019 Workshop and Challenge. We use data from two cultures in the SEWA dataset, German and Hungarian, as our training and validation sets; the CCC scores for German improve by 17.28% for arousal and 7.93% for liking. Our method also shows a 13% improvement over alternative fusion techniques, including those that employ one-dimensional edge-based feature fusion.
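To make the node-and-edge idea concrete, the following is a minimal NumPy sketch of fusion with multi-dimensional edge features. It is an illustration under stated assumptions, not the authors' AMEF module: the function name `fuse_features`, the weight matrices `w_edge` and `w_gate`, the tanh/sigmoid choices, and the mean pooling are all hypothetical, and the weights are random rather than learned.

```python
import numpy as np


def fuse_features(node_feats, w_edge, w_gate):
    """Illustrative graph fusion with vector-valued (multi-dimensional) edges.

    node_feats: (N, D) array, one row per speech feature set (one graph node each).
    w_edge:     (2*D, K) hypothetical weights mapping a node pair to a K-dim edge.
    w_gate:     (K, D) hypothetical weights turning an edge vector into a
                per-channel gate on the neighbour's features.
    Returns (edges, updated_nodes, fused_vector).
    """
    n, _ = node_feats.shape

    # Build every ordered pair [h_i, h_j] -> shape (N, N, 2*D).
    h_i = np.repeat(node_feats[:, None, :], n, axis=1)
    h_j = np.repeat(node_feats[None, :, :], n, axis=0)
    pairs = np.concatenate([h_i, h_j], axis=-1)

    # Each edge is a K-dimensional vector rather than a single scalar weight.
    edges = np.tanh(pairs @ w_edge)                      # (N, N, K)

    # Edge vectors gate each channel of the neighbour's features (sigmoid).
    gates = 1.0 / (1.0 + np.exp(-(edges @ w_gate)))      # (N, N, D)

    # Exclude self-loops, then average messages from the other nodes.
    mask = (1.0 - np.eye(n))[:, :, None]
    messages = (gates * h_j * mask).sum(axis=1) / max(n - 1, 1)

    updated = node_feats + messages                      # residual node update
    fused = updated.mean(axis=0)                         # pooled representation
    return edges, updated, fused


# Usage with three hypothetical feature sets already projected to D = 8.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))
edges, updated, fused = fuse_features(
    feats,
    w_edge=rng.normal(size=(16, 4)),
    w_gate=rng.normal(size=(4, 8)),
)
print(edges.shape, updated.shape, fused.shape)
```

In this sketch a scalar-edge (one-dimensional) variant would correspond to K = 1 with a single gate value per pair; the vector-valued edge lets each feature pair modulate individual channels differently. In a real system the pooled vector would feed an emotion regressor, and all weights would be trained end to end.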

Geometrical and Pixel Based Lip Feature Fusion in Speech Synthesis System Driven by Visual-speech

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion.

A Novel Lip Descriptor for Audio-Visual Keyword Spotting Based on Adaptive Decision Fusion

Audio-Visual Speech Recognition Using A Two-Step Feature Fusion Strategy.

Visual Features Extracting & Selecting For Lipreading

Regression Based Landmark Estimation and Multi-Feature Fusion for Visual Speech Recognition.

Visual Speech Recognition with Lightweight Psychologically Motivated Gabor Features

Incorporating Lip Features into Audio-Visual Multi-Speaker DOA Estimation by Gated Fusion

Lip Assistant: Visualize Speech For Hearing Impaired People In Multimedia Services

Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading

Fusion of visual and acoustic signals for command-word recognition

Graph-based multi-Feature fusion method for speech emotion recognition

Improving Speech Recognition Performance in Noisy Environments by Enhancing Lip Reading Accuracy

Audio-Visual System for Robust Speaker Recognition.

LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging

Mutual Information Maximization for Effective Lip Reading

Fusion of deep shallow features and models for speaker recognition

A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model