A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition

Jinchao Li, Xixin Wu, Kaitao Song, Dongsheng Li, Xunying Liu, Helen Meng

2023-03-15

Abstract: As a common way of emotion signaling via non-linguistic vocalizations, vocal burst (VB) plays an important role in daily social interaction. Understanding and modeling human vocal bursts are indispensable for developing robust and general artificial intelligence. Exploring computational approaches for understanding vocal bursts is attracting increasing research attention. In this work, we propose a hierarchical framework, based on chain regression models, for affective recognition from VBs, that explicitly considers multiple relationships: (i) between emotional states and diverse cultures; (ii) between low-dimensional (arousal & valence) and high-dimensional (10 emotion classes) emotion spaces; and (iii) between various emotion classes within the high-dimensional space. To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules. The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks. Experimental results based on the ACII Challenge 2022 dataset demonstrate the superior performance of the proposed system and the effectiveness of considering multiple relationships using hierarchical regression chain models.
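
The abstract names layer-wise and temporal aggregation of SSL representations without giving details. The following is a minimal sketch of one common way to do this, not the authors' implementation: a softmax-weighted sum over encoder layers followed by mean pooling over time. The function names, the mean-pooling choice, and the shapes (13 layers, 50 frames, 768 dimensions) are illustrative assumptions; in a trained system the layer weights would be learned rather than fixed.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def aggregate(hidden_states, layer_logits):
    """Collapse SSL hidden states into one utterance-level vector.

    hidden_states: array of shape (num_layers, num_frames, dim)
    layer_logits:  array of shape (num_layers,), one score per layer
                   (assumed learnable in a real model)
    """
    weights = softmax(layer_logits)                       # (num_layers,)
    # Layer-wise aggregation: weighted sum over the layer axis.
    fused = np.tensordot(weights, hidden_states, axes=1)  # (num_frames, dim)
    # Temporal aggregation: mean pooling over frames (an assumption).
    return fused.mean(axis=0)                             # (dim,)

# Synthetic stand-in for the hidden states of an SSL speech encoder.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(13, 50, 768))
layer_logits = np.zeros(13)  # equal scores give uniform layer weights
embedding = aggregate(hidden_states, layer_logits)
```

With equal layer scores the result reduces to a plain average over layers and frames; unequal scores let the model emphasize the layers that carry the most affective information.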

Audio and Speech Processing, Machine Learning, Sound, Signal Processing

What problem does this paper attempt to address?

The paper addresses emotion recognition from non-verbal vocalizations (vocal bursts). The researchers propose a hierarchical framework based on chain regression models that explicitly considers three relationships:

1. **The relationship between emotional states and different cultures**: Emotional expression may vary across cultural backgrounds, and these differences need to be modeled.
2. **The relationship between the low-dimensional emotion space (arousal and valence) and the high-dimensional emotion space (10 emotion classes)**: The question is how to infer the high-dimensional emotion classes from the low-dimensional emotion dimensions.
3. **The relationship between the emotion classes within the high-dimensional emotion space**: The emotion classes are correlated with one another, and these correlations need to be modeled.

In addition, to address data sparsity, the researchers use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules. The framework ranked first in the "TWO" and "CULTURE" tasks of the ACII Affective Vocal Burst (A-VB) Challenge 2022. Experimental results on the challenge dataset support the effectiveness of modeling these relationships with hierarchical regression chain models.
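
The chained structure described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's model: each stage is a closed-form ridge regressor, the first stage predicts the 2 low-dimensional targets, and the second stage predicts the 10 high-dimensional targets from the input features concatenated with the first stage's predictions. The class name `RegressionChain`, the helper functions, the synthetic data, and the omission of a culture stage are all hypothetical simplifications.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1e-3):
    """Closed-form ridge regression with a bias column; returns weights."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def apply_linear(X, W):
    """Apply weights produced by fit_ridge (bias column appended here)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W

class RegressionChain:
    """Each stage sees the original features plus all earlier predictions."""

    def __init__(self, alpha=1e-3):
        self.alpha = alpha
        self.weights = []

    def fit(self, X, targets):
        feats = X
        self.weights = []
        for Y in targets:  # targets are ordered from first to last stage
            W = fit_ridge(feats, Y, self.alpha)
            self.weights.append(W)
            # Feed this stage's predictions forward to the next stage.
            feats = np.hstack([feats, apply_linear(feats, W)])
        return self

    def predict(self, X):
        feats = X
        outputs = []
        for W in self.weights:
            P = apply_linear(feats, W)
            outputs.append(P)
            feats = np.hstack([feats, P])
        return outputs

# Synthetic data: 200 utterance embeddings of dimension 16 (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
low = X @ rng.normal(size=(16, 2)) + 0.1 * rng.normal(size=(200, 2))
high = np.hstack([X, low]) @ rng.normal(size=(18, 10))

chain = RegressionChain().fit(X, [low, high])
outputs = chain.predict(X)  # [arousal/valence estimates, 10-class estimates]
```

Because later stages receive earlier predictions as extra inputs, dependencies between the output spaces are modeled explicitly instead of being left to a shared encoder; a culture-specific stage could be appended to the `targets` list in the same way.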

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

The ACII 2022 Affective Vocal Bursts Workshop & Competition: Understanding a critically understudied modality of emotional expression

Proceedings of the ACII Affective Vocal Bursts Workshop and Competition 2022 (A-VB): Understanding a critically understudied modality of emotional expression

Deep learning reveals what vocal bursts express in different cultures

Multi-scale Temporal Modeling for Dimensional Emotion Recognition in Video

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

An Effective Ensemble Learning Framework for Affective Behaviour Analysis

Prior Aided Streaming Network for Multi-task Affective Recognition at the 2nd ABAW2 Competition

Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation

End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild

A Multimodal Deep Regression Bayesian Network For Affective Video Content Analyses

Knowledge-Augmented Multimodal Deep Regression Bayesian Networks for Emotion Video Tagging

Continuous Multimodal Emotion Prediction Based on Long Short Term Memory Recurrent Neural Network

Affective Behaviour Analysis via Integrating Multi-Modal Knowledge

Learning Utterance-level Representations with Label Smoothing for Speech Emotion Recognition

Continuous Emotion Recognition with Audio-visual Leader-follower Attentive Fusion

A New Network Structure for Speech Emotion Recognition Research

An Ensemble Approach for Facial Expression Analysis in Video