A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition

Jinchao Li, Xixin Wu, Kaitao Song, Dongsheng Li, Xunying Liu, Helen Meng
2023-03-15
Abstract: As a common way of emotion signaling via non-linguistic vocalizations, vocal burst (VB) plays an important role in daily social interaction. Understanding and modeling human vocal bursts are indispensable for developing robust and general artificial intelligence. Exploring computational approaches for understanding vocal bursts is attracting increasing research attention. In this work, we propose a hierarchical framework, based on chain regression models, for affective recognition from VBs, that explicitly considers multiple relationships: (i) between emotional states and diverse cultures; (ii) between low-dimensional (arousal & valence) and high-dimensional (10 emotion classes) emotion spaces; and (iii) between various emotion classes within the high-dimensional space. To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules. The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks. Experimental results based on the ACII Challenge 2022 dataset demonstrate the superior performance of the proposed system and the effectiveness of considering multiple relationships using hierarchical regression chain models.
Audio and Speech Processing, Machine Learning, Sound, Signal Processing
What problem does this paper attempt to address?
The paper addresses the problem of recognizing emotion from non-linguistic vocalizations, known as vocal bursts. The researchers propose a hierarchical framework, based on chain regression models, that explicitly considers three relationships:

1. **The relationship between emotional states and different cultures**: Emotional expression may vary across cultural backgrounds, and these differences need to be modeled.
2. **The relationship between the low-dimensional emotion space (arousal and valence) and the high-dimensional emotion space (10 emotion classes)**: The framework considers how the low-dimensional description relates to, and can help infer, the high-dimensional emotion classes.
3. **The relationship between the emotion classes within the high-dimensional space**: Different emotion classes are correlated, and these correlations need to be modeled.

To address data sparsity, the researchers also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules. The proposed systems ranked first in the "TWO" and "CULTURE" tasks of the ACII Affective Vocal Burst (A-VB) Challenge 2022. Experimental results on the challenge dataset show that the system performs strongly and that modeling these relationships with hierarchical regression chains is effective.
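The two ideas summarized above, aggregating SSL features over layers and time and then chaining regressors from the low-dimensional to the high-dimensional emotion space, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the linear least-squares models, the softmax layer weighting, the mean pooling over time, the chain order (arousal/valence first, then the 10 emotion classes), and all function names are assumptions chosen for illustration.

```python
import numpy as np

def aggregate(layers, layer_logits):
    """Fuse SSL features of shape (num_layers, time, dim) into one (dim,) vector.

    Assumption: softmax-weighted sum over layers, then mean pooling over time.
    """
    w = np.exp(layer_logits - layer_logits.max())
    w = w / w.sum()                           # softmax weights over layers
    fused = np.tensordot(w, layers, axes=1)   # (time, dim)
    return fused.mean(axis=0)                 # temporal mean pooling

def _with_bias(X):
    # Append a constant column so the linear model has an intercept.
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_linear(X, Y):
    # Ordinary least squares; stands in for whatever regressor is used.
    W, *_ = np.linalg.lstsq(_with_bias(X), Y, rcond=None)
    return W

def predict_linear(W, X):
    return _with_bias(X) @ W

def fit_chain(X, Y_low, Y_high):
    """Stage 1: X -> low-dim targets. Stage 2: [X, low-dim] -> high-dim targets.

    Stage 2 is trained on the ground-truth low-dim labels (teacher forcing).
    """
    W_low = fit_linear(X, Y_low)
    W_high = fit_linear(np.hstack([X, Y_low]), Y_high)
    return W_low, W_high

def predict_chain(W_low, W_high, X):
    # At inference the predicted low-dim values replace the true labels.
    low_hat = predict_linear(W_low, X)
    high_hat = predict_linear(W_high, np.hstack([X, low_hat]))
    return low_hat, high_hat

# Synthetic demo: 50 utterances, 8-dim embeddings, 2 low-dim and 10 high-dim targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Y_low = X @ rng.normal(size=(8, 2)) + 0.1 * rng.normal(size=(50, 2))
Y_high = np.hstack([X, Y_low]) @ rng.normal(size=(10, 10))
W_low, W_high = fit_chain(X, Y_low, Y_high)
low_hat, high_hat = predict_chain(W_low, W_high, X)
```

In a fuller chain, further links (for example culture-specific targets, or one emotion class conditioned on previously predicted classes) would be added in the same way, each stage receiving the outputs of the stages before it.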