Graph Based Emotion Recognition with Attention Pooling for Variable-Length Utterances

Jiawang Liu, Haoxiang Wang, Mingze Sun, Yao Wei
DOI: https://doi.org/10.1016/j.neucom.2022.05.007
IF: 6
2022-01-01
Neurocomputing
Abstract: Previous speech emotion recognition (SER) methods normally handle variable-length utterance inputs by padding shorter utterances or clipping longer ones to a common length, which may introduce invalid information or discard useful emotional segments. To address this issue, we cast the SER problem as a graph classification task, transforming variable-length utterances into graphs to avoid padding or cutting. In our approach, the frames (short windowed segments) of an utterance are represented as nodes in a graph. Acoustic features extracted from the frames are treated as node feature vectors, and nodes are connected according to their temporal relationship. Different graph convolutional networks (GCNs) are explored for node/frame embedding learning, and several graph pooling methods are compared for obtaining a graph/utterance-level emotional representation from the node embeddings. Extensive experiments with different GCN components and pooling mechanisms are conducted on the IEMOCAP and MSP-IMPRO datasets. The experimental results show that a combination of GraphSAGE with multi-head attention pooling (MHAPool) achieves the best weighted accuracy (WA) and comparable unweighted accuracy (UA) on both datasets compared with other state-of-the-art SER models, which demonstrates the effectiveness of the proposed graph-based network for the SER task.
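The pipeline described in the abstract (frames as nodes, temporal edges, neighbourhood aggregation, attention pooling) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple chain graph in which each frame is linked to its immediate temporal neighbours, it uses a parameter-free GraphSAGE-style mean aggregation with no learned weights, and the attention queries are fixed vectors supplied by the caller rather than trained parameters. All function names are hypothetical.

```python
import math

def build_chain_graph(num_frames):
    """Adjacency list linking each frame to frames t-1 and t+1 (assumed edge rule)."""
    return [[j for j in (t - 1, t + 1) if 0 <= j < num_frames]
            for t in range(num_frames)]

def sage_mean_layer(feats, adj):
    """GraphSAGE-style step without learned weights: concatenate each node's
    features with the mean of its neighbours' features."""
    out = []
    for t, nbrs in enumerate(adj):
        dim = len(feats[t])
        if nbrs:
            mean = [sum(feats[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        else:
            mean = [0.0] * dim  # isolated node (single-frame utterance)
        out.append(list(feats[t]) + mean)
    return out

def attention_pool(node_embs, query):
    """One attention head: softmax over dot-product scores, then a weighted sum."""
    scores = [sum(q * x for q, x in zip(query, emb)) for emb in node_embs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_embs[0])
    return [sum(w * emb[d] for w, emb in zip(weights, node_embs)) for d in range(dim)]

def multi_head_attention_pool(node_embs, queries):
    """Concatenate the pooled vectors of several heads (one query per head)."""
    pooled = []
    for query in queries:
        pooled.extend(attention_pool(node_embs, query))
    return pooled

def embed_utterance(frames, queries):
    """Map an utterance of any length to a fixed-size vector, with no padding."""
    adj = build_chain_graph(len(frames))
    node_embs = sage_mean_layer(frames, adj)
    return multi_head_attention_pool(node_embs, queries)

# Two utterances of different lengths, 2-dimensional frame features (toy values).
short_utt = [[1.0, 0.0], [3.0, 0.0], [2.0, 1.0]]
long_utt = [[0.5, 0.1], [0.4, 0.2], [0.9, 0.3], [0.1, 0.8], [0.7, 0.7]]
heads = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # two fixed queries

short_vec = embed_utterance(short_utt, heads)
long_vec = embed_utterance(long_utt, heads)
```

Both utterances yield a vector whose length depends only on the feature dimension and the number of heads, not on the number of frames, which is the property that removes the need for padding or clipping. In a trained model the aggregation and the queries would carry learned parameters, and the pooled vector would feed an emotion classifier.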