DST: Deformable Speech Transformer for Emotion Recognition

Weidong Chen, Xiaofen Xing, Xiangmin Xu, Jianxin Pang, Lan Du
DOI: https://doi.org/10.1109/icassp49357.2023.10096966
2023-01-01
Abstract: Enabled by multi-head self-attention, the Transformer has exhibited remarkable results in speech emotion recognition (SER). Compared to the original full attention mechanism, window-based attention is more effective in learning fine-grained features while greatly reducing model redundancy. However, emotional cues are present in a multi-granularity manner, such that a pre-defined fixed window can severely degrade model flexibility. In addition, it is difficult to obtain the optimal window settings manually. In this paper, we propose a Deformable Speech Transformer, named DST, for the SER task. DST determines the usage of window sizes conditioned on input speech via a light-weight decision network. Meanwhile, data-dependent offsets derived from acoustic features are utilized to adjust the positions of the attention windows, allowing DST to adaptively discover and attend to the valuable information embedded in the speech. Extensive experiments on IEMOCAP and MELD demonstrate the superiority of DST.
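The idea of choosing a window size per input and shifting the window by a data-dependent offset can be illustrated with a toy sketch. This is not the DST architecture from the paper: the function name `deformable_window_attention`, the argmax selection over `size_scores` (a stand-in for the decision network output), and the integer `offsets` (a stand-in for the learned offsets) are all assumptions made for illustration, and queries, keys, and values are simply the raw frames.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_window_attention(frames, window_sizes, size_scores, offsets):
    """Toy windowed attention with per-frame window size and position.

    frames:       list of feature vectors, one per speech frame
    window_sizes: candidate window sizes (hypothetical, e.g. [1, 3])
    size_scores:  per-frame scores over the candidates; stands in for
                  the output of a decision network (assumption)
    offsets:      per-frame integer shift of the window centre; stands
                  in for data-dependent offsets (assumption)
    """
    n = len(frames)
    outputs = []
    for i, query in enumerate(frames):
        # Pick the candidate window size with the highest score.
        best = max(range(len(window_sizes)), key=lambda k: size_scores[i][k])
        size = window_sizes[best]

        # Shift the window centre by the offset and clamp to the sequence.
        centre = min(max(i + offsets[i], 0), n - 1)
        start = max(centre - size // 2, 0)
        end = min(start + size, n)
        keys = frames[start:end]

        # Scaled dot-product attention restricted to the selected window.
        scale = math.sqrt(len(query))
        weights = softmax(
            [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        )
        outputs.append(
            [sum(w * key[d] for w, key in zip(weights, keys))
             for d in range(len(query))]
        )
    return outputs
```

With a window of size 1 and zero offset, each frame attends only to itself; a non-zero offset moves that window to a neighbouring frame. A trained model would need differentiable selection and fractional offsets, which this sketch leaves out.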