HiT-MST: Dynamic Facial Expression Recognition with Hierarchical Transformers and Multi-Scale Spatiotemporal Aggregation.

Xiaohan Xia, Dongmei Jiang
DOI: https://doi.org/10.1016/j.ins.2023.119301
IF: 8.1
2023-01-01
Information Sciences
Abstract: Facial expression recognition rarely explores complex spatiotemporal dependencies among facial regions at different scales. This paper proposes a transformer-based three-layer hierarchical architecture that incorporates multi-scale spatiotemporal aggregation for dynamic facial expression recognition. The hierarchical structure consists of bottom-to-top layers, each comprising transformer encoders with local self-attention mechanisms. These encoders gradually expand their receptive fields through hierarchical spatiotemporal aggregation, enabling the modeling of spatiotemporal context dependencies among facial regions at different scales and across consecutive frames. Consequently, the bottom-to-top layers correspond to learning the fine-grained, coarse-grained, and global facial representations. To evaluate the performance of our proposed framework, we conducted extensive experiments on four public datasets. The comparison results demonstrate that our proposed framework outperforms the state-of-the-art, with accuracies of 79.09%, 62.19%, 64.85%, and 59.79% on the RML, eNTERFACE'05, RAVDESS, and AFEW datasets, respectively. Ablation experiments, statistical significance tests, and visualization analyses indicate that the proposed framework successfully learns emotionally salient facial representations.
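The abstract's idea of local self-attention with a receptive field that grows through spatiotemporal aggregation can be illustrated with a small NumPy sketch. This is not the authors' implementation. The following are all assumptions made only for illustration: the function names, the window and pooling sizes, mean pooling as the aggregation step, identity query/key/value projections, and the 8 x 8 x 8 token grid. The abstract specifies none of these.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def local_self_attention(tokens, window):
    """Self-attention restricted to non-overlapping (time, height, width) windows.

    tokens: array of shape (T, H, W, D); window: (wt, wh, ww) dividing (T, H, W).
    Identity projections are used for brevity (hypothetical simplification).
    """
    T, H, W, D = tokens.shape
    wt, wh, ww = window
    # Split every axis into (number of windows, window size).
    x = tokens.reshape(T // wt, wt, H // wh, wh, W // ww, ww, D)
    # Group the window indices first, then flatten each window into a sequence.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, wt * wh * ww, D)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(D)
    out = softmax(scores) @ x
    # Undo the windowing to recover the (T, H, W, D) layout.
    out = out.reshape(T // wt, H // wh, W // ww, wt, wh, ww, D)
    return out.transpose(0, 3, 1, 4, 2, 5, 6).reshape(T, H, W, D)


def aggregate(tokens, factor):
    """Merge neighbouring tokens by mean pooling (assumed aggregation rule)."""
    T, H, W, D = tokens.shape
    ft, fh, fw = factor
    x = tokens.reshape(T // ft, ft, H // fh, fh, W // fw, fw, D)
    return x.mean(axis=(1, 3, 5))


def hierarchical_encode(tokens):
    """Three levels: fine-grained, coarse-grained, and global representations."""
    fine = local_self_attention(tokens, window=(2, 2, 2))
    coarse = local_self_attention(aggregate(fine, (2, 2, 2)), window=(2, 2, 2))
    top_in = aggregate(coarse, (2, 2, 2))
    # At the top level a single window covers every remaining token.
    glob = local_self_attention(top_in, window=top_in.shape[:3])
    return fine, coarse, glob


rng = np.random.default_rng(0)
clip_tokens = rng.standard_normal((8, 8, 8, 16))  # (frames, rows, cols, dim)
fine, coarse, glob = hierarchical_encode(clip_tokens)
print(fine.shape, coarse.shape, glob.shape)
```

Each aggregation step halves the token grid along time and space, so a window of fixed size at the next level covers a larger facial area and more frames. That is the sense in which the receptive field expands from the bottom layer to the top.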