AVT: AU-Assisted Visual Transformer for Facial Expression Recognition

Rijin Jin, Sirui Zhao, Zhongkai Hao, Yifan Xu, Tong Xu, Enhong Chen
DOI: https://doi.org/10.1109/icip46576.2022.9897960
2022-01-01
Abstract: Facial expression recognition (FER) has made significant progress over the past few years, but overcoming high inter-class similarity and large intra-class differences remains challenging. To address this problem, we propose a novel FER framework, the AU-assisted Visual Transformer (AVT), which incorporates facial action unit (AU) information into a Visual Transformer. AVT consists of three modules: a Local Feature Extraction (LFE) module, a Global Relationship Modeling (GRM) module, and an AU Fusion Module (AFM). The LFE module extracts local facial expression features with a deep convolutional neural network. The GRM module is a multi-layer Transformer encoder that captures the relations between local facial regions and builds a global understanding of the face. The AFM introduces fine-grained AU features and fuses them with the expression features for final classification. Extensive experiments on the RAF-DB and FERPlus datasets show that AVT achieves results competitive with previous state-of-the-art methods, demonstrating the effectiveness of our approach.
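The three-module data flow described in the abstract can be illustrated with a toy, standard-library-only sketch. It is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: a linear patch embedding stands in for the deep CNN in LFE, a single self-attention layer with identity projections stands in for the multi-layer Transformer encoder in GRM, and concatenation followed by a linear classifier stands in for the AFM (the abstract does not say how the fusion is performed). All function names, dimensions, and the class count are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def local_features(image, patch, W_embed):
    """LFE stand-in (assumption): embed non-overlapping patches linearly.
    The paper uses a deep CNN here instead."""
    h, w = len(image), len(image[0])
    tokens = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            flat = [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            tokens.append(matvec(W_embed, flat))
    return tokens


def self_attention(tokens):
    """GRM stand-in (assumption): one single-head self-attention layer with
    identity query/key/value projections and a residual connection."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        mixed = [sum(wt * v[i] for wt, v in zip(weights, tokens)) for i in range(d)]
        out.append([a + b for a, b in zip(q, mixed)])
    return out


def fuse_and_classify(tokens, au_feature, W_cls):
    """AFM stand-in (assumption): mean-pool the tokens, concatenate the AU
    feature vector, then apply a linear classifier and softmax."""
    d = len(tokens[0])
    pooled = [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]
    fused = pooled + au_feature  # list concatenation
    return softmax(matvec(W_cls, fused))


def avt_forward(image, au_feature, params, patch=4):
    """Chain the three stand-in modules: LFE -> GRM -> AFM."""
    tokens = local_features(image, patch, params["embed"])
    tokens = self_attention(tokens)
    return fuse_and_classify(tokens, au_feature, params["cls"])


# Hypothetical sizes: 16x16 grayscale input, 4x4 patches, 7 expression classes.
rng = random.Random(0)
EMBED_DIM, AU_DIM, NUM_CLASSES, PATCH = 8, 5, 7, 4
params = {
    "embed": random_matrix(EMBED_DIM, PATCH * PATCH, rng),
    "cls": random_matrix(NUM_CLASSES, EMBED_DIM + AU_DIM, rng),
}
image = [[rng.random() for _ in range(16)] for _ in range(16)]
au_feature = [rng.random() for _ in range(AU_DIM)]
probs = avt_forward(image, au_feature, params, patch=PATCH)
```

Here `probs` is a probability distribution over the hypothetical expression classes. With untrained random weights the values carry no meaning; the sketch only shows how local tokens, global attention, and the AU feature vector combine into a single prediction.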