Action Unit Detection by Exploiting Spatial-Temporal and Label-Wise Attention with Transformer

Lingfeng Wang,Jin Qi,Jian Cheng,Kenji Suzuki
DOI: https://doi.org/10.1109/cvprw56347.2022.00276
2022-01-01
Abstract:The facial action units (FAU) defined by the Facial Action Coding System (FACS) has become an important approach of facial expression analysis. Most work on FAU detection only considers the spatial-temporal feature and ignores the label-wise AU correlation. In practice, the strong relationships between facial AUs can help AU detection. We proposed a transformer based FAU detection model by leverage both the local spatial-temporal features and label-wise FAU correlation. To be specific, we firstly designed a visual spatial-temporal transformer based model and a convolution based audio model to extract action unit specific features. Secondly, inspired by the relationship between FAUs, we proposed a transformer based correlation module to learn correlation between AUs. The action unit specific features from aural and visual models are further aggregated in the correlation modules to produce per-frame prediction of 12 AUs. Our model was trained on Aff-Wild2 dataset of the ABAW3 challenge and achieved state of art performance in the FAU task, which verified that the effectiveness of the proposed network.
What problem does this paper attempt to address?