Abstract: Facial Action Units (AUs) are precise descriptors of facial expressions and reveal an individual's psychological and mental state. AU detection therefore plays an important role in facial expression recognition. Existing methods often focus on extracting intra-frame information while paying less attention to inter-frame feature changes. To address this issue, this paper proposes a self-attention spatiotemporal fusion method (SAtt-STPN). In this method, a feature extractor (AFE) is designed to extract uniform feature information from both strongly and weakly correlated regions. A spatiotemporal perception (STP) module captures temporal information for each AU through mutually driven independent branches in the spatial and temporal dimensions, while a graph convolutional network is adopted to model intra-frame AU relationships (ARM). Finally, the intra-frame and inter-frame information is weighted and fused for classification. Experimental results on two public datasets (BP4D and DISFA) show that the proposed SAtt-STPN outperforms state-of-the-art methods in facial AU detection.
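
The final fusion step can be illustrated with a minimal sketch. The abstract does not specify how the weights are obtained or how the classifier is built, so everything below is an assumption made for illustration: the function name `fuse_au_logits`, the single scalar weight `alpha`, and the per-AU sigmoid output (a common choice for multi-label AU detection) are hypothetical and are not taken from the proposed method.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_au_logits(
    intra: Sequence[float],
    inter: Sequence[float],
    alpha: float = 0.5,
) -> List[float]:
    """Weighted fusion of per-AU scores from two branches.

    intra: one logit per AU from an intra-frame (spatial) branch.
    inter: one logit per AU from an inter-frame (temporal) branch.
    alpha: hypothetical scalar weight on the intra-frame branch;
           the inter-frame branch receives 1 - alpha.

    Returns one probability per AU (multi-label, so each AU is
    scored independently with a sigmoid).
    """
    if len(intra) != len(inter):
        raise ValueError("both branches must score the same number of AUs")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        sigmoid(alpha * a + (1.0 - alpha) * b)
        for a, b in zip(intra, inter)
    ]
```

With `alpha = 1.0` the sketch reduces to the intra-frame branch alone, and with `alpha = 0.0` to the inter-frame branch alone. A learned or per-AU weight could replace the scalar, but the abstract gives no detail on this point.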

Facial Action Unit Recognition Based on Self-Attention Spatiotemporal Fusion