Inceptr: micro-expression recognition integrating inception-CBAM and vision transformer

Haoliang Zhou,Shucheng Huang,Yuqiao Xu
DOI: https://doi.org/10.1007/s00530-023-01164-0
IF: 3.9
2023-09-01
Multimedia Systems
Abstract:Micro-Expressions (MEs) are the instantaneous and subtle facial movement that conveys crucial emotional information. However, traditional neural networks face difficulties in accurately capturing the delicate features of MEs due to the limited amount of available data. To address this issue, a dual-branch attention network is proposed for ME recognition, called IncepTR, which can capture attention-aware local and global representations. The network takes optical flow features as input and performs feature extraction using a dual-branch network. First, the Inception model based on the Convolutional Block Attention Module (CBAM) attention mechanism is maintained for multi-scale local feature extraction. Second, the Vision Transformer (ViT) is employed to capture subtle motion features and robustly model global relationships among multiple local patches. Additionally, to enhance the rich relationships between different local patches in ViT, Multi-head Self-Attention Dropping (MSAD) is introduced to drop an attention map randomly, effectively preventing overfitting to specific regions. Finally, the two types of features could be used to learn ME representations effectively through similarity comparison and feature fusion. With such combination, the model is forced to capture the most discriminative multi-scale local and global features while reducing the influence of affective-irrelevant features. Extensive experiments show that the proposed IncepTR achieves UF1 and UAR of 0.753 and 0.746 on the composite dataset MEGC2019-CD, demonstrating better or competitive performance compared to existing state-of-the-art methods for ME recognition.
computer science, information systems, theory & methods
What problem does this paper attempt to address?