MPCSAN: Multi-Head Parallel Channel-Spatial Attention Network for Facial Expression Recognition in the Wild
Weijun Gong,Yurong Qian,Yingying Fan
DOI: https://doi.org/10.1007/s00521-022-08040-4
2022-01-01
Neural Computing and Applications
Abstract:Facial expression recognition (FER) in the wild is an exceedingly challenging task in computer vision due to subtle differences, poses, occlusions, label bias, and other uncontrollable factors. CNN-based deep learning networks are susceptible to the above factors, resulting in the inability to obtain highly discriminative features on the key regions of expressions, and most methods of learning in a single feature space may not fully capture the core regions of interest. These will directly affect the solution to the problem of intra-class variability and inter-class similarity of expressions, which ultimately affects the recognition performance. Therefore, we propose an effective multi-head parallel channel-spatial attention network (MPCSAN) for FER in the wild, which consists of a feature aggregation network (FAN), a multi-head parallel attention network (MPAN), and an expression forecasting network (EFN). First, the lightweight FAN network extracts basic expression features while optimizing intra-class and inter-class distribution. Then, MPAN forms a multi-attention subspace by a multi-head parallel channel-space attention fusion design and focuses on more accurate and comprehensive expression regions of interest by minimizing duplicate attention during subspace fusion. Finally, EFN performs the final expression classification under the optimization of label softening, which further improves the robustness problem caused by label bias. Our proposed method is evaluated on the three most widely used wild expression datasets (RAF-DB, FERPlus, and AffectNet). The extensive experimental results demonstrate that our method outperforms several current state-of-the-art methods, achieving accuracies of 90.16% on RAF-DB, 89.91% on FERPlus, and 61.58% on AffectNet, respectively. Occlusion and pose variation datasets evaluation and cross-dataset assessment further demonstrate the good comprehensive performance of our method.