Multi-level Feature Fusion Capsule Network with Self-Attention for Facial Expression Recognition
Zhiji Huang, Songsen Yu, Jun Liang
DOI: https://doi.org/10.1117/1.jei.32.2.023038
IF: 0.829
2023-01-01
Journal of Electronic Imaging
Abstract: Unlike generic image classification, facial expression classification is a fine-grained task: multiple expressions share inherently similar underlying facial appearances, so the differences between expression classes can be small. Unlike lab-controlled data, facial expressions from natural scenes take many forms for the same expression because of the diversity of subjects and the complexity of real-world conditions; as a result, samples within the same class may differ greatly. Moreover, because expressions differ only subtly and are displayed simultaneously through several facial regions, the features of multiple key regions must be encoded to form high-order interactive information. To address these problems, we design an enhanced capsule network based on a multi-level feature fusion attention mechanism, which comprises four critical components: a multi-level feature extraction module (MFEM), a multi-level attention module (MAM), a multi-level capsule attention fusion module (MCAFM), and a reconstruction module (RM). The MFEM collects low-level, middle-level, and high-level features from the input image, thereby lowering the high-level convolutional layer's susceptibility to blurred images and pose variation. The MAM directs the network's attention to the most significant features at each level of image features, helps the network ignore blurred, occluded, and irrelevant features, and incorporates the significant features into our self-attention center loss function to compress the distribution of elements within the same class. The MCAFM preserves the attributes of each face region (such as location, size, and direction) by transferring the features into capsules in preparation for the dynamic routing mechanism, which addresses the problem of image rotation in facial expression recognition (FER) in the wild.
Simultaneously, the capsule features of distinct areas are combined to provide higher-order overall feature information, enhancing the model's capacity to discriminate between different kinds of expressions. The RM reconstructs the image and calculates the difference between the reconstructed image and the original input image. Our model outperforms a large number of current methods on two public datasets, RAF-DB and SFEW.
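
As a rough illustration of the capsule and reconstruction steps described in the abstract, the following is a minimal pure-Python sketch of routing-by-agreement between capsules and of a reconstruction difference. It is not the authors' implementation: it assumes the standard squash nonlinearity and dynamic routing procedure from the original capsule network formulation, and it assumes a sum of squared errors as the reconstruction difference, which the abstract does not specify. The names `squash`, `dynamic_routing`, `reconstruction_error`, and `u_hat` are hypothetical and chosen only for this example.

```python
import math


def squash(vec, eps=1e-9):
    """Shrink a vector so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in vec)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * x for x in vec]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement (assumed standard form, not taken from the paper).

    u_hat[i][j] is the prediction vector that input capsule i makes
    for output capsule j. Returns one squashed vector per output capsule.
    """
    num_in, num_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    logits = [[0.0] * num_out for _ in range(num_in)]
    outputs = []
    for _ in range(num_iters):
        # Each input capsule distributes its vote across the output capsules.
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(num_out):
            weighted_sum = [
                sum(coupling[i][j] * u_hat[i][j][d] for i in range(num_in))
                for d in range(dim)
            ]
            outputs.append(squash(weighted_sum))
        # Raise the logit where a prediction agrees with the output (dot product).
        for i in range(num_in):
            for j in range(num_out):
                logits[i][j] += sum(
                    u_hat[i][j][d] * outputs[j][d] for d in range(dim)
                )
    return outputs


def reconstruction_error(original, reconstructed):
    """Sum of squared pixel differences (an assumed choice of difference)."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed))


if __name__ == "__main__":
    # Two input capsules, two output capsules, 2-D capsule vectors (toy values).
    u_hat = [
        [[1.0, 0.0], [0.1, 0.1]],
        [[0.9, 0.1], [0.0, 0.2]],
    ]
    for capsule in dynamic_routing(u_hat):
        print(math.sqrt(sum(x * x for x in capsule)))
    print(reconstruction_error([0.0, 0.5, 1.0], [0.1, 0.5, 0.8]))
```

In this formulation, the length of each output capsule vector acts as a confidence score and its direction encodes attributes such as pose, which is the property the abstract relies on when it says capsules preserve location, size, and direction.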