Abstract: Facial expression recognition (FER) plays a crucial role in affective computing, enabling machines to understand and respond to human emotions in human-computer interaction. Despite advances in deep learning, current FER systems still struggle with occlusions, head pose variations, and motion blur in natural environments. To address these challenges, we propose the Attention-Enhanced Multi-Layer Transformer (AEMT) model, which integrates a dual-branch Convolutional Neural Network (CNN), an Attentional Selective Fusion (ASF) module, and a Multi-Layer Transformer Encoder (MTE) with transfer learning. The dual-branch CNN captures detailed texture and color information by processing RGB and Local Binary Pattern (LBP) features in separate branches. The ASF module applies global and local attention mechanisms to the extracted features to selectively enhance the relevant ones. The MTE captures long-range dependencies and models the complex relationships between features. Together, these components improve feature representation and classification accuracy. We evaluated the model on the RAF-DB and AffectNet datasets, where AEMT achieved accuracies of 81.45% and 71.23%, respectively, significantly outperforming existing state-of-the-art methods. These results indicate that the model effectively addresses the challenges of FER in natural environments, improving the robustness and accuracy of emotion recognition in complex real-world scenarios. This work enhances the capabilities of affective computing systems and opens avenues for future research on model efficiency and multimodal data integration.
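To make the LBP input of the dual-branch CNN more concrete, the following is a minimal sketch of the basic 3x3 Local Binary Pattern operator in plain Python. The abstract does not state which LBP variant, radius, or bit ordering the AEMT model uses, so the function name `lbp_image`, the clockwise neighbor order, and the `>=` comparison are assumptions chosen for illustration only.

```python
def lbp_image(gray):
    """Compute basic 3x3 LBP codes for the interior pixels of a grayscale image.

    gray: list of equal-length rows of integer intensities.
    Returns a list of (H-2) rows with (W-2) codes each, every code in 0..255.
    """
    # Neighbor offsets (dy, dx), clockwise from the top-left pixel.
    # The bit order is a convention assumed here, not taken from the paper.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    height, width = len(gray), len(gray[0])
    codes = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            center = gray[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                # Set the bit when the neighbor is at least as bright as the center.
                if gray[y + dy][x + dx] >= center:
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes


# Only the top-left neighbor is >= the center, so only bit 0 is set.
patch = [[9, 0, 0],
         [0, 5, 0],
         [0, 0, 0]]
print(lbp_image(patch))  # prints [[1]]
```

Because each code depends only on whether a neighbor is brighter or darker than the center pixel, the resulting map is unchanged by monotonic brightness shifts, which is one plausible reason to pair an LBP branch with the RGB branch.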

Collaborative Attention Transformer on Facial Expression Recognition under Partial Occlusion

Facial Expression Recognition With Visual Transformers and Attentional Selective Fusion

Facial Expression Recognition Based on Fine-Tuned Channel–Spatial Attention Transformer

A convolution-transformer dual branch network for head-pose and occlusion facial expression recognition

Enhanced Hybrid Vision Transformer with Multi-Scale Feature Integration and Patch Dropping for Facial Expression Recognition

Facial Expression Recognition Based on Multi-Scale Convolutional Vision Transformer

Facial expression recognition in facial occlusion scenarios: A path selection multi-network

Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for Occluded Facial Expression Recognition

Hybrid Attention-Aware Learning Network for Facial Expression Recognition in the Wild

Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition

MViT: Mask Vision Transformer for Facial Expression Recognition in the Wild

Lossless Attention in Convolutional Networks for Facial Expression Recognition in the Wild

CF-DAN: Facial-expression recognition based on cross-fusion dual-attention network

Automatic 4D Facial Expression Recognition via Collaborative Cross-domain Dynamic Image Network

FER-former: Multi-modal Transformer for Facial Expression Recognition

Facial Expression Recognition Based on Zero-Addition Pretext Training and Feature Conjunction-Selection Network in Human–Robot Interaction

POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition

Evaluation and analysis of visual perception using attention-enhanced computation in multimedia affective computing

Robust facial expression recognition with Transformer Block Enhancement Module

Occlusion Aware Facial Expression Recognition Using CNN With Attention Mechanism