Abstract: Facial expressions help individuals convey their emotions. In recent years, thanks to advances in computer vision, facial expression recognition (FER) has become a research hotspot and has made remarkable progress. However, human faces in real-world environments are affected by unfavorable factors, such as facial occlusion and head pose changes, that are seldom encountered in controlled laboratory settings. These factors often reduce expression recognition accuracy. Inspired by the recent success of transformers in many computer vision tasks, we propose a model called the fine-tuned channel–spatial attention transformer (FT-CSAT) to improve FER accuracy in the wild. FT-CSAT consists of two crucial components: a channel–spatial attention module and a fine-tuning module. In the channel–spatial attention module, the feature map is passed through the channel attention module and then the spatial attention module, so the final output feature map incorporates both channel information and spatial information. Consequently, the network becomes adept at focusing on the relevant and meaningful features associated with facial expressions. To further improve the model's performance while keeping the number of parameters under control, we employ a fine-tuning method. Extensive experimental results demonstrate that FT-CSAT outperforms the state-of-the-art methods on two benchmark datasets, RAF-DB and FERPlus, achieving recognition accuracies of 88.61% and 89.26%, respectively. Furthermore, to evaluate the robustness of FT-CSAT under facial occlusion and head pose changes, we test it on the Occlusion-RAF-DB and Pose-RAF-DB datasets, and the results show that the proposed method also achieves superior recognition performance under these conditions.
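The sequential channel-then-spatial attention described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not say how either attention map is computed, so everything below is an assumption: the pooling-plus-MLP channel attention follows the common CBAM-style design, the spatial attention uses a 1x1 mixing of the pooled maps instead of a larger convolution for brevity, and the names `channel_attention`, `spatial_attention`, `w1`, `w2`, and `w_sp` are hypothetical. The weights are random rather than learned, and this is not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """x: (C, H, W). Returns per-channel weights of shape (C, 1, 1)."""
    avg = x.mean(axis=(1, 2))  # global average pooling -> (C,)
    mx = x.max(axis=(1, 2))    # global max pooling -> (C,)

    def mlp(v):
        # Shared two-layer bottleneck MLP with ReLU (assumed design).
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]


def spatial_attention(x, w_sp):
    """x: (C, H, W). Returns per-location weights of shape (1, H, W)."""
    avg = x.mean(axis=0)  # average over channels -> (H, W)
    mx = x.max(axis=0)    # max over channels -> (H, W)
    # Assumed simplification: a 1x1 mixing of the two pooled maps.
    return sigmoid(w_sp[0] * avg + w_sp[1] * mx)[None, :, :]


def channel_spatial_attention(x, w1, w2, w_sp):
    """Apply channel attention first, then spatial attention."""
    x = x * channel_attention(x, w1, w2)
    x = x * spatial_attention(x, w_sp)
    return x


# Hypothetical sizes: 8 channels, 4x4 feature map, reduction ratio 2.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
w_sp = rng.standard_normal(2)

out = channel_spatial_attention(x, w1, w2, w_sp)
print(out.shape)  # (8, 4, 4)
```

Because both attention maps come from a sigmoid, every output value is the input value scaled by a factor between 0 and 1, and the feature map keeps its shape, so a module of this kind can be dropped between existing layers.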

Expression snippet transformer for robust video-based facial expression recognition

Efficient Video Transformers via Spatial-Temporal Token Merging for Action Recognition

Micro-expression Spotting with Multi-scale Local Transformer in Long Videos

Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild

Robust facial expression recognition with Transformer Block Enhancement Module

Facial Expression Recognition With Visual Transformers and Attentional Selective Fusion

Enhanced Hybrid Vision Transformer with Multi-Scale Feature Integration and Patch Dropping for Facial Expression Recognition

Fine-Grained Temporal-Enhanced Transformer for Dynamic Facial Expression Recognition

Facial Expression Recognition Based on Fine-Tuned Channel–Spatial Attention Transformer

SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting

Spatial-Temporal Graphs Plus Transformers for Geometry-Guided Facial Expression Recognition

Facial micro-expression recognition using three-stream vision transformer network with sparse sampling and relabeling

Coarse-to-Fine Cascaded Networks with Smooth Predicting for Video Facial Expression Recognition

Attention on Emotions: A Vision Transformer Approach to Advancing Facial Expression Recognition

CDGT: Constructing diverse graph transformers for emotion recognition from facial videos

Multimodal Feature Extraction and Fusion for Emotional Reaction Intensity Estimation and Expression Classification in Videos with Transformers

Clip-aware expressive feature learning for video-based facial expression recognition

VidFace: A Full-Transformer Solver for Video Face Hallucination with Unaligned Tiny Snapshots