Abstract: Facial expressions help individuals convey their emotions. In recent years, thanks to the development of computer vision technology, facial expression recognition (FER) has become a research hotspot and has made remarkable progress. However, human faces in real-world environments are affected by various unfavorable factors, such as facial occlusion and head pose changes, which are seldom encountered in controlled laboratory settings. These factors often reduce expression recognition accuracy. Inspired by the recent success of transformers in many computer vision tasks, we propose a model called the fine-tuned channel–spatial attention transformer (FT-CSAT) to improve the accuracy of FER in the wild. FT-CSAT consists of two crucial components: a channel–spatial attention module and a fine-tuning module. In the channel–spatial attention module, the feature map is passed through the channel attention module and then the spatial attention module. The final output feature map therefore incorporates both channel information and spatial information, so the network becomes adept at focusing on relevant and meaningful features associated with facial expressions. To further improve the model's performance while limiting the number of additional parameters, we employ a fine-tuning method. Extensive experimental results demonstrate that FT-CSAT outperforms state-of-the-art methods on two benchmark datasets, RAF-DB and FERPlus, achieving recognition accuracies of 88.61% and 89.26%, respectively. Furthermore, to evaluate the robustness of FT-CSAT under facial occlusion and head pose changes, we run tests on the Occlusion-RAF-DB and Pose-RAF-DB datasets, and the results also show the superior recognition performance of the proposed method under such conditions.
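To make the sequential channel-then-spatial ordering concrete, the following is a minimal, parameter-free sketch in plain Python. It is not the FT-CSAT module itself: the abstract does not specify the layers, so the gates here (a sigmoid of the global average per channel, then a sigmoid of the cross-channel mean per position) are simplifying assumptions, and the function names are hypothetical. A learned module would typically replace these fixed gates with trainable layers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(fmap):
    """Scale each channel by a gate derived from its global average.

    fmap is a nested list shaped [channels][height][width].
    Assumption: a fixed sigmoid gate stands in for any learned layers.
    """
    out = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        gate = sigmoid(sum(values) / len(values))
        out.append([[v * gate for v in row] for row in channel])
    return out


def spatial_attention(fmap):
    """Scale each spatial position by a gate derived from the cross-channel mean.

    Assumption: a fixed sigmoid gate stands in for any learned layers.
    """
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    gates = [
        [sigmoid(sum(fmap[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]
    return [
        [[fmap[k][i][j] * gates[i][j] for j in range(w)] for i in range(h)]
        for k in range(c)
    ]


def channel_spatial_attention(fmap):
    """Apply channel attention first, then spatial attention, in sequence."""
    return spatial_attention(channel_attention(fmap))


if __name__ == "__main__":
    # Toy feature map: 2 channels, 2x2 spatial grid.
    feature_map = [
        [[1.0, 1.0], [1.0, 1.0]],
        [[0.0, 0.0], [0.0, 0.0]],
    ]
    refined = channel_spatial_attention(feature_map)
    print(refined)
```

The output keeps the input shape; each value has been rescaled first by its channel gate and then by the gate of its spatial position, which is the sense in which the result carries both channel and spatial information.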

A Cascaded Spatiotemporal Attention Network for Dynamic Facial Expression Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

LSTPNet: Long short-term perception network for dynamic facial expression recognition in the wild

A Cascade Attention Based Facial Expression Recognition Network by Fusing Multi-Scale Spatio-Temporal Features

MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition

A multi-scale multi-attention network for dynamic facial expression recognition

Multi-Attention Module for Dynamic Facial Emotion Recognition

SAANet: Siamese Action-Units Attention Network for Improving Dynamic Facial Expression Recognition

Automatic 4D Facial Expression Recognition via Collaborative Cross-domain Dynamic Image Network

A Dual-Direction Attention Mixed Feature Network for Facial Expression Recognition

Spatio-Temporal Facial Expression Recognition Using Convolutional Neural Networks and Conditional Random Fields

Dual-STI: Dual-path spatial-temporal interaction learning for dynamic facial expression recognition

Towards Reading Beyond Faces for Sparsity-Aware 4D Affect Recognition

Two-pathway attention network for real-time facial expression recognition

Coarse-to-Fine Cascaded Networks with Smooth Predicting for Video Facial Expression Recognition

Patch Attention Network for Video Facial Expression Recognition

PASTFNet: a paralleled attention spatio-temporal fusion network for micro-expression recognition

Facial Expression Recognition Using Hybrid Features of Pixel and Geometry

Facial Expression Recognition Based on Fine-Tuned Channel–Spatial Attention Transformer

Intensity-Aware Loss for Dynamic Facial Expression Recognition in the Wild

3-D Facial Expression Recognition via Attention-Based Multichannel Data Fusion Network