MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition

Linhuang Wang, Xin Kang, Fei Ding, Satoshi Nakagawa, Fuji Ren
DOI: https://doi.org/10.1109/ICASSP48485.2024.10446699
2024-04-12
Abstract: Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. Addressing this distinctive attribute, we propose a Multi-Scale Spatio-Temporal CNN-Transformer network (MSSTNet). Our approach takes spatial features of different scales extracted by CNN and feeds them into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former simultaneously extracts temporal information while continually integrating multi-scale spatial information. This process culminates in the generation of multi-scale spatio-temporal features that are utilized for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations provide further validation of our approach's proficiency in leveraging spatio-temporal information within DFER.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a core difficulty in Dynamic Facial Expression Recognition (DFER): capturing and using subtle changes in facial muscles over time. Unlike typical video action recognition, DFER does not involve distinct moving targets but relies on localized changes in facial muscles. To handle this, the authors propose a Multi-Scale Spatio-Temporal CNN-Transformer Network (MSSTNet), which aims to improve recognition accuracy by fusing multi-scale spatial features with temporal features.

The main contributions of the paper are:

1. **Multi-scale Embedding Layer (MELayer)**: This layer takes the spatial features that the CNN extracts at different scales and encodes them before they are fed into the Temporal Transformer (T-Former).
2. **Temporal Transformer (T-Former)**: This module extracts temporal information while continually integrating the multi-scale spatial information, producing multi-scale spatio-temporal features for the final classification.
3. **Experimental Validation**: Experiments on two commonly used in-the-wild datasets (DFEW and FERV39k) show that MSSTNet achieves state-of-the-art performance, and ablation studies and visualizations support its use of spatio-temporal information.

In summary, the paper proposes an approach to spatio-temporal feature extraction for DFER that fuses multi-scale spatial information with temporal information and reports state-of-the-art results on both datasets.
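The data flow described above (multi-scale CNN features, MELayer, T-Former, classifier) can be illustrated with a minimal standard-library Python sketch. It is not the authors' implementation, since the summary gives no layer counts, dimensions, or fusion details. Everything concrete here is an assumption made for illustration: random weights stand in for learned parameters, the MELayer is reduced to global average pooling plus a linear projection, the T-Former is reduced to one single-head attention layer per scale without learned query/key/value projections, and the names `embed_scale`, `temporal_attention`, and `msst_forward`, along with all sizes, are hypothetical.

```python
import math
import random

def linear(x, w):
    """Multiply vector x (length n) by weight matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def rand_matrix(rng, n, m):
    """Small random weights standing in for learned parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(n)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def embed_scale(feature_map, w):
    """Hypothetical MELayer step: average-pool each channel, then project to d_model."""
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    return linear(pooled, w)

def temporal_attention(tokens):
    """Single-head self-attention across frames with a residual connection."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens])
        mixed = [sum(wt * v[j] for wt, v in zip(weights, tokens)) for j in range(d)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out

def msst_forward(clip, proj, w_cls):
    """clip[t][s] is the scale-s feature map (channels x height x width) of frame t."""
    d_model = len(proj[0][0])
    tokens = [[0.0] * d_model for _ in clip]
    # Assumed fusion scheme: one temporal layer per spatial scale, with that
    # scale's embedding added to the frame tokens before attention.
    for s, w in enumerate(proj):
        embeds = [embed_scale(frame[s], w) for frame in clip]
        tokens = [[a + b for a, b in zip(tok, emb)] for tok, emb in zip(tokens, embeds)]
        tokens = temporal_attention(tokens)
    # Average over time, then classify.
    pooled = [sum(tok[j] for tok in tokens) / len(tokens) for j in range(d_model)]
    return linear(pooled, w_cls)

def make_clip(rng, frames, scales):
    """Random stand-in for CNN features; scales lists (channels, height, width)."""
    return [[[[[rng.random() for _ in range(w)] for _ in range(h)] for _ in range(c)]
             for (c, h, w) in scales] for _ in range(frames)]

rng = random.Random(0)
scales = [(4, 8, 8), (8, 4, 4), (16, 2, 2)]  # hypothetical feature-map sizes
d_model, num_classes = 12, 7                 # hypothetical sizes
proj = [rand_matrix(rng, c, d_model) for (c, _, _) in scales]
w_cls = rand_matrix(rng, d_model, num_classes)
clip = make_clip(rng, frames=6, scales=scales)
logits = msst_forward(clip, proj, w_cls)
probs = softmax(logits)
```

The sketch keeps one token per frame throughout, so the attention operates purely along the time axis while each layer mixes in a different spatial scale. A practical model would likely use a deep learning framework, learned projections, multi-head attention, and normalization layers.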