SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting

Yicheng Deng, Hideaki Hayashi, Hajime Nagahara
2024-07-30
Abstract: Facial expression spotting, identifying periods where facial expressions occur in a video, is a significant yet challenging task in facial expression analysis. The issues of irrelevant facial movements and the challenge of detecting subtle motions in micro-expressions remain unresolved, hindering accurate expression spotting. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Sliding Window-based Multi-Resolution Optical flow (SW-MRO) feature, which calculates multi-resolution optical flow of the input image sequence within compact sliding windows. The window length is tailored to perceive complete micro-expressions and distinguish between general macro- and micro-expressions. SW-MRO can effectively reveal subtle motions while avoiding severe head movement problems. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation. In SpotFormer, our proposed Facial Local Graph Pooling (FLGP) and convolutional layers are applied for multi-scale spatio-temporal feature extraction. We show the validity of the architecture of SpotFormer by comparing it with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV and CAS(ME)^2 show that our method outperforms state-of-the-art models, particularly in micro-expression spotting.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a key challenge in facial expression spotting: accurately locating the intervals (start and end frames) in which facial expressions occur in a video, particularly micro-expressions (MEs). Specifically:

1. **Addressing the impact of irrelevant facial movements**: Traditional methods struggle to distinguish facial expressions from unrelated movements such as head movements and blinking, which lowers spotting accuracy.
2. **Capturing subtle changes**: Micro-expressions are very short (less than 0.5 seconds) and low in intensity, making them difficult to detect. Existing deep learning-based methods fail to reveal these subtle changes effectively when extracting optical flow features.

To address these issues, the paper proposes the following:

- **Sliding Window-based Multi-Resolution Optical flow (SW-MRO) feature**: Multi-resolution optical flow is computed within compact sliding windows. The window length is chosen so that a window can cover a complete micro-expression and so that macro-expressions (MaEs) and micro-expressions can be told apart. This reveals subtle motions while avoiding severe head-movement problems.
- **SpotFormer**: A multi-scale spatio-temporal Transformer that simultaneously encodes the spatio-temporal relationships of the SW-MRO features for frame-level probability estimation. The paper introduces a Facial Local Graph Pooling (FLGP) operation to extract multi-scale spatial features and uses convolutional layers (learned temporal down-sampling) to extract multi-scale temporal features.
- **Supervised contrastive learning**: Enhances the model's ability to discriminate between different types of expressions, particularly when distinguishing expression boundaries in long videos.

In summary, the paper proposes a new framework to improve the accuracy of facial expression spotting and reports that it outperforms state-of-the-art models on SAMM-LV and CAS(ME)^2, particularly in micro-expression spotting.
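
The sliding-window idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `window_length` and `sliding_windows`, the stride of 1, and the 30 fps frame rate in the usage lines are assumptions made for illustration. The only figure taken from the text is the 0.5-second upper bound on micro-expression duration. The optical flow computation inside each window is omitted.

```python
def window_length(fps: float, max_duration_s: float = 0.5) -> int:
    """Frames needed to span max_duration_s at the given frame rate (at least 2).

    max_duration_s defaults to the 0.5 s micro-expression bound from the text;
    using it directly as the window length is an assumption of this sketch.
    """
    return max(2, round(fps * max_duration_s))


def sliding_windows(num_frames: int, length: int, stride: int = 1):
    """Return (start, end) frame-index pairs, end inclusive, for every full window."""
    if length > num_frames:
        return []
    return [(start, start + length - 1)
            for start in range(0, num_frames - length + 1, stride)]


# Hypothetical 20-frame clip recorded at 30 fps (illustrative values only).
k = window_length(fps=30)
windows = sliding_windows(num_frames=20, length=k)
print(k, windows[0], windows[-1], len(windows))
```

Each `(start, end)` pair marks the frames over which a window-level feature such as optical flow would be computed. Keeping the window short is what limits the influence of slow head motion, while covering up to 0.5 seconds lets a single window contain a full micro-expression.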