A Spatiotemporal Network Using a Local Spatial Difference Stack Block for Facial Micro-Expression Recognition
Yan Liang, Yan Hao, Jiacheng Liao, Zhuoran Deng, Xing Wen, Zefeng Zheng, Jiahui Pan
DOI: https://doi.org/10.1007/s11042-023-16033-1
IF: 2.577
2024-01-01
Multimedia Tools and Applications
Abstract: Video-based micro-expression recognition (MER) has recently attracted attention in various application scenarios. However, current deep learning-based MER methods still struggle with several challenges, such as insufficient data, difficulty in capturing subtle facial motions, and keyframe recognition. In this paper, we propose a robust MER solution that requires no prior annotation of keyframes. Because traditional data augmentation techniques can destroy the slight motion information in the sequence frames, we design stride sampling to increase the number of samples while preserving the important motion features of the micro-expression (ME). Moreover, to capture rapid and subtle facial changes and thereby improve the accuracy of ME classification, we construct a local spatial difference stack (LSDS) block and incorporate it into the lightweight spatiotemporal network VGGFace-TCN. Experiments demonstrate that the proposed algorithm can effectively detect the local facial movement details of MEs from the original frames without additional visual features such as optical flow, and that it minimizes the risk of overfitting. Compared with other state-of-the-art methods, the proposed method obtained the best performance under the holdout database evaluation (HDE) strategy, with an accuracy of 57.46% and an F1-score of 0.3734. Furthermore, it attained an accuracy of 61.27% and an F1-score of 0.5343 on the Spontaneous Actions and Micro-movements (SAMM) dataset, significantly higher than other state-of-the-art methods.
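The abstract does not spell out how stride sampling works, so the following is only a minimal sketch under one plausible reading: a clip is split into several interleaved subsequences by taking every `stride`-th frame from each possible starting offset. Each subsequence still spans the whole clip in temporal order, which is one way the number of samples could grow without altering any frame. The function name `stride_sample`, the `stride` parameter, and the integer stand-ins for frames are all hypothetical and not taken from the paper.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def stride_sample(frames: Sequence[Frame], stride: int) -> List[List[Frame]]:
    """Split one clip into up to `stride` interleaved subsequences.

    Assumption (not from the paper): subsequence k holds frames
    k, k + stride, k + 2 * stride, ... in their original order.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    # Never produce more subsequences than there are frames,
    # so no subsequence is empty.
    n_offsets = min(stride, len(frames))
    return [list(frames[offset::stride]) for offset in range(n_offsets)]


# Integers stand in for video frames in this toy example.
clip = list(range(10))
samples = stride_sample(clip, 3)
```

Under this reading, a 10-frame clip with `stride=3` yields three shorter sequences that together cover every original frame exactly once, so no frame is synthesized or distorted.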