Facial micro-expression recognition using three-stream vision transformer network with sparse sampling and relabeling

He Zhang, Lu Yin, Hanling Zhang, Xuesong Wu
DOI: https://doi.org/10.1007/s11760-024-03039-x
2024-02-20
Abstract: Most existing micro-expression recognition (MER) methods are based on convolutional neural networks (CNNs) and obtain better representations than conventional handcrafted methods. Nevertheless, the local receptive field of CNNs leads to poor global feature extraction and thus limits accuracy. In contrast, the vision transformer, an alternative technique, can capture global facial information and outperforms CNNs in many vision tasks. However, directly applying it to MER may not be as effective as expected, since the insufficient data and class imbalance of existing ME datasets can seriously restrict accuracy. To address these problems, we propose a three-stream vision transformer-based network with sparse sampling and relabeling (SSRLTS-ViT). First, the network learns discriminative ME representations from three optical flow components. Second, a sparse sampling strategy adds optical flow components computed from the onset frame and frames around the apex to the training set, which expands the number of samples while preserving the differences between them. Moreover, we introduce a relabeling mechanism that reassigns corrected labels to the training data to reduce the impact of subjective annotations, which further improves recognition accuracy. Experimental results on two benchmarks show that SSRLTS-ViT outperforms competing methods, obtaining a UF1 of 0.843 and a UAR of 0.853 on the 3-class datasets and a UF1 of 0.795 and a UAR of 0.801 on the 5-class datasets.
engineering, electrical & electronic; imaging science & photographic technology
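
The abstract reports results as UF1 (unweighted F1-score) and UAR (unweighted average recall). Both average a per-class score with equal weight per class, so rare classes count as much as frequent ones, which matters for class-imbalanced ME datasets. The following is a minimal sketch of how these two metrics are commonly computed, not the authors' evaluation code; the function name `uf1_uar` and the toy labels are illustrative assumptions, and it assumes predictions from all evaluation folds have already been pooled into one list.

```python
def uf1_uar(y_true, y_pred):
    """Return (UF1, UAR) for pooled predictions.

    UF1 is the mean of per-class F1 scores (macro F1).
    UAR is the mean of per-class recalls (macro recall).
    Classes are taken from y_true, so every class has at least one sample.
    """
    classes = sorted(set(y_true))
    f1_scores, recalls = [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Per-class F1 written directly in terms of TP, FP, FN.
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
        # Per-class recall; tp + fn > 0 because c occurs in y_true.
        recalls.append(tp / (tp + fn))
    n = len(classes)
    return sum(f1_scores) / n, sum(recalls) / n


# Toy 3-class example (hypothetical labels, not data from the paper).
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2]
uf1, uar = uf1_uar(y_true, y_pred)
print(f"UF1={uf1:.3f} UAR={uar:.3f}")
```

Because each class contributes equally, a model that ignores a minority class is penalized far more under UF1 and UAR than under plain accuracy.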