Expression snippet transformer for robust video-based facial expression recognition

Yuanyuan Liu, Wenbin Wang, Chuanxu Feng, Haoyu Zhang, Zhe Chen, Yibing Zhan
DOI: https://doi.org/10.1016/j.patcog.2023.109368
IF: 8
2023-02-09
Pattern Recognition
Abstract: Although the Transformer can be powerful for modeling visual relations and describing complicated patterns, it could still perform unsatisfactorily for video-based facial expression recognition, since the expression movements in a video can be too small to reflect meaningful spatial-temporal relations. To this end, we propose to decompose the modeling of expression movements of a video into the modeling of a series of expression snippets, each of which contains a few frames, and then boost the Transformer's ability for intra-snippet and inter-snippet visual modeling, respectively, obtaining the Expression snippet Transformer (EST). For intra-snippet modeling, we devise an attention-augmented snippet feature extractor to enhance the encoding of subtle facial movements of each snippet. For inter-snippet modeling, we introduce a shuffled snippet order prediction head and a corresponding loss to improve the modeling of subtle motion changes across subsequent snippets. The EST obtains state-of-the-art performance, demonstrating its superiority to other CNN-based methods. Our code and the trained model are available at https://github.com/DreamMr/EST
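The snippet decomposition and shuffled-order target described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `split_into_snippets` and `shuffle_snippets`, the non-overlapping split, the dropping of leftover frames, and the choice to encode the order as a permutation-class index are all assumptions made for illustration. Frames are stood in for by plain integers.

```python
import random
from itertools import permutations


def split_into_snippets(frames, snippet_len):
    """Split a frame sequence into consecutive, non-overlapping snippets.

    Assumption: trailing frames that do not fill a whole snippet are dropped.
    """
    n_snippets = len(frames) // snippet_len
    return [frames[i * snippet_len:(i + 1) * snippet_len]
            for i in range(n_snippets)]


def shuffle_snippets(snippets, rng):
    """Shuffle snippet order and return a classification target for it.

    Assumption: the target is the index of the sampled permutation among all
    permutations of the snippet positions, which is only practical for a
    small number of snippets because the class count grows factorially.
    """
    perms = list(permutations(range(len(snippets))))
    label = rng.randrange(len(perms))   # permutation class to be predicted
    order = perms[label]                # e.g. (2, 0, 1)
    shuffled = [snippets[i] for i in order]
    return shuffled, label, order


if __name__ == "__main__":
    frames = list(range(14))            # stand-in for 14 video frames
    snippets = split_into_snippets(frames, snippet_len=4)
    shuffled, label, order = shuffle_snippets(snippets, random.Random(0))
    print("snippets:", snippets)
    print("order:", order, "label:", label)
    print("shuffled:", shuffled)
```

In a training setup of this kind, a prediction head would receive features of the shuffled snippets and be trained to recover `label`, so that the model has to attend to the subtle motion changes between snippets.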
computer science, artificial intelligence; engineering, electrical & electronic