CFAN-SDA: Coarse-fine Aware Network with Static-Dynamic Adaptation for Facial Expression Recognition in Videos

Dongliang Chen, Guihua Wen, Pei Yang, Huihui Li, Chuyun Chen, Bao Wang
DOI: https://doi.org/10.1109/tcsvt.2024.3450652
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Video-based facial expression recognition (FER) is challenging because emotional expression changes dynamically across the frames of a video sequence. This paper proposes a novel coarse-fine aware network with static-dynamic adaptation (CFAN-SDA) for in-the-wild video-based FER. Working from coarse to fine, our method first leverages a cross-domain static FER database to boost video-based FER performance, and then explores hierarchical spatial-temporal feature learning. Specifically, unlike existing methods, we design a static-dynamic adaptation learning scheme that transfers knowledge from labeled static images to unlabeled video frames, capturing coarse-grained emotion features in order to find the important expression-related frames. Furthermore, we present hierarchical spatial-temporal transformers, consisting of a multi-view spatial transformer and a frame-clip temporal transformer, to better learn fine-grained expression features. The former captures multi-view spatial region information from global to local, and the latter performs cross-frame and cross-clip temporal interaction to select the key frame-level and clip-level multi-scale temporal information for fusion. Extensive experiments on dynamic FER databases indicate that CFAN-SDA outperforms state-of-the-art models.