CDGT: Constructing diverse graph transformers for emotion recognition from facial videos

Dongliang Chen,Guihua Wen,Huihui Li,Pei Yang,Chuyun Chen,Bao Wang
DOI: https://doi.org/10.1016/j.neunet.2024.106573
2024-07-25
Abstract:Recognizing expressions from dynamic facial videos can find more natural affect states of humans, and it becomes a more challenging task in real-world scenes due to pose variations of face, partial occlusions and subtle dynamic changes of emotion sequences. Existing transformer-based methods often focus on self-attention to model the global relations among spatial features or temporal features, which cannot well focus on important expression-related locality structures from both spatial and temporal features for the in-the-wild expression videos. To this end, we incorporate diverse graph structures into transformers and propose a CDGT method to construct diverse graph transformers for efficient emotion recognition from in-the-wild videos. Specifically, our method contains a spatial dual-graphs transformer and a temporal hyperbolic-graph transformer. The former deploys a dual-graph constrained attention to capture latent emotion-related graph geometry structures among local spatial tokens for efficient feature representation, especially for the video frames with pose variations and partial occlusions. The latter adopts a hyperbolic-graph constrained self-attention that explores important temporal graph structure information under hyperbolic space to model more subtle changes of dynamic emotion. Extensive experimental results on in-the-wild video-based facial expression databases show that our proposed CDGT outperforms other state-of-the-art methods.
What problem does this paper attempt to address?