Facial Action Unit Detection with Emotion Consistency: A Cross-Modal Learning Approach

Wenyu Song, Dongxin Liu, Gaoyun An, Yun Duan, Laifu Wang
DOI: https://doi.org/10.1007/s00530-024-01552-0
IF: 3.9
2024-01-01
Multimedia Systems
Abstract: Facial Action Unit (AU) detection is essential for understanding emotional expressions. This study explores the relationship among AUs, AU descriptions, and facial expressions, emphasizing emotional expression consistency. AUs represent specific facial muscle movements that form the basis of expressions, so maintaining a solid physiological foundation is crucial for understanding emotional communication. Moreover, AU descriptions serve as linguistic representations, and their semantic alignment with expressions is paramount. Therefore, the vocabulary in AU descriptions must precisely reflect expression features to ensure coherence between textual and visual cues. Our method, AUTr-emo, employs cross-modal learning, incorporating AU text descriptions as queries and using facial expression recognition as an auxiliary task. This approach highlights the importance of emotional expression consistency across AUs, textual descriptions, and expressions. Extensive experiments on two challenging datasets, BP4D and DISFA, show that the proposed AUTr-emo achieves performance comparable to the state of the art in AU detection.