A^3lign-DFER: Pioneering Comprehensive Dynamic Affective Alignment for Dynamic Facial Expression Recognition with CLIP

Zeng Tao, Yan Wang, Junxiong Lin, Haoran Wang, Xinji Mai, Jiawen Yu, Xuan Tong, Ziheng Zhou, Shaoqi Yan, Qing Zhao, Liyuan Han, Wenqiang Zhang
Abstract: CLIP does not perform as well on the dynamic facial expression recognition (DFER) task as it does on other CLIP-based classification tasks. CLIP's primary objective is to align images and text in the feature space, but DFER is challenging because the text labels are abstract and the video is dynamic, which limits label representation and makes perfect alignment difficult. To address this issue, we have designed A^3lign-DFER, which introduces a new DFER labeling paradigm to achieve comprehensive alignment, thus enhancing CLIP's suitability for the DFER task. Specifically, our A^3lign-DFER method is designed with multiple modules that work together to obtain the most suitable expanded-dimensional embeddings for classification and to achieve alignment in three key aspects: affective, dynamic, and bidirectional. We replace the input label text with a learnable Multi-Dimensional Alignment Token (MAT), enabling alignment of text to facial expression video samples in both the affective and dynamic dimensions. After CLIP feature extraction, we introduce the Joint Dynamic Alignment Synchronizer (JAS), which further facilitates synchronization and alignment in the temporal dimension. Additionally, we implement a Bidirectional Alignment Training Paradigm (BAP) to ensure gradual and steady training of the parameters for both modalities. Our insightful and concise A^3lign-DFER method achieves state-of-the-art results on multiple DFER datasets, including DFEW, FERV39k, and MAFW. Extensive ablation experiments and visualization studies demonstrate the effectiveness of A^3lign-DFER. The code will be available in the future.
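To make the general idea concrete, the following is a minimal sketch of CLIP-style classification in which each class is represented by a learnable embedding rather than a fixed text label. It is not the authors' MAT, JAS, or BAP implementation, whose details the abstract does not give. The function names, the mean pooling over frames (a stand-in for any temporal modeling), the temperature value, and the toy 2-dimensional features are all assumptions made purely for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_pool(frame_features):
    """Average per-frame features into one video feature (assumed placeholder
    for temporal modeling; the paper's JAS module is not reproduced here)."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[i] for f in frame_features) / n for i in range(dim)]


def classify(frame_features, class_tokens, temperature=0.07):
    """Return class probabilities from cosine similarity between the pooled
    video feature and one embedding per class.

    class_tokens stands in for learnable per-class embeddings (hypothetical;
    in training they would be updated by gradient descent).
    The temperature is an assumed value for this sketch.
    """
    video = l2_normalize(mean_pool(frame_features))
    logits = []
    for token in class_tokens:
        unit = l2_normalize(token)
        cosine = sum(a * b for a, b in zip(video, unit))
        logits.append(cosine / temperature)
    # Numerically stable softmax over the class logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: two frames and two classes in a 2-dimensional feature space.
frames = [[1.0, 0.0], [0.8, 0.2]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
probs = classify(frames, tokens)
```

In this toy setup the pooled video feature lies close to the first class embedding, so the first class receives the higher probability. A trainable version would treat the class embeddings as parameters and optimize them against a classification loss.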