Vision Transformer for Action Units Detection

Tu Vu, Van Thong Huynh, Soo Hyung Kim
2023-03-20
Abstract: Facial Action Unit (FAU) detection is a fine-grained classification problem that involves identifying different units on the human face, as defined by the Facial Action Coding System. In this paper, we present a simple yet efficient Vision Transformer-based approach to the task of Action Unit (AU) detection in the context of the Affective Behavior Analysis in-the-wild (ABAW) competition. We employ the Video Vision Transformer (ViViT) network to capture temporal facial changes in the video. In addition, to reduce the massive size of the Vision Transformer model, we replace the ViViT feature extraction layers with a CNN backbone (RegNet). Our model outperforms the baseline model of the ABAW 2023 challenge, with a notable 14% difference in results. Furthermore, the achieved results are comparable to those of the top three teams in the previous ABAW 2022 challenge.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the detection of facial Action Units (AUs) in natural environments (i.e., uncontrolled conditions), a key challenge in Affective Behavior Analysis. The authors propose a Vision Transformer-based approach to the AU detection task of the Affective Behavior Analysis in-the-wild (ABAW) competition. Specifically, the paper addresses the following issues:

1. **Reducing model complexity**: To shrink the very large Vision Transformer model, the authors replace the ViViT feature extraction layers with a CNN backbone (RegNetY), retaining essential information while lightening the model.
2. **Capturing temporal facial changes**: The Video Vision Transformer (ViViT) network captures how the face changes over time in a video, which is crucial for detecting dynamic AUs.
3. **Ensemble learning strategy**: An ensemble learning scheme built on the ViViT model improves AU detection performance, with a reported 14% gain over the baseline model of the ABAW 2023 challenge.
4. **Comparison with previous methods**: The proposed model surpasses the baseline and performs comparably to the top three teams in the previous ABAW 2022 challenge, which supports the method's competitiveness.

The main contribution of the paper is a simple, efficient Vision Transformer-based AU detection method suited to the challenging task of analyzing human affective behavior in real-world scenarios.
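
The ensemble learning strategy mentioned above can be illustrated with a small sketch. The summary does not say how the individual models' outputs are combined, so the version below assumes the simplest option: each model produces one probability per AU, the probabilities are averaged across models, and an AU is marked as active when the average reaches a threshold. The function name `ensemble_au_predictions`, the 0.5 threshold, and the example numbers are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def ensemble_au_predictions(
    model_probs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed cut-off, not specified in the source
) -> List[int]:
    """Average per-AU probabilities across models, then threshold.

    model_probs[m][k] is the probability that model m assigns to AU k
    for a single frame. Returns one 0/1 decision per AU.
    """
    if not model_probs:
        raise ValueError("at least one model output is required")

    n_models = len(model_probs)
    n_aus = len(model_probs[0])
    if any(len(p) != n_aus for p in model_probs):
        raise ValueError("all models must score the same number of AUs")

    # Assumption: plain (unweighted) averaging of the model probabilities.
    averaged = [
        sum(p[k] for p in model_probs) / n_models for k in range(n_aus)
    ]

    # AU detection is multi-label, so each AU is thresholded independently.
    return [1 if a >= threshold else 0 for a in averaged]


# Hypothetical outputs of three models for four AUs on one frame.
probs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.1, 0.3, 0.7],
    [0.7, 0.3, 0.5, 0.6],
]
decisions = ensemble_au_predictions(probs)
print(decisions)
```

Because AU detection is a multi-label problem, each AU is decided independently here rather than through a softmax over classes. A weighted average or per-AU thresholds would be natural variations, but the source gives no indication of which variant the authors used.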