Abstract: We introduce a novel automatic detection method for facial action units (AUs) that leverages both spatial and temporal data, enhancing accuracy and robustness in expression analysis and facial animation. Our approach uses a temporal feature combination and feature reassignment (TC&FR) module to transform and fuse features across multiple subjects and temporal sequences. In addition, by integrating a Regional Attention (RA) encoder and a transformer model, our method refines the extraction and processing of regional features, giving more precise identification and analysis of AUs. This integration exploits both identity-independent features and temporal context, improving the reliability of AU predictions.

AUs encode the activations of facial muscle groups and play a crucial role in expression analysis and facial animation. However, current deep learning AU detection methods primarily focus on single-image analysis, which limits their use of the rich temporal context needed for robust outcomes. Moreover, the available datasets remain limited in scale, so models trained on them tend to overfit. This paper proposes a novel AU detection method that integrates spatial and temporal data with inter-subject feature reassignment for accurate and robust AU predictions. Our method first extracts regional features from facial images. Then, to capture both temporal context and identity-independent features, we introduce the TC&FR module, which transforms single-image features into a cohesive temporal sequence and fuses features across multiple subjects. This transformation encourages the model to rely on identity-independent features and temporal context, leading to more robust predictions. Experimental results demonstrate the improvements brought by the proposed modules and the state-of-the-art (SOTA) results achieved by our method.
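To make the idea of combining per-frame features into a sequence and then mixing them across subjects more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the tensor layout (subjects, frames, regions, feature dimension), the one-binary-label-per-region setup, and the per-region random permutation over subjects are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def combine_temporal(frame_feats):
    """Stack per-frame regional features [(R, D), ...] into one (T, R, D) sequence."""
    return np.stack(frame_feats, axis=0)


def reassign_features(feats, labels, rng):
    """Hypothetical feature reassignment: shuffle each region across subjects.

    feats:  (S, T, R, D) regional features for S subjects, T frames, R regions.
    labels: (S, T, R) one assumed binary label per region and frame.
    Each region's labels move together with its features, so every mixed
    sequence stays correctly labelled while combining several identities.
    """
    num_subjects, _, num_regions, _ = feats.shape
    mixed_feats = np.empty_like(feats)
    mixed_labels = np.empty_like(labels)
    for r in range(num_regions):
        perm = rng.permutation(num_subjects)  # independent shuffle per region
        mixed_feats[:, :, r] = feats[:, :, r][perm]
        mixed_labels[:, :, r] = labels[:, :, r][perm]
    return mixed_feats, mixed_labels


# Toy data: 4 subjects, 8 frames, 5 facial regions, 16-dim features.
rng = np.random.default_rng(0)
S, T, R, D = 4, 8, 5, 16
per_subject = [
    combine_temporal([rng.normal(size=(R, D)) for _ in range(T)])
    for _ in range(S)
]
feats = np.stack(per_subject)                 # (S, T, R, D)
labels = rng.integers(0, 2, size=(S, T, R))   # (S, T, R)
mixed_feats, mixed_labels = reassign_features(feats, labels, rng)
```

In this sketch each output sequence keeps its temporal order within a region but may draw different regions from different subjects, which is one plausible way to discourage a downstream model from relying on identity cues.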

Vision Transformer for Action Units Detection

Progressive Multi-Scale Vision Transformer for Facial Action Unit Detection

Multi-modal Multi-label Facial Action Unit Detection with Transformer

Adaptive-avg-pooling based Attention Vision Transformer for Face Anti-spoofing

AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors

Facial action units detection using temporal context and feature reassignment

LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference

Transformer-based Multimodal Information Fusion for Facial Expression Analysis

MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

Feature Fusion Vision Transformer for Fine-Grained Visual Categorization

An Attention-based Method for Action Unit Detection at the 3rd ABAW Competition

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

AVT: Au-Assisted Visual Transformer for Facial Expression Recognition

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

Multi-modal Facial Action Unit Detection with Large Pre-trained Models for the 5th Competition on Affective Behavior Analysis in-the-wild

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Region And Temporal Dependency Fusion For Multi-Label Action Unit Detection

AU-vMAE: Knowledge-Guide Action Units Detection via Video Masked Autoencoder

Spatio-Temporal AU Relational Graph Representation Learning For Facial Action Units Detection

Facial Expression Recognition with Visual Transformers and Attentional Selective Fusion