Abstract: In this paper, we introduce ARBEx, a novel attentive feature extraction framework driven by Vision Transformer with reliability balancing to cope with poor class distributions, bias, and uncertainty in the facial expression learning (FEL) task. We reinforce several data pre-processing and refinement methods along with a window-based cross-attention ViT to get the most out of the data. We also employ learnable anchor points in the embedding space with label distributions and a multi-head self-attention mechanism to optimize performance against weak predictions with reliability balancing, a strategy that leverages anchor points, attention scores, and confidence values to enhance the resilience of label predictions. To ensure correct label classification and improve the model's discriminative power, we introduce anchor loss, which encourages large margins between anchor points. Additionally, the multi-head self-attention mechanism, which is also trainable, plays an integral role in identifying accurate labels. This approach provides critical elements for improving the reliability of predictions and has a substantial positive effect on final prediction capabilities. Our adaptive model can be integrated with any deep neural network to forestall challenges in various recognition tasks. Our strategy outperforms current state-of-the-art methodologies, according to extensive experiments conducted in a variety of contexts.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve several key challenges in the Facial Expression Learning (FEL) task. Specifically, these problems include:

1. **Processing of global information**: Existing FEL methods are unable to fully capture the global information of input images due to the limitation of the local receptive field of Convolutional Neural Networks (CNNs).
2. **Inter-class similarity**: There may be very similar images among different expression categories, which makes classification difficult.
3. **Intra-class differences**: Images within the same expression category may have significant differences due to factors such as skin color, gender, background, and age.
4. **Scale sensitivity**: Changes in image quality and resolution will affect the performance of deep-learning models, especially without proper pre-processing.

To address these challenges, the authors propose a new framework named ARBEx (Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning). This framework combines the following key techniques:

- **Vision Transformer (ViT)**: It is used for feature extraction and deals with common problems in the FEL task, such as scale sensitivity and intra-class differences, through multi-level feature extraction and integration.
- **Reliability balancing mechanism**: By introducing learnable anchors and multi-head self-attention mechanisms, the robustness of the model to uncertainty and unbalanced data is enhanced.
- **Data augmentation and pre-processing**: It adopts significant data augmentation techniques and optimizes the training batch selection method to reduce the risk of over-fitting.

Through these improvements, the ARBEx framework can provide more stable and reliable prediction performance in various facial expression recognition tasks.
Experimental results show that this method significantly outperforms existing state-of-the-art methods on multiple public datasets.

### Key formulas

1. **Cross-attention calculation**:
   \[ q = z_{lm}W_q, \quad k = z_{img}W_k, \quad v = z_{img}W_v \]
   \[ o(i)=\text{softmax}\left(\frac{q(i)k(i)^T}{\sqrt{d}}+b\right)v(i), \quad i = 1,\ldots,I \]
   \[ o=[o(1),\ldots,o(I)]W_o \]
2. **Transformer encoder**:
   \[ X'_{img}=\text{W-MCSA}(X_{img})+X_{img} \]
   \[ X_{img}^O=\text{MLP}(\text{Norm}(X'_{img}))+X'_{img} \]
3. **Multi-head self-attention mechanism**:
   \[ X_o'=\text{MHSA}(X_o)+X_o \]
   \[ X_o^{\text{out}}=\text{MLP}(\text{Norm}(X_o'))+X_o' \]
4. **Confidence function**:
   \[ C(l)=1 - H(l) \]
   \[ H(l)=-\sum_i l_i\log(l_i) \]
5. **Label correction**:
   \[ t_g(e)=\sum_{i = 1}^N\sum_{j = 1}^K s_{ij}(e)\,m_{ij} \]
   \[ t_a=\text{softmax}(W_{\text{out}}) \]
   \[ t=\frac{c_g t_g + c_a t_
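
To make the cross-attention and confidence formulas above more concrete, here is a minimal pure-Python sketch of a single attention head and of the entropy-based confidence. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cross_attention_head`, `confidence`), the list-of-rows matrix layout, and the treatment of the bias `b` as an optional matrix added to the scaled scores are all hypothetical choices, and the natural logarithm is assumed for the entropy. The final label-correction blend is not sketched because its formula is cut off in the text.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def cross_attention_head(z_lm, z_img, w_q, w_k, w_v, bias=None):
    """One head: queries from z_lm, keys and values from z_img.

    `bias` is an optional matrix with the same shape as the score matrix.
    Adding it after the 1/sqrt(d) scaling is an assumption of this sketch.
    """
    q = matmul(z_lm, w_q)
    k = matmul(z_img, w_k)
    v = matmul(z_img, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    logits = [[s / math.sqrt(d) + (bias[i][j] if bias is not None else 0.0)
               for j, s in enumerate(row)]
              for i, row in enumerate(scores)]
    weights = [softmax(row) for row in logits]
    return matmul(weights, v)


def confidence(label_dist):
    """C(l) = 1 - H(l), with H(l) = -sum_i l_i * log(l_i) (natural log).

    Zero-probability entries are skipped, using the convention 0 * log(0) = 0.
    """
    entropy = -sum(p * math.log(p) for p in label_dist if p > 0.0)
    return 1.0 - entropy
```

With the formula exactly as written, a one-hot distribution gets confidence 1, while a flat distribution over three or more classes gets a negative value because its entropy exceeds 1 under the natural log. Dividing the entropy by the log of the number of classes would keep the value in [0, 1], but the text above does not state such a normalization, so the sketch leaves it out.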

ARBEx: Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning
