ARBEx: Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning

Azmine Toushik Wasi, Karlo Šerbetar, Raima Islam, Taki Hasan Rafi, Dong-Kyu Chae
2024-10-24
Abstract: In this paper, we introduce ARBEx, a novel attentive feature extraction framework driven by Vision Transformer with reliability balancing to cope with poor class distributions, bias, and uncertainty in the facial expression learning (FEL) task. We reinforce several data pre-processing and refinement methods along with a window-based cross-attention ViT to squeeze the best out of the data. We also employ learnable anchor points in the embedding space with label distributions and a multi-head self-attention mechanism to optimize performance against weak predictions with reliability balancing, which is a strategy that leverages anchor points, attention scores, and confidence values to enhance the resilience of label predictions. To ensure correct label classification and improve the model's discriminative power, we introduce anchor loss, which encourages large margins between anchor points. Additionally, the multi-head self-attention mechanism, which is also trainable, plays an integral role in identifying accurate labels. This approach provides critical elements for improving the reliability of predictions and has a substantial positive effect on final prediction capabilities. Our adaptive model can be integrated with any deep neural network to forestall challenges in various recognition tasks. Our strategy outperforms current state-of-the-art methodologies, according to extensive experiments conducted in a variety of contexts.
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problems does this paper attempt to solve?

This paper aims to solve several key challenges in the Facial Expression Learning (FEL) task. Specifically, these problems include:

1. **Processing of global information**: Existing FEL methods are unable to fully capture the global information of input images due to the limitation of the local receptive field of Convolutional Neural Networks (CNNs).
2. **Inter-class similarity**: There may be very similar images among different expression categories, which makes classification difficult.
3. **Intra-class differences**: Images within the same expression category may have significant differences due to factors such as skin color, gender, background, and age.
4. **Scale sensitivity**: Changes in image quality and resolution will affect the performance of deep-learning models, especially without proper pre-processing.

To address these challenges, the authors propose a new framework named ARBEx (Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning). This framework combines the following key techniques:

- **Vision Transformer (ViT)**: It is used for feature extraction and deals with common problems in the FEL task, such as scale sensitivity and intra-class differences, through multi-level feature extraction and integration.
- **Reliability balancing mechanism**: By introducing learnable anchors and multi-head self-attention mechanisms, the robustness of the model to uncertainty and unbalanced data is enhanced.
- **Data augmentation and pre-processing**: It adopts significant data augmentation techniques and optimizes the training batch selection method to reduce the risk of over-fitting.

Through these improvements, the ARBEx framework can provide more stable and reliable prediction performance in various facial expression recognition tasks.
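
The reliability balancing idea described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the choice of a softmax over negative squared distances as the similarity measure, and the normalization of the entropy by log K are all assumptions made here for illustration. The sketch shows how an embedding could be turned into a label distribution from labeled anchor points, and how an entropy-based confidence value could be computed for any label distribution.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(dist):
    """Confidence as 1 minus Shannon entropy.

    Assumption: the entropy is divided by log(K) so the result lies in [0, 1].
    Requires at least two classes.
    """
    k = len(dist)
    entropy = -sum(p * math.log(p) for p in dist if p > 0.0)
    return 1.0 - entropy / math.log(k)


def anchor_label_distribution(embedding, anchors, num_classes):
    """Similarity-weighted label distribution from labeled anchors.

    `anchors` is a list of (class_index, vector) pairs (hypothetical layout).
    Assumption: similarity is a softmax over negative squared distances.
    """
    neg_sq_dists = [
        -sum((e - a) ** 2 for e, a in zip(embedding, vec)) for _, vec in anchors
    ]
    sims = softmax(neg_sq_dists)
    dist = [0.0] * num_classes
    for (cls, _), s in zip(anchors, sims):
        dist[cls] += s  # each anchor votes for its own class
    return dist


# Toy example: two anchors per class in a 2-D embedding space.
toy_anchors = [(0, [0.0, 0.0]), (0, [0.0, 1.0]), (1, [5.0, 5.0]), (1, [5.0, 6.0])]
t_g = anchor_label_distribution([0.1, 0.2], toy_anchors, num_classes=2)
c_g = confidence(t_g)
```

In this toy setup the embedding sits next to the class-0 anchors, so `t_g` puts almost all of its mass on class 0 and `c_g` is close to 1. A uniform distribution would instead yield a confidence of 0, marking a prediction that a balancing step would have little reason to trust.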
Experimental results show that this method significantly outperforms the existing state-of-the-art methods on multiple public datasets.

### Key formulas

1. **Cross-attention calculation**:
   \[ q = z_{lm}W_q, \quad k = z_{img}W_k, \quad v = z_{img}W_v \]
   \[ o(i)=\text{softmax}\left(\frac{q(i)k(i)^T}{\sqrt{d}}+b\right)v(i), \quad i = 1,\ldots,I \]
   \[ o=[o(1),\ldots,o(I)]W_o \]
2. **Transformer encoder**:
   \[ X'_{img}=\text{W-MCSA}(X_{img})+X_{img} \]
   \[ X_{img}^O=\text{MLP}(\text{Norm}(X'_{img}))+X'_{img} \]
3. **Multi-head self-attention mechanism**:
   \[ X_o'=\text{MHSA}(X_o)+X_o \]
   \[ X_o^{\text{out}}=\text{MLP}(\text{Norm}(X_o'))+X_o' \]
4. **Confidence function**:
   \[ C(l)=1 - H(l) \]
   \[ H(l)=-\sum_i l_i\log(l_i) \]
5. **Label correction**:
   \[ t_g(e)=\sum_{i = 1}^N\sum_{j = 1}^K s_{ij}(e)m_{ij} \]
   \[ t_a=\text{softmax}(W_{\text{out}}) \]
   \[ t=\frac{c_g t_g + c_a t_