Abstract: Gaze estimation, which seeks to reveal where a person is looking, provides a crucial clue for understanding human intentions and behaviors. Recently, Visual Transformer has achieved promising results in gaze estimation. However, dividing facial images into patches compromises the integrity of the image structure, which limits the inference performance. To tackle this challenge, we present Gaze-Swin, an end-to-end gaze estimation model formed with a dual-branch CNN-Transformer architecture. In Gaze-Swin, we adopt the Swin Transformer as the backbone network due to its effectiveness in handling long-range dependencies and extracting global features. Additionally, we incorporate a convolutional neural network as an auxiliary branch to capture local facial features and intricate texture details. To further enhance robustness and address overfitting issues in gaze estimation, we replace the original self-attention in the Transformer branch with Dropkey Assisted Attention (DA-Attention). In particular, this DA-Attention treats keys in the Transformer block as Dropout units and employs a decay Dropout rate schedule to preserve crucial gaze representations in deeper layers. Comprehensive experiments on three benchmark datasets demonstrate the superior performance of our method in comparison to the state of the art.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address several key challenges in gaze estimation based on facial images. Specifically:

1. **Image Structure Integrity Issue**: Existing Visual Transformers (ViT) segment facial images into multiple patches. While this helps in handling long-range dependencies, it disrupts the overall structure of the image, thereby affecting inference performance.
2. **Balance Between Local and Global Feature Extraction**: Traditional Convolutional Neural Networks (CNN) excel at extracting local features but have limitations in capturing long-range dependencies and global features. Conversely, Transformers perform well in handling long-range dependencies but may not fully capture local details.
3. **Overfitting Issue**: In gaze estimation tasks, datasets typically contain a limited number of participants and variations, which can lead to model overfitting. Therefore, a mechanism is needed to enhance the model's robustness and generalization ability.

To address these challenges, the authors propose **Gaze-Swin**, a dual-branch architecture model that combines CNN and Swin Transformer, and introduces the Dropkey Assisted Attention Mechanism (DA-Attention) to further improve the model's robustness and prevent overfitting.

### Solution

1. **Dual-Branch CNN-Transformer Architecture**:
   - **Swin Transformer Branch**: Utilizes Swin Transformer as the backbone network, effectively handling long-range dependencies and extracting global features through its hierarchical transformer structure and shifted window mechanism.
   - **CNN Branch**: Employs Convolutional Neural Networks (CNN) as an auxiliary branch to capture local facial features and complex texture details.
2. **Dropkey Assisted Attention Mechanism (DA-Attention)**:
   - Treats the keys in the Transformer blocks as Dropout units and adopts a decayed Dropout rate schedule to retain critical gaze representations in deeper layers, thereby enhancing the model's robustness and stability.

### Experimental Results

The authors conducted extensive experiments on three benchmark datasets, and the results show that Gaze-Swin outperforms existing methods in gaze estimation tasks. Notably, Gaze-Swin excels in datasets with diverse head poses, such as Gaze360.

### Main Contributions

1. **Innovatively applying a dual-branch CNN-Transformer architecture**, addressing the common image structure distortion issue in ViT for gaze estimation.
2. **Introducing the DA-Attention mechanism**, enhancing the model's robustness and reducing the risk of overfitting.
3. **Achieving state-of-the-art performance on three benchmark datasets**, demonstrating the effectiveness and superiority of Gaze-Swin.

Through these innovations, Gaze-Swin has made significant progress in the field of gaze estimation, providing new directions for future related research.
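### Illustrative Sketch of DA-Attention

The DA-Attention idea summarized above (dropping keys rather than attention weights, with a drop rate that decays in deeper layers) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The linear shape of the schedule, the starting rate of 0.3, the single-head formulation, and the function names `drop_rate_for_layer`, `dropkey_weights`, and `dropkey_attention` are all assumptions made for illustration. The sketch only shows the general mechanism: selected query-key logits are masked to negative infinity before the softmax, so the surviving attention weights still sum to 1.

```python
import math
import random


def drop_rate_for_layer(layer_idx, num_layers, initial_rate=0.3):
    """Decay the drop rate from initial_rate (first layer) to 0 (last layer).

    Assumption: a linear schedule and a 0.3 starting rate, for illustration only.
    """
    if num_layers <= 1:
        return initial_rate
    return initial_rate * (1.0 - layer_idx / (num_layers - 1))


def softmax(logits):
    """Numerically stable softmax that maps -inf logits to zero weight."""
    neg_inf = float("-inf")
    peak = max(x for x in logits if x != neg_inf)
    exps = [0.0 if x == neg_inf else math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropkey_weights(query, keys, drop_rate, rng, training=True):
    """Attention weights for one query, masking keys before the softmax."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    if training and drop_rate > 0.0:
        dropped = [rng.random() < drop_rate for _ in keys]
        if all(dropped):
            # Keep one key so the softmax stays well defined (a sketch-level safeguard).
            dropped[rng.randrange(len(keys))] = False
        logits = [float("-inf") if d else x for x, d in zip(logits, dropped)]
    return softmax(logits)


def dropkey_attention(queries, keys, values, drop_rate, rng, training=True):
    """Single-head scaled dot-product attention with key dropping."""
    dim = len(values[0])
    outputs = []
    for query in queries:
        weights = dropkey_weights(query, keys, drop_rate, rng, training)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    num_layers = 4
    for layer in range(num_layers):
        rate = drop_rate_for_layer(layer, num_layers)
        out = dropkey_attention(tokens, tokens, tokens, rate, rng)
        print(layer, round(rate, 3), out[0])
```

Because the mask is applied before normalization, no rescaling of the output is needed, and passing `training=False` recovers ordinary attention. A real model would apply the same masking to batched, multi-head attention tensors inside each Swin window.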

Gaze-Swin: Enhancing Gaze Estimation with a Hybrid CNN-Transformer Network and Dropkey Mechanism

Gaze Estimation Based on Convolutional Structure and Sliding Window-Based Attention Mechanism

Gaze Estimation Based on the Improved Xception Network

Gaze Estimation using Transformer

HybridGazeNet: Geometric model guided Convolutional Neural Networks for gaze estimation

Highly efficient gaze estimation method using online convolutional re-parameterization

Monocular 3D gaze estimation using feature discretization and attention mechanism

A Generalized and Robust Method Towards Practical Gaze Estimation on Smart Phone

SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection

SwinFace: A Multi-task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation

Merging Multiple Datasets for Improved Appearance-Based Gaze Estimation

Fine-grained gaze estimation based on the combination of regression and classification losses

Gaze Estimation via Modulation-based Adaptive Network with Auxiliary Self-Learning

Adaptive Swin Transformers for Few-Shot Cross-Domain Silent Face Liveness Detection

Transgaze: exploring plain vision transformers for gaze estimation

Deep Multitask Gaze Estimation with a Constrained Landmark-Gaze Model

EG-Net: Appearance-based eye gaze estimation using an efficient gaze network with attention mechanism

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Gaze Estimation Method Combining Facial Feature Extractor with Pyramid Squeeze Attention Mechanism