Gaze-Swin: Enhancing Gaze Estimation with a Hybrid CNN-Transformer Network and Dropkey Mechanism

Ruijie Zhao, Yuhuan Wang, Sihui Luo, Suyao Shou, Pinyan Tang
DOI: https://doi.org/10.3390/electronics13020328
IF: 2.9
2024-01-13
Electronics
Abstract: Gaze estimation, which seeks to reveal where a person is looking, provides a crucial clue for understanding human intentions and behaviors. Recently, Visual Transformer has achieved promising results in gaze estimation. However, dividing facial images into patches compromises the integrity of the image structure, which limits the inference performance. To tackle this challenge, we present Gaze-Swin, an end-to-end gaze estimation model formed with a dual-branch CNN-Transformer architecture. In Gaze-Swin, we adopt the Swin Transformer as the backbone network due to its effectiveness in handling long-range dependencies and extracting global features. Additionally, we incorporate a convolutional neural network as an auxiliary branch to capture local facial features and intricate texture details. To further enhance robustness and address overfitting issues in gaze estimation, we replace the original self-attention in the Transformer branch with Dropkey Assisted Attention (DA-Attention). In particular, this DA-Attention treats keys in the Transformer block as Dropout units and employs a decay Dropout rate schedule to preserve crucial gaze representations in deeper layers. Comprehensive experiments on three benchmark datasets demonstrate the superior performance of our method in comparison to the state of the art.
Categories: engineering, electrical & electronic; computer science, information systems; physics, applied
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address several key challenges in gaze estimation based on facial images. Specifically:

1. **Image Structure Integrity Issue**: Existing Visual Transformers (ViT) segment facial images into multiple patches. While this helps in handling long-range dependencies, it disrupts the overall structure of the image, thereby affecting inference performance.
2. **Balance Between Local and Global Feature Extraction**: Traditional Convolutional Neural Networks (CNN) excel at extracting local features but have limitations in capturing long-range dependencies and global features. Conversely, Transformers perform well in handling long-range dependencies but may not fully capture local details.
3. **Overfitting Issue**: In gaze estimation tasks, datasets typically contain a limited number of participants and variations, which can lead to model overfitting. Therefore, a mechanism is needed to enhance the model's robustness and generalization ability.

To address these challenges, the authors propose **Gaze-Swin**, a dual-branch architecture model that combines CNN and Swin Transformer, and introduces the Dropkey Assisted Attention Mechanism (DA-Attention) to further improve the model's robustness and prevent overfitting.

### Solution

1. **Dual-Branch CNN-Transformer Architecture**:
   - **Swin Transformer Branch**: Utilizes Swin Transformer as the backbone network, effectively handling long-range dependencies and extracting global features through its hierarchical transformer structure and shifted window mechanism.
   - **CNN Branch**: Employs Convolutional Neural Networks (CNN) as an auxiliary branch to capture local facial features and complex texture details.
2. **Dropkey Assisted Attention Mechanism (DA-Attention)**:
   - Treats the keys in the Transformer blocks as Dropout units and adopts a decayed Dropout rate schedule to retain critical gaze representations in deeper layers, thereby enhancing the model's robustness and stability.

### Experimental Results

The authors conducted extensive experiments on three benchmark datasets, and the results show that Gaze-Swin outperforms existing methods in gaze estimation tasks. Notably, Gaze-Swin excels in datasets with diverse head poses, such as Gaze360.

### Main Contributions

1. **Innovatively applying a dual-branch CNN-Transformer architecture**, addressing the common image structure distortion issue in ViT for gaze estimation.
2. **Introducing the DA-Attention mechanism**, enhancing the model's robustness and reducing the risk of overfitting.
3. **Achieving state-of-the-art performance on three benchmark datasets**, demonstrating the effectiveness and superiority of Gaze-Swin.

Through these innovations, Gaze-Swin has made significant progress in the field of gaze estimation, providing new directions for future related research.
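
The idea behind DA-Attention, treating keys as Dropout units with a drop rate that decays in deeper layers, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names `decayed_drop_rate` and `dropkey_attention`, the linear decay, the base rate of 0.2, the single-head layout, and the `-1e9` masking constant are all assumptions made for illustration. The summary above only says that the rate decays with depth.

```python
import numpy as np


def decayed_drop_rate(layer_idx, num_layers, base_rate=0.1):
    # Assumed linear decay: base_rate at the first layer, 0.0 at the last.
    if num_layers <= 1:
        return base_rate
    return base_rate * (1.0 - layer_idx / (num_layers - 1))


def dropkey_attention(q, k, v, drop_rate, rng, training=True):
    # q, k, v: arrays of shape (tokens, dim) for a single attention head.
    logits = q @ k.T / np.sqrt(q.shape[-1])
    if training and drop_rate > 0.0:
        # Each (query, key) logit is masked independently with probability drop_rate.
        dropped = rng.random(logits.shape) < drop_rate
        # A large negative offset pushes masked keys to ~0 weight after softmax.
        logits = logits + dropped * -1e9
    # Numerically stable softmax over the key axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
for layer in range(4):
    rate = decayed_drop_rate(layer, num_layers=4, base_rate=0.2)
    out, weights = dropkey_attention(q, k, v, rate, rng)
    print(layer, round(rate, 3), out.shape)
```

Because the mask is applied to the logits before the softmax, the remaining keys are renormalized and each row of attention weights still sums to one. Standard attention Dropout instead zeroes weights after the softmax.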