TransGaze: exploring plain vision transformers for gaze estimation

Lang Ye, Xinggang Wang, Jingfeng Yao, Wenyu Liu
DOI: https://doi.org/10.1007/s00138-024-01609-0
IF: 2.983
2024-09-23
Machine Vision and Applications
Abstract: Recently, plain vision transformers (ViTs) have shown impressive performance in various computer vision tasks, owing to their powerful modeling capabilities and large-scale pre-training. However, they have yet to show excellent results in gaze estimation. In this paper, we bring advanced vision transformers to the task of gaze estimation (TransGaze). Our framework integrates the distinctive local features of the eyes while maintaining a simple and flexible structure. It adapts readily to various large-scale pre-trained models, which broadens its applicability across contexts. It is the first to demonstrate that pre-trained ViTs can also show strong capabilities on gaze estimation tasks. Our approach employs the following strategies: (i) enhancing the self-attention module among facial feature maps through straightforward token manipulation, which achieves complex feature fusion that previously required more intricate methods; (ii) leveraging the plain structure of TransGaze and the inherent adaptability of plain ViTs, we introduce a pre-trained model for gaze estimation. This model reduces training time by over 50% and exhibits strong generalization performance. We evaluate TransGaze on the GazeCapture and MPIIFaceGaze datasets and achieve state-of-the-art performance at lower training cost. Our models and code will be made available.
computer science, cybernetics, artificial intelligence, engineering, electrical & electronic