Abstract: Finding the eye and parsing out its parts (e.g. pupil and iris) is a key prerequisite for image-based eye tracking, which has become an indispensable module in today's head-mounted VR/AR devices. However, a typical route for training a segmenter requires tedious hand-labeling. In this work, we explore an unsupervised way. First, we utilize priors of the human eye and extract signals from the image to establish rough clues indicating the eye-region structure. Upon these sparse and noisy clues, a segmentation network is trained to gradually identify the precise area for each part. To achieve accurate parsing of the eye region, we first leverage the pretrained foundation model Segment Anything (SAM) in an automatic way to refine the eye indications. Then, the learning process is designed in an end-to-end manner following a progressive and prior-aware principle. Experiments show that our unsupervised approach can easily achieve 90% (the pupil and iris) and 85% (the whole eye region) of the performance under supervised learning.

What problem does this paper attempt to address?

The problem this paper addresses is how to segment the eye region (including parts such as the pupil and iris) for image-based eye tracking using unsupervised learning. Specifically, the authors aim to avoid the large amount of manually annotated data that traditional methods require, which improves the efficiency and adaptability of model training, especially when hardware iterates rapidly.

### Background of the paper and problem description

1. **Importance of eye tracking**:
   - Eye-tracking technology has become increasingly important in recent years, especially since its integration into VR/AR devices. It provides valuable information about the user's visual process and reveals the user's intentions and behaviors.
   - This information can be applied in many fields, such as gaze-based rendering, medical diagnosis, and remote support, and has the potential to revolutionize human-computer interaction.
2. **Limitations of traditional methods**:
   - Traditional methods for eye-region segmentation rely on large manually annotated datasets. Producing them is time-consuming and labor-intensive, and the effort must be repeated whenever the hardware changes.
   - Manually annotating pixel-level masks in particular is too laborious to keep up with the needs of practical applications.
3. **Research motivation**:
   - To solve these problems, the authors explore an unsupervised learning method for eye-region segmentation. The method uses prior knowledge of the human eye and low-level feature signals in the image, which reduces the dependence on manually annotated data.

### Overview of the solution

The proposed method consists of the following steps:

1. **Extract rough cues using prior knowledge and image signals**:
   - Exploit the brightness pattern of the human eye (brightness gradually increases from the pupil to the iris and then to the sclera) and compute image gradients to initially locate the boundaries of the pupil and iris.
   - Use the pretrained foundation model Segment Anything (SAM) to automatically refine these rough indication signals.
2. **Design an end-to-end unsupervised learning framework**:
   - Train a segmentation network on the sparse and noisy indication signals so that it gradually identifies the accurate area of each part.
   - The learning process follows progressive and prior-aware principles, which help it resist the noise in the training signals.
3. **Experimental verification**:
   - Experiments on multiple datasets show that the unsupervised method approaches supervised learning: it reaches about 90% of the supervised performance for pupil and iris segmentation and about 85% for the whole eye region.

### Key formulas

The following formulas describe the key steps of the cue extraction:

- **Gradient calculation**:
  \[ G = \text{Sobel}(I) \]
  where \(I\in\mathbb{R}^{w\times h}\) is the input image and \(G\in\mathbb{R}^{w\times h\times 2}\) is the resulting gradient map.
- **Angle condition**:
  \[ \cos\theta_{i} = \frac{\mathbf{g}_{i}\cdot\mathbf{v}_{i}}{\|\mathbf{g}_{i}\|\,\|\mathbf{v}_{i}\|} > 0 \]
  where \(\mathbf{g}_{i}\) is the gradient vector at pixel \(p_{i}\) and \(\mathbf{v}_{i}\) is the vector from the center point \(p_{o}\) to \(p_{i}\).
- **Gradient retention rule**:
  \[ \hat{\mathbf{g}}_{j} = \mathbf{g}_{j}\cdot 1_{\mathbb{R}^{+}}\left(\frac{1}{|k_{j}|}\sum_{p_{i}\in k_{j}} 1_{\mathbb{R}^{+}}(\cos\theta_{i}) - r_{th}\right) \]
  where \(1_{\mathbb{R}^{+}}(\cdot)\) is the indicator function of the positive reals. The gradient \(\mathbf{g}_{j}\) is retained only when the fraction of pixels in \(k_{j}\) that satisfy the angle condition exceeds the threshold \(r_{th}\).
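The three formulas above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not define \(k_{j}\) or give a value for \(r_{th}\), so the sketch assumes \(k_{j}\) is a square window centered on pixel \(p_{j}\) and uses \(r_{th}=0.5\). The function names (`sobel_gradients`, `outward_mask`, `retain_gradients`), the `window` parameter, and the edge-replicating border handling are all hypothetical choices made for the example. Arrays are stored row-major as `(h, w)`, and the center is given as `(x, y)`.

```python
import numpy as np


def sobel_gradients(image):
    """G = Sobel(I): returns an (h, w, 2) array holding (gx, gy) per pixel."""
    img = np.pad(image.astype(float), 1, mode="edge")  # assumed border handling
    # Horizontal response: right column minus left column, weights 1-2-1.
    gx = (img[:-2, 2:] + 2 * img[1:-1, 2:] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[1:-1, :-2] - img[2:, :-2])
    # Vertical response: bottom row minus top row, weights 1-2-1.
    gy = (img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[:-2, 1:-1] - img[:-2, 2:])
    return np.stack([gx, gy], axis=-1)


def outward_mask(G, center):
    """Angle condition: True where cos(theta_i) > 0, with v_i = p_i - p_o."""
    h, w = G.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    v = np.stack([xs - center[0], ys - center[1]], axis=-1).astype(float)
    dot = (G * v).sum(axis=-1)
    norm = np.linalg.norm(G, axis=-1) * np.linalg.norm(v, axis=-1)
    # Where either vector is zero the angle is undefined; treat cos as 0.
    cos = np.divide(dot, norm, out=np.zeros_like(dot), where=norm > 0)
    return cos > 0


def box_sum(a, window):
    """Sum of `a` over an odd-sized square window around each pixel."""
    pad = window // 2
    p = np.pad(a, pad, mode="constant")
    h, w = a.shape
    return sum(p[dy:dy + h, dx:dx + w]
               for dy in range(window) for dx in range(window))


def retain_gradients(G, center, window=3, r_th=0.5):
    """Retention rule: keep g_j if the passing fraction in k_j exceeds r_th.

    ASSUMPTION: k_j is a `window` x `window` neighborhood of pixel j,
    clipped at the image border; r_th = 0.5 is an illustrative value.
    """
    mask = outward_mask(G, center).astype(float)
    fraction = box_sum(mask, window) / box_sum(np.ones_like(mask), window)
    keep = (fraction - r_th) > 0
    return G * keep[..., None]


# Usage on a synthetic "eye": brightness grows with distance from the center,
# mimicking the pupil -> iris -> sclera brightness prior.
ys, xs = np.mgrid[0:21, 0:21]
image = np.hypot(xs - 10, ys - 10)
G = sobel_gradients(image)
kept = retain_gradients(G, center=(10, 10))
```

On this synthetic image the gradients point away from the center, so they pass the angle condition and are retained. Negating the image flips the gradients inward, and the same rule then zeroes them out, which is the behavior the formulas use to filter out edges that contradict the brightness prior.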

Towards Unsupervised Eye-Region Segmentation for Eye Tracking

Semi-Supervised Learning for Visual Bird's Eye View Semantic Segmentation

CondSeg: Ellipse Estimation of Pupil and Iris via Conditioned Segmentation

Shape Constrained Network for Eye Segmentation in the Wild

Zero-Shot Segmentation of Eye Features Using the Segment Anything Model (SAM)

Learning Unsupervised Video Object Segmentation Through Visual Attention

Self-supervised pre-training for joint optic disc and cup segmentation via attention-aware network

Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million Images

Unsupervised Salient Object Segmentation From Color Images

Eye-Gaze Tracking Research Based on Image Processing

Eyenet: Attention based Convolutional Encoder-Decoder Network for Eye Region Segmentation

A new eye segmentation method based on improved U2Net in TCM eye diagnosis

Eye-UNet: a UNet-based network with attention mechanism for low-quality human eye image segmentation

Semi-supervised contrast learning-based segmentation of choroidal vessel in optical coherence tomography images

RSAP-Net: joint optic disc and cup segmentation with a residual spatial attention path module and MSRCR-PT pre-processing algorithm

Segment Anything without Supervision

Unsupervised Video Object Segmentation with Joint Hotspot Tracking

Gaze Estimation with Eye Region Segmentation and Self-Supervised Multistream Learning

Reconstruction-driven Dynamic Refinement based Unsupervised Domain Adaptation for Joint Optic Disc and Cup Segmentation

Learning Unsupervised Gaze Representation via Eye Mask Driven Information Bottleneck