Jiangfan Deng, Zhuang Jia, Zhaoxue Wang, Xiang Long, Daniel K. Du
Abstract: Finding the eye and parsing out the parts (e.g. pupil and iris) is a key prerequisite for image-based eye tracking, which has become an indispensable module in today's head-mounted VR/AR devices. However, a typical route for training a segmenter requires tedious hand-labeling. In this work, we explore an unsupervised way. First, we utilize priors of the human eye and extract signals from the image to establish rough clues indicating the eye-region structure. Upon these sparse and noisy clues, a segmentation network is trained to gradually identify the precise area for each part. To achieve accurate parsing of the eye region, we first leverage the pretrained foundation model Segment Anything (SAM) in an automatic way to refine the eye indications. Then, the learning process is designed in an end-to-end manner following progressive and prior-aware principles. Experiments show that our unsupervised approach can easily achieve 90% (the pupil and iris) and 85% (the whole eye region) of the performances under supervised learning.
What problem does this paper attempt to address?
This paper addresses how to segment the eye region (including parts such as the pupil and iris) for image-based eye tracking using unsupervised learning. Specifically, the authors aim to avoid the large amount of manually annotated data that conventional methods require, which makes model training more efficient and easier to adapt, especially when hardware iterates rapidly.
### Background of the paper and problem description
1. **Importance of eye tracking**:
- Eye tracking has become increasingly important in recent years, especially since its integration into VR/AR devices. It provides valuable information about the user's visual process and reveals the user's intentions and behaviors.
- This information applies to many fields, such as gaze-based rendering, medical diagnosis, and remote support, and has the potential to revolutionize human-computer interaction.
2. **Limitations of traditional methods**:
- Traditional methods for eye-region segmentation rely on large manually annotated datasets. Producing them is time-consuming and labor-intensive, and the approach scales poorly when hardware is updated rapidly.
- Pixel-level mask annotation in particular is laborious, which makes it hard to meet the needs of practical applications.
3. **Research motivation**:
- To address these problems, the authors explore an unsupervised learning method for eye-region segmentation. The method uses prior knowledge of the human eye together with low-level signals from the image, reducing the dependence on manually annotated data.
### Overview of the solution
The proposed method consists of the following steps:
1. **Extract rough cues using prior knowledge and image signals**:
- Exploit the brightness pattern of the human eye (brightness gradually increases from the pupil to the iris and then to the sclera) and compute image gradients to roughly locate the pupil and iris boundaries.
- Use the pretrained foundation model Segment Anything (SAM) to automatically refine these rough cues.
2. **Design an end-to-end unsupervised learning framework**:
- Train a segmentation network on the sparse and noisy cues so that it gradually identifies the precise area of each part.
- Design the learning process around progressive and prior-aware principles, which helps it resist the noise in the training signals.
3. **Experimental verification**:
- Experiments on multiple datasets show that the unsupervised method reaches about 90% of the performance of supervised learning for the pupil and iris, and about 85% for the whole eye region.
### Key formulas
The following formulas describe key steps of the method:
- **Gradient calculation**:
\[
G=\text{Sobel}(I)
\]
where \(I\in\mathbb{R}^{w\times h}\) is the input image and \(G\in\mathbb{R}^{w\times h\times2}\) is the resulting gradient map, whose two channels hold the horizontal and vertical derivatives at each pixel.
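As a rough illustration of the gradient step, the sketch below applies the standard 3×3 Sobel kernels to a small grayscale image stored as nested lists. The function name `sobel_gradient`, the edge-replication border handling, and the use of plain Python instead of an image library are assumptions made for this example; the summary does not describe the authors' implementation.

```python
# Standard 3x3 Sobel kernels (horizontal and vertical derivative).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_gradient(image):
    """Return grad[y][x] = (gx, gy) for a 2-D list of pixel intensities.

    Borders are handled by replicating the nearest edge pixel, which is
    an assumption of this sketch rather than something the text specifies.
    """
    h, w = len(image), len(image[0])

    def pixel(y, x):
        # Clamp coordinates so out-of-range reads reuse the edge pixel.
        return image[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    offsets = (-1, 0, 1)
    grad = [[(0.0, 0.0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = sum(SOBEL_X[dy + 1][dx + 1] * pixel(y + dy, x + dx)
                     for dy in offsets for dx in offsets)
            gy = sum(SOBEL_Y[dy + 1][dx + 1] * pixel(y + dy, x + dx)
                     for dy in offsets for dx in offsets)
            grad[y][x] = (float(gx), float(gy))
    return grad


# A horizontal ramp: brightness grows with x, so the gradient points along +x.
ramp = [[x for x in range(5)] for _ in range(5)]
print(sobel_gradient(ramp)[2][2])  # (8.0, 0.0)
```

The gradient vector points toward increasing brightness, which is what makes it usable as a cue for dark-to-bright boundaries.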
- **Angle condition**:
\[
\cos\theta_{i}=\frac{\mathbf{g}_{i}\cdot\mathbf{v}_{i}}{\|\mathbf{g}_{i}\|\|\mathbf{v}_{i}\|}>0
\]
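where \(\mathbf{g}_{i}\) is the gradient vector at pixel \(p_{i}\) and \(\mathbf{v}_{i}\) is the vector from the center point \(p_{o}\) to \(p_{i}\). The condition holds when the gradient points away from the center, which is consistent with the prior that brightness increases outward from the pupil.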
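The angle condition can be checked per pixel with a few lines of arithmetic. The sketch below is a minimal illustration, assuming `(x, y)` tuples for all vectors; the function name `cos_theta` and the decision to return 0 for zero-length vectors are choices made for this example, not details taken from the paper.

```python
import math


def cos_theta(gradient, pixel, center):
    """Cosine of the angle between a pixel's gradient and the ray center -> pixel.

    All arguments are (x, y) tuples. Returns 0.0 when either vector has zero
    length, so such pixels fail the "> 0" test (a choice made for this sketch).
    """
    vx, vy = pixel[0] - center[0], pixel[1] - center[1]
    gx, gy = gradient
    norm = math.hypot(gx, gy) * math.hypot(vx, vy)
    if norm == 0.0:
        return 0.0
    return (gx * vx + gy * vy) / norm


center = (10, 10)
# Gradient pointing away from the center: condition satisfied.
print(cos_theta((1.0, 0.0), (13, 10), center) > 0)   # True
# Gradient pointing back toward the center: condition violated.
print(cos_theta((-1.0, 0.0), (13, 10), center) > 0)  # False
```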
- **Gradient retention rule**:
\[
\hat{\mathbf{g}}_{j}=\mathbf{g}_{j}\cdot1_{\mathbb{R}^{+}}\left(\frac{1}{|k_{j}|}\sum_{p_{i}\in k_{j}}1_{\mathbb{R}^{+}}(\cos\theta_{i})-r_{th}\right)
\]
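where \(1_{\mathbb{R}^{+}}(\cdot)\) is the indicator function, equal to 1 for a positive argument and 0 otherwise. The gradient \(\mathbf{g}_{j}\) is therefore retained only when the fraction of pixels in the set \(k_{j}\) that satisfy the angle condition exceeds the threshold \(r_{th}\); otherwise it is set to zero.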
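The retention rule amounts to a vote among the pixels of a group. The following sketch shows one way to read it, assuming the cosines \(\cos\theta_{i}\) for the pixels in \(k_{j}\) have already been computed; the function name `retain_gradient`, the handling of an empty group, and the threshold value 0.5 used in the calls are hypothetical and chosen only for illustration.

```python
def retain_gradient(g_j, cosines, r_th):
    """Return g_j if the share of positive cosines exceeds r_th, else a zero vector.

    cosines holds cos(theta_i) for every pixel p_i in the group k_j.
    """
    if not cosines:
        return (0.0, 0.0)  # empty group: nothing supports the gradient
    # Inner indicator: count the pixels that satisfy cos(theta_i) > 0.
    positive_ratio = sum(1 for c in cosines if c > 0) / len(cosines)
    # Outer indicator: 1 when (ratio - r_th) is strictly positive, else 0.
    return g_j if positive_ratio - r_th > 0 else (0.0, 0.0)


# Three of four pixels agree (0.75 > 0.5): the gradient is kept.
print(retain_gradient((2.0, -1.0), [0.9, 0.5, -0.2, 0.1], r_th=0.5))   # (2.0, -1.0)
# Only two of four agree (0.5 is not above 0.5): the gradient is zeroed.
print(retain_gradient((2.0, -1.0), [0.9, -0.5, -0.2, 0.1], r_th=0.5))  # (0.0, 0.0)
```

Because the indicator is defined on the positive reals, a ratio exactly equal to the threshold does not retain the gradient in this reading.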