Abstract: We present GazeGen, a user interaction system that generates visual content (images and videos) for locations indicated by the user's eye gaze. GazeGen allows intuitive manipulation of visual content by targeting regions of interest with gaze. Using advanced techniques in object detection and generative AI, GazeGen performs gaze-controlled adding, deleting, and repositioning of image objects, changes their surface materials, and converts static images into videos. Central to GazeGen is the DFT Gaze (Distilled and Fine-Tuned Gaze) agent, an ultra-lightweight model with only 281K parameters, performing accurate real-time gaze predictions tailored to individual users' eyes on small edge devices. GazeGen is the first system to combine visual content generation with real-time gaze estimation, made possible exclusively by DFT Gaze. This real-time gaze estimation enables various visual content generation tasks, all controlled by the user's gaze. The input for DFT Gaze is the user's eye images, while the inputs for visual content generation are the user's view and the predicted gaze point from DFT Gaze. To achieve efficient gaze predictions, we derive the small model from a large model (10x larger) via novel knowledge distillation and personal adaptation techniques. We integrate knowledge distillation with a masked autoencoder, developing a compact yet powerful gaze estimation model. This model is further fine-tuned with Adapters, enabling highly accurate and personalized gaze predictions with minimal user input. DFT Gaze ensures low-latency and precise gaze tracking, supporting a wide range of gaze-driven tasks. We validate the performance of DFT Gaze on AEA and OpenEDS2020 benchmarks, demonstrating low angular gaze error and low latency on the edge device (Raspberry Pi 4). Furthermore, we describe applications of GazeGen, illustrating its versatility and effectiveness in various usage scenarios.
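The abstract reports accuracy as angular gaze error. As a rough illustration of that metric, the sketch below computes the angle between a predicted and a ground-truth gaze direction. It assumes gaze is represented as a 3D direction vector; the function name `angular_error_deg` is hypothetical, and the abstract does not state the exact evaluation protocol used on AEA or OpenEDS2020.

```python
import math
from typing import Sequence


def angular_error_deg(pred: Sequence[float], target: Sequence[float]) -> float:
    """Angle in degrees between two 3D gaze direction vectors.

    Assumption: both inputs are non-zero direction vectors; they do not
    need to be unit length, because the cosine is normalized below.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_target = math.sqrt(sum(t * t for t in target))
    if norm_pred == 0.0 or norm_target == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cos_angle = max(-1.0, min(1.0, dot / (norm_pred * norm_target)))
    return math.degrees(math.acos(cos_angle))
```

For example, `angular_error_deg((0, 0, 1), (0, 0, 1))` is zero, and two perpendicular directions give 90 degrees. A benchmark score would typically be the mean of this value over all test frames.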

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "GazeGen: Gaze-Driven User Interaction for Visual Content Generation" aims to solve the following problems:

1. **Intuitive and accessible visual content editing**:
   - Traditional visual content editing interfaces usually rely on physical input, which can be restrictive for users with physical disabilities. The paper proposes GazeGen, an eye-tracking-based system that generates and edits images and videos at the user's gaze point, enabling hands-free interaction and improving user engagement and accessibility.
2. **Real-time, high-precision gaze estimation**:
   - Existing gaze estimation models are usually large and hard to run in real time on edge devices. The paper develops DFT Gaze, an ultra-lightweight model derived through knowledge distillation and personal adaptation. The model has only 281K parameters and delivers accurate, real-time gaze predictions on small edge devices.
3. **Multi-task visual content generation**:
   - The paper proposes a framework that uses the user's gaze point to drive several visual content generation tasks: adding, deleting, and repositioning image objects, changing their surface materials, and converting static images into videos. These tasks require not only efficient gaze estimation but also object detection and generative AI techniques.
4. **Personalization and adaptability**:
   - Because eye shape and eye movement patterns differ between people, personalization is key to accurate gaze estimation. DFT Gaze is fine-tuned with Adapters on a small number of user-specific samples, so it adapts to each user's eyes while requiring minimal user input.
5. **Wide range of application scenarios**:
   - The paper describes applications of GazeGen in various scenarios, including design, entertainment, and education. By combining object detection and generative AI methods, GazeGen can simplify complex tasks and make visual content creation more intuitive and efficient.

### Summary

By combining efficient gaze estimation with generative AI, GazeGen lets users generate and edit visual content through simple gaze actions. The system improves the intuitiveness and accessibility of user interaction and broadens the range of tasks that gaze-driven content generation can cover.
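The gaze-controlled editing tasks described above depend on mapping a predicted gaze point to an object in the user's view. The sketch below illustrates one simple way this could work, assuming an object detector has already produced axis-aligned bounding boxes in the same pixel coordinates as the gaze point. The `Box` type, the `select_gazed_object` function, and the smallest-box tie-break are all assumptions made for illustration, not the paper's method.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical detector output: a labeled axis-aligned box in pixels."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

    def area(self) -> float:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def select_gazed_object(
    gaze_xy: Tuple[float, float], boxes: Sequence[Box]
) -> Optional[Box]:
    """Return the detected box under the gaze point, or None if there is none.

    Assumption: when boxes overlap, the smallest box containing the gaze
    point is the intended target (e.g. a cup rather than the table under it).
    """
    x, y = gaze_xy
    hits = [box for box in boxes if box.contains(x, y)]
    if not hits:
        return None
    return min(hits, key=Box.area)
```

With a `table` box spanning (0, 0) to (100, 100) and a `cup` box spanning (40, 40) to (60, 60), a gaze point at (50, 50) selects the cup, one at (10, 10) selects the table, and one at (200, 200) selects nothing. The selected box could then be handed to an editing step such as deletion or repositioning.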

GazeGen: Gaze-Driven User Interaction for Visual Content Generation

GVGNet: Gaze-Directed Visual Grounding for Learning Under-Specified Object Referring Intention

PerimetryNet: A Multiscale Fine Grained Deep Network for Three-Dimensional Eye Gaze Estimation Using Visual Field Analysis

TextGaze: Gaze-Controllable Face Generation with Natural Language

DGaze: CNN-Based Gaze Prediction in Dynamic Scenes

GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting

Gaze Gestures and Their Applications in Human-Computer Interaction with a Head-Mounted Display

DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images

Gaze Target Estimation inspired by Interactive Attention

Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention

GazeFusion: Saliency-guided Image Generation

RealtimeGen: an Intervenable AI Image Generation System for Commercial Digital Art Asset Creators

Instant interaction driven adaptive gaze control interface

GazeGPT: Augmenting Human Capabilities using Gaze-contingent Contextual AI for Smart Eyewear

Low-cost Geometry-based Eye Gaze Detection using Facial Landmarks Generated through Deep Learning

Gaze Generation for Avatars Using GANs

GazeDirector: Fully Articulated Eye Gaze Redirection in Video

FreeGaze: Resource-efficient Gaze Estimation via Frequency Domain Contrastive Learning

A Generalized and Robust Method Towards Practical Gaze Estimation on Smart Phone

Towards Pixel-Level Prediction for Gaze Following: Benchmark and Approach

3DGazeNet: Generalizing Gaze Estimation with Weak-Supervision from Synthetic Views