GazeGen: Gaze-Driven User Interaction for Visual Content Generation

He-Yen Hsieh, Ziyun Li, Sai Qian Zhang, Wei-Te Mark Ting, Kao-Den Chang, Barbara De Salvo, Chiao Liu, H. T. Kung
2024-11-07
Abstract: We present GazeGen, a user interaction system that generates visual content (images and videos) for locations indicated by the user's eye gaze. GazeGen allows intuitive manipulation of visual content by targeting regions of interest with gaze. Using advanced techniques in object detection and generative AI, GazeGen performs gaze-controlled image adding/deleting, repositioning, and surface material changes of image objects, and converts static images into videos. Central to GazeGen is the DFT Gaze (Distilled and Fine-Tuned Gaze) agent, an ultra-lightweight model with only 281K parameters, performing accurate real-time gaze predictions tailored to individual users' eyes on small edge devices. GazeGen is the first system to combine visual content generation with real-time gaze estimation, made possible exclusively by DFT Gaze. This real-time gaze estimation enables various visual content generation tasks, all controlled by the user's gaze. The input for DFT Gaze is the user's eye images, while the inputs for visual content generation are the user's view and the predicted gaze point from DFT Gaze. To achieve efficient gaze predictions, we derive the small model from a large model (10x larger) via novel knowledge distillation and personal adaptation techniques. We integrate knowledge distillation with a masked autoencoder, developing a compact yet powerful gaze estimation model. This model is further fine-tuned with Adapters, enabling highly accurate and personalized gaze predictions with minimal user input. DFT Gaze ensures low-latency and precise gaze tracking, supporting a wide range of gaze-driven tasks. We validate the performance of DFT Gaze on AEA and OpenEDS2020 benchmarks, demonstrating low angular gaze error and low latency on the edge device (Raspberry Pi 4). Furthermore, we describe applications of GazeGen, illustrating its versatility and effectiveness in various usage scenarios.
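The abstract reports "angular gaze error" without defining it here. As a minimal sketch, assuming the common convention of measuring the angle between a predicted and a ground-truth 3D gaze direction vector, the metric could be computed as follows. The function name `angular_error_deg` is hypothetical and is not taken from the paper.

```python
import math


def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors.

    Assumption: gaze is represented as a 3D direction vector; the paper's
    exact evaluation protocol is not given in this abstract.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_p == 0.0 or norm_t == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos_angle))
```

The vectors do not need to be unit length, because the function normalizes by both norms before taking the arccosine.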
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "GazeGen: Gaze-Driven User Interaction for Visual Content Generation" aims to solve the following problems:

1. **Intuitive and accessible visual content editing**:
   - Traditional visual content editing interfaces usually rely on physical input, which can be restrictive for users with physical disabilities. The paper proposes GazeGen, a gaze-driven system that generates and edits images and videos at the locations the user looks at, enabling hands-free interaction and improving engagement and accessibility.
2. **Real-time, high-precision gaze estimation**:
   - Existing gaze estimation models are usually large and hard to run in real time on edge devices. The paper develops DFT Gaze, an ultra-lightweight model derived from a model 10x larger through knowledge distillation and personal adaptation. It contains only 281K parameters and delivers accurate, real-time gaze predictions on small edge devices.
3. **Multi-task visual content generation**:
   - The paper proposes a framework that uses the user's gaze point to drive several visual content generation tasks: adding, deleting, and repositioning image objects, changing their surface materials, and converting static images into videos. These tasks require efficient gaze estimation as well as object detection and generative AI techniques.
4. **Personalization and adaptability**:
   - Because eye shapes and movement patterns differ between people, personalization is key to high-precision gaze estimation. DFT Gaze is fine-tuned with Adapters on a small number of user-specific samples, so it adapts to an individual user's eyes with minimal user input.
5. **Wide range of application scenarios**:
   - The paper demonstrates the applicability of GazeGen in scenarios including design, entertainment, and education. By combining object detection with generative AI methods, GazeGen simplifies complex editing tasks and makes visual content creation more intuitive and efficient.

### Summary

By combining efficient gaze estimation with generative AI, GazeGen lets users generate and edit visual content simply by looking at it. The system makes interaction more intuitive and accessible, and it broadens the range of settings in which visual content generation can be used.
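The gaze-driven editing tasks described above take the user's view and a predicted gaze point as input and rely on object detection to find the region of interest. As a rough illustration of that step, the sketch below picks the detected object the user is looking at. The function name `select_gazed_box`, the `(x0, y0, x1, y1)` box format, and the smallest-containing-box rule are all assumptions made for this example, not the paper's actual selection logic.

```python
def select_gazed_box(gaze_xy, boxes):
    """Return the index of the smallest box containing the gaze point.

    gaze_xy: (x, y) gaze point in image pixel coordinates.
    boxes:   list of (x0, y0, x1, y1) detections (hypothetical format).
    Returns None when the gaze point falls outside every box.
    """
    gx, gy = gaze_xy
    best_index, best_area = None, None
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        if x0 <= gx <= x1 and y0 <= gy <= y1:
            area = (x1 - x0) * (y1 - y0)
            # Prefer the tightest box so nested objects win over containers.
            if best_area is None or area < best_area:
                best_index, best_area = i, area
    return best_index


# Example: the gaze point lies inside both the large and the nested box.
detections = [(0, 0, 100, 100), (40, 40, 60, 60), (200, 200, 300, 300)]
selected = select_gazed_box((50, 50), detections)
```

Preferring the smallest containing box is one simple way to resolve overlaps, for example a cup on a table. A real system might instead weight candidates by detection confidence or by distance from the box center.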