Listen to the Speaker in Your Gaze

Hongli Yang, Xinyi Chen, Junjie Li, Hao Huang, Siqi Cai, Haizhou Li
DOI: https://doi.org/10.1109/cis-ram61939.2024.10672879
2024-01-01
Abstract: Attending to a single voice at a cocktail party is notably challenging, particularly for individuals with hearing impairments. This paper proposes a novel eye-controlled target speaker extraction system, which consists of an eye-tracker, a face detection model, Active Speaker Detection (ASD), and a Target Speaker Extraction (TSE) model. The system uses the eye-tracker to capture real-time video together with the listener’s gaze. The gaze data then allows the face detection model to locate and isolate the target speaker’s face within the video on a frame-by-frame basis. Using the speaker’s face as the reference cue, the system can identify and separate that speaker’s speech from a multi-talker mixture. The experiments show that the system effectively extracts the target speaker’s speech in complex auditory environments, providing both real-time performance and accuracy. A demonstration of our system is available on our website.
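The abstract describes using the listener's gaze to pick out the target speaker's face in each video frame. The sketch below illustrates one plausible way to do that selection and is not the authors' implementation: it assumes the gaze point and the face bounding boxes share the same pixel coordinate system, and the names `FaceBox`, `select_gazed_face`, and the `max_distance` tolerance are hypothetical. It prefers a face whose box contains the gaze point and otherwise falls back to the nearest face centre within the tolerance.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class FaceBox:
    """Axis-aligned face bounding box in pixels (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

    def center(self) -> Tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def select_gazed_face(
    gaze_xy: Tuple[float, float],
    faces: Sequence[FaceBox],
    max_distance: float = 80.0,  # assumed tolerance in pixels, not from the paper
) -> Optional[int]:
    """Return the index of the face the listener is looking at, or None."""
    gx, gy = gaze_xy

    def distance_to_center(face: FaceBox) -> float:
        cx, cy = face.center()
        return hypot(gx - cx, gy - cy)

    # First choice: faces whose bounding box contains the gaze point.
    inside = [i for i, face in enumerate(faces) if face.contains(gx, gy)]
    if inside:
        return min(inside, key=lambda i: distance_to_center(faces[i]))

    # Fallback: nearest face centre, accepted only within the tolerance,
    # which absorbs small eye-tracker errors near a face.
    nearby = [
        i for i, face in enumerate(faces)
        if distance_to_center(face) <= max_distance
    ]
    if nearby:
        return min(nearby, key=lambda i: distance_to_center(faces[i]))
    return None
```

In a full pipeline, the selected face crop would presumably be passed on as the visual reference cue for speaker extraction; returning `None` lets the caller decide what to do when the listener is not looking at anyone.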