Abstract: \textit{Objective:} Conventional EEG-based auditory attention detection (AAD) compares the time-varying speech stimuli with the elicited EEG signals. However, to obtain reliable correlation values, these methods require a long decision window, which results in a long detection latency. Humans have a remarkable ability to recognize and follow a known speaker, regardless of the spoken content. In this paper, we seek to detect the attended speaker among pre-enrolled speakers from the elicited EEG signals, and thus avoid relying on the speech stimuli for AAD at run-time. To this end, we propose a novel EEG-based attended speaker detection (E-ASD) task. \textit{Methods:} We encode a speaker's voice as a fixed-dimensional vector, known as a speaker embedding, and project it to an audio-derived voice signature, which characterizes the speaker's unique voice regardless of the spoken content. We hypothesize that such a voice signature also exists in the listener's brain and can be decoded from the elicited EEG signals; we refer to it as the EEG-derived voice signature. By comparing the audio-derived voice signature with the EEG-derived voice signature, we detect the attended speaker in the listening brain. \textit{Results:} Experiments show that E-ASD can detect the attended speaker from 0.5 s EEG decision windows, achieving 99.78\% AAD accuracy, 99.94\% AUC, and 0.27\% EER. \textit{Conclusion:} We conclude that it is possible to derive the attended speaker's voice signature from the EEG signals and use it to detect the attended speaker in a listening brain. \textit{Significance:} We present the first proof of concept for detecting the attended speaker from the elicited EEG signals in a cocktail party environment. The successful implementation of E-ASD marks a non-trivial but crucial step towards smart hearing aids.
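
The comparison step described under Methods can be illustrated with a minimal sketch. The abstract does not state which similarity measure or decision rule E-ASD uses, so cosine similarity and an arg-max over enrolled speakers are assumptions made here purely for illustration. The function names, the three-dimensional toy vectors, and the speaker labels are likewise hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def detect_attended_speaker(
    eeg_signature: List[float],
    enrolled: Dict[str, List[float]],
) -> Tuple[str, Dict[str, float]]:
    """Score each pre-enrolled audio-derived signature against the
    EEG-derived signature and return the best-matching speaker."""
    scores = {
        name: cosine_similarity(eeg_signature, audio_signature)
        for name, audio_signature in enrolled.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: hypothetical audio-derived signatures for two enrolled speakers
# and a hypothetical signature decoded from one EEG decision window.
enrolled = {
    "speaker_a": [0.9, 0.1, 0.0],
    "speaker_b": [0.0, 0.2, 0.95],
}
eeg_signature = [0.8, 0.2, 0.1]

attended, scores = detect_attended_speaker(eeg_signature, enrolled)
print(attended, scores)
```

Because the enrolled signatures are computed once per speaker, a run-time decision under this sketch needs only the EEG-derived vector for the current window, which matches the abstract's point that the speech stimuli are not required at run-time.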

Listen to the Speaker in Your Gaze

Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

Target Active Speaker Detection with Audio-visual Cues

Speaker Extraction With Co-Speech Gestures Cue

Speaker Extraction with Detection of Presence and Absence of Target Speakers

EEG-Derived Voice Signature for Attended Speaker Detection

Selective Listening by Synchronizing Speech with Lips

Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

WASE: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues

NeuroHeed: Neuro-Steered Speaker Extraction using EEG Signals

USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction

Target Speaker Extraction by Directly Exploiting Contextual Information in the Time-Frequency Domain

E-Talk: Accelerating Active Speaker Detection with Audio-Visual Fusion and Edge-Cloud Computing

A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model

Coarse-to-Fine Target Speaker Extraction Based on Contextual Information Exploitation

pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications