EEG-Derived Voice Signature for Attended Speaker Detection

Hongxu Zhu, Siqi Cai, Yidi Jiang, Qiquan Zhang, Haizhou Li
2023-08-28
Abstract: Objective: Conventional EEG-based auditory attention detection (AAD) is achieved by comparing the time-varying speech stimuli and the elicited EEG signals. However, in order to obtain reliable correlation values, these methods necessitate a long decision window, resulting in a long detection latency. Humans have a remarkable ability to recognize and follow a known speaker, regardless of the spoken content. In this paper, we seek to detect the attended speaker among the pre-enrolled speakers from the elicited EEG signals. In this manner, we avoid relying on the speech stimuli for AAD at run-time. In doing so, we propose a novel EEG-based attended speaker detection (E-ASD) task. Methods: We encode a speaker's voice with a fixed dimensional vector, known as speaker embedding, and project it to an audio-derived voice signature, which characterizes the speaker's unique voice regardless of the spoken content. We hypothesize that such a voice signature also exists in the listener's brain that can be decoded from the elicited EEG signals, referred to as EEG-derived voice signature. By comparing the audio-derived voice signature and the EEG-derived voice signature, we are able to effectively detect the attended speaker in the listening brain. Results: Experiments show that E-ASD can effectively detect the attended speaker from the 0.5 s EEG decision windows, achieving 99.78% AAD accuracy, 99.94% AUC, and 0.27% EER. Conclusion: We conclude that it is possible to derive the attended speaker's voice signature from the EEG signals so as to detect the attended speaker in a listening brain. Significance: We present the first proof of concept for detecting the attended speaker from the elicited EEG signals in a cocktail party environment. The successful implementation of E-ASD marks a non-trivial, but crucial step towards smart hearing aids.
Audio and Speech Processing, Sound, Signal Processing, Quantitative Methods
What problem does this paper attempt to address?
This paper addresses how to use electroencephalogram (EEG) signals to detect which speaker a listener is attending to in a multi-speaker environment. Conventional auditory attention detection (AAD) compares the time-varying speech stimuli with the elicited EEG signals, but it needs a long decision window to obtain reliable correlation values, which results in a long detection latency. These methods also depend on the speech stimuli being available at run-time, so they perform poorly in noisy or chaotic environments.
The paper proposes a new EEG-based task, E-ASD (EEG-based Attended Speaker Detection), which aims to detect the identity of the attended speaker, among pre-enrolled speakers, directly from the EEG signals. Specifically, the authors hypothesize that a voice signature exists in the listener's brain and can be decoded from the elicited EEG signals, and that this signature corresponds to the speaker's voice characteristics regardless of the spoken content. By comparing the audio-derived voice signature with the EEG-derived voice signature, the attended speaker can be detected.
To test this hypothesis, the authors design a network with three main modules: a Speaker Encoder, a Brain Encoder, and an Attended Speaker Detector. The Speaker Encoder converts a speaker's voice into a fixed-dimensional vector, the speaker embedding, and projects it to the audio-derived voice signature. The Brain Encoder decodes the attended speaker's voice signature from the EEG signals, yielding the EEG-derived voice signature. Finally, the Attended Speaker Detector estimates the matching probability for each candidate pair of audio-derived and EEG-derived voice signatures and selects the speaker with the highest matching probability as the attended speaker.
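The selection step of the Attended Speaker Detector can be illustrated with a minimal sketch. It is not the authors' network: the paper's detector is learned, whereas this example assumes, purely for illustration, that the matching score is a cosine similarity turned into a probability by a softmax. The function names, the `temperature` parameter, the 64-dimensional signatures, and the random toy vectors are all hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def detect_attended_speaker(eeg_signature, audio_signatures, temperature=0.1):
    """Pick the enrolled speaker whose audio-derived signature best matches
    the EEG-derived signature.

    Assumption: cosine similarity + softmax stands in for the learned
    matching probability described in the paper.
    """
    names = list(audio_signatures)
    scores = np.array(
        [cosine_similarity(eeg_signature, audio_signatures[n]) for n in names]
    )
    logits = scores / temperature
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    best = names[int(np.argmax(probs))]     # highest matching probability wins
    return best, dict(zip(names, probs))

# Toy data (hypothetical): two enrolled speakers, and an EEG-derived
# signature that is a noisy copy of speaker_B's audio-derived signature.
rng = np.random.default_rng(0)
enrolled = {
    "speaker_A": rng.normal(size=64),
    "speaker_B": rng.normal(size=64),
}
eeg_sig = enrolled["speaker_B"] + 0.1 * rng.normal(size=64)

attended, probs = detect_attended_speaker(eeg_sig, enrolled)
print(attended, probs)
```

Because the comparison is between fixed-dimensional signatures rather than between EEG and the time-varying speech signal, no speech stimulus is needed at run-time; only the enrolled speakers' signatures are.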
The experimental results show that E-ASD can detect the attended speaker from an EEG decision window of 0.5 seconds, achieving an AAD accuracy of 99.78%, an AUC of 99.94%, and an EER of 0.27%. The authors present this as a first proof of concept and an important step toward smart hearing aids, particularly for detecting the attended speaker in a cocktail-party environment.
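For readers unfamiliar with the two verification metrics, the sketch below shows one common way to compute AUC and EER from match scores. It is an illustration under stated assumptions, not the paper's evaluation code: the function name `auc_and_eer` and the toy scores are hypothetical, and the EER is approximated at the threshold where the false acceptance and false rejection rates are closest.

```python
import numpy as np

def auc_and_eer(scores, labels):
    """Compute AUC and an approximate EER from match scores.

    scores: higher means "more likely the attended speaker".
    labels: 1/True for matching (attended) pairs, 0/False otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]

    # AUC: probability that a random positive outscores a random negative
    # (ties count as one half).
    diff = pos[:, None] - neg[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

    # EER: sweep thresholds and find where the false acceptance rate (FAR)
    # and the false rejection rate (FRR) are closest.
    thresholds = np.unique(scores)
    far = np.array([np.mean(neg >= t) for t in thresholds])
    frr = np.array([np.mean(pos < t) for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    eer = (far[i] + frr[i]) / 2
    return float(auc), float(eer)

# Toy scores (hypothetical): one positive pair scores below one negative pair.
auc, eer = auc_and_eer([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(auc, eer)
```

A lower EER and a higher AUC both indicate better separation between matching and non-matching signature pairs.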