Single-word Auditory Attention Decoding Using Deep Learning Model

Nhan Duc Thanh Nguyen, Huy Phan, Kaare Mikkelsen, Preben Kidmose
2024-10-16
Abstract: Identifying auditory attention by comparing auditory stimuli with the corresponding brain responses is known as auditory attention decoding (AAD). The majority of AAD algorithms utilize the so-called envelope entrainment mechanism, whereby auditory attention is identified by how the envelope of the auditory stream drives variation in the electroencephalography (EEG) signal. However, neural processing can also be decoded based on endogenous cognitive responses, in this case neural responses evoked by attention to specific words in a speech stream. This approach is largely unexplored in the field of AAD, but it leads to a single-word auditory attention decoding problem in which an epoch of an EEG signal timed to a specific word is labeled as attended or unattended. This paper presents a deep learning approach, based on EEGNet, to address this challenge. We conducted a subject-independent evaluation on an event-based AAD dataset with three different paradigms: word category oddball, word category with competing speakers, and competing speech streams with targets. The results demonstrate that the adapted model is capable of exploiting cognitive-related spatiotemporal EEG features and achieving at least 58% accuracy on the most realistic competing paradigm for unseen subjects. To our knowledge, this is the first study dealing with this problem.
Signal Processing, Artificial Intelligence, Human-Computer Interaction, Sound, Audio and Speech Processing, Neurons and Cognition
What problem does this paper attempt to address?
The paper addresses single-word auditory attention decoding (AAD): determining from electroencephalogram (EEG) signals whether a listener attended to a particular word in a multi-speaker environment. Specifically, the researchers explore deep-learning methods, in particular the EEGNet architecture, to classify whether a given word was attended by the subject. This differs from the conventional approach based on audio envelope reconstruction, which relies mainly on neural responses driven by the external stimulus. The proposed method instead focuses on endogenous cognitive responses, that is, event-related potentials (ERPs) evoked by specific words in a multi-speaker environment, and uses them to classify auditory attention at the single-word level.

### Main Challenges:
1. **Small and Imbalanced Dataset**: Because of the experimental design, the dataset is relatively small and imbalanced (at a ratio of approximately 1:5), which makes model training difficult.
2. **Generalization Ability of the Model**: The model is required to perform well on unseen subjects, especially in cross-paradigm situations.

### Solutions:
1. **Data Augmentation**: To compensate for the small, imbalanced dataset, the researchers proposed two data augmentation methods:
   - **Average Up-sampling**: New samples are generated by averaging randomly selected samples from each category, increasing the amount and diversity of data.
   - **ERP Simulation**: New target samples are generated by adding target (attended) ERP waveforms to non-target (unattended) samples, introducing more variability.
2. **Model Selection**: The lightweight EEGNet architecture was selected. It extracts spatial and temporal features effectively and has few parameters, which makes it suitable for limited datasets.
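
The two augmentation strategies listed under Solutions can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `average_upsample` and `simulate_erp`, the group size `k`, the epoch array shape `(n_epochs, n_channels, n_samples)`, the use of a class-mean difference as the ERP template, and the assumption that target (attended) epochs are the minority class are all hypothetical choices that the summary does not specify.

```python
import numpy as np


def average_upsample(epochs, n_new, k=3, rng=None):
    """Create n_new synthetic epochs, each the mean of k random epochs.

    epochs: array (n_epochs, n_channels, n_samples), all from ONE class.
    k is a hypothetical group size; the source does not state it.
    """
    rng = np.random.default_rng(rng)
    n_epochs = epochs.shape[0]
    group = min(k, n_epochs)
    out = np.empty((n_new,) + epochs.shape[1:], dtype=float)
    for i in range(n_new):
        # Pick `group` distinct epochs and average them sample by sample.
        idx = rng.choice(n_epochs, size=group, replace=False)
        out[i] = epochs[idx].mean(axis=0)
    return out


def simulate_erp(nontarget, target, n_new, rng=None):
    """Create n_new synthetic target epochs from non-target epochs.

    Assumption: the ERP template is the target class mean minus the
    non-target class mean; the source only says that target ERP
    waveforms are added to non-target samples.
    """
    rng = np.random.default_rng(rng)
    template = target.mean(axis=0) - nontarget.mean(axis=0)
    idx = rng.choice(nontarget.shape[0], size=n_new, replace=True)
    return nontarget[idx] + template


# Toy data with a roughly 1:5 class ratio (random noise, not real EEG).
toy_rng = np.random.default_rng(0)
nontarget = toy_rng.normal(size=(50, 8, 64))
target = toy_rng.normal(size=(10, 8, 64))

extra_avg = average_upsample(target, n_new=20, k=3, rng=1)
extra_erp = simulate_erp(nontarget, target, n_new=20, rng=2)
balanced_target = np.concatenate([target, extra_avg, extra_erp])
```

In this toy setup the 10 original target epochs are extended to 50, matching the number of non-target epochs. Averaging also lowers the noise level of the synthetic epochs, so how much of each strategy to mix in is a design choice that would need to be tuned on real data.
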
### Experimental Results:
- **Subject-Pool Performance**: In all three paradigms, models trained with data augmentation significantly outperform models trained without it. In Paradigm 1 and Paradigm 2 in particular, the paradigm-specific models perform better than the paradigm-independent models.
- **Leave-One-Out Validation**: For unseen subjects, the models trained with data augmentation still perform well, although performance drops slightly. This indicates that the model has some ability to generalize.

### Conclusion:
The study shows that deep-learning methods, in particular the EEGNet architecture, can decode auditory attention from EEG responses to single words. The data augmentation strategy is crucial for model performance, especially when the dataset is small and imbalanced. Future research could explore how to combine endogenous and exogenous responses to improve the robustness and generalization ability of the model.
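
The evaluation on unseen subjects can be illustrated with a small splitting helper. This sketch assumes that "leave-one-out" here means holding out all epochs of one subject per fold; the function name `leave_one_subject_out` and the toy subject labels are hypothetical and do not come from the paper.

```python
import numpy as np


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx) for each subject.

    subject_ids: one subject label per epoch, in epoch order.
    """
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        # Every epoch of the held-out subject goes to the test fold.
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Toy example: three hypothetical subjects with two epochs each.
subject_ids = [1, 1, 2, 2, 3, 3]
folds = list(leave_one_subject_out(subject_ids))

for subject, train_idx, test_idx in folds:
    print(f"held out subject {subject}: train={train_idx}, test={test_idx}")
```

With a split like this, any data augmentation would normally be computed from the training indices only, so that no information from the held-out subject leaks into training.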