Decoding speech perception from non-invasive brain recordings

Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, Jean-Rémi King
DOI: https://doi.org/10.1038/s42256-023-00714-5
2023-10-05
Abstract: Decoding speech from brain activity is a long-awaited goal in both healthcare and neuroscience. Invasive devices have recently led to major milestones in that regard: deep learning algorithms trained on intracranial recordings now start to decode elementary linguistic features (e.g. letters, words, spectrograms). However, extending this approach to natural speech and non-invasive brain recordings remains a major challenge. Here, we introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from the non-invasive recordings of a large cohort of healthy individuals. To evaluate this approach, we curate and integrate four public datasets, encompassing 175 volunteers recorded with magneto- or electro-encephalography (M/EEG), while they listened to short stories and isolated sentences. The results show that our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities on average across participants, and more than 80% in the very best participants - a performance that allows the decoding of words and phrases absent from the training set. The comparison of our model to a variety of baselines highlights the importance of (i) a contrastive objective, (ii) pretrained representations of speech and (iii) a common convolutional architecture simultaneously trained across multiple participants. Finally, the analysis of the decoder's predictions suggests that they primarily depend on lexical and contextual semantic representations. Overall, this effective decoding of perceived speech from non-invasive recordings delineates a promising path to decode language from brain activity, without putting patients at risk for brain surgery.
Audio and Speech Processing, Artificial Intelligence, Machine Learning, Neurons and Cognition
What problem does this paper attempt to address?
This paper addresses the problem of decoding perceived speech from non-invasive brain recordings, such as MEG and EEG, in healthy individuals and without brain surgery. Most current methods that achieve high-precision speech decoding rely on invasive devices, which require brain surgery and whose signal quality is difficult to maintain over the long term. The study therefore proposes a model trained with a contrastive learning objective. The model takes deep speech representations, obtained through self-supervised learning on large-scale speech data, and learns to match non-invasive brain recordings to them, so that it can identify which speech segment a participant heard. The researchers integrated four publicly available datasets containing MEG and EEG recordings of 175 participants who listened to short stories or isolated sentences. The results show that, from 3 seconds of MEG signals, the model identifies the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities on average across participants, and with more than 80% accuracy in the best participants. This performance allows the model to decode words and phrases that did not appear in the training set. Comparisons with baselines highlight the importance of three components: the contrastive objective, pretrained speech representations, and a common convolutional architecture trained simultaneously across multiple participants. Overall, the research demonstrates the potential for decoding speech perception from non-invasive brain recordings and points toward the future development of non-invasive brain-computer interfaces.
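The segment-identification task and contrastive objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an InfoNCE-style loss (a common contrastive formulation; the source does not name the exact loss), dot-product similarity, and tiny hand-written 3-dimensional embeddings in place of real model outputs. The names `contrastive_loss` and `top1_accuracy` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def contrastive_loss(brain_embs, speech_embs):
    """Assumed InfoNCE-style loss: for each brain embedding, the mean
    cross-entropy of selecting its matching speech segment (same index)
    among all candidate segments."""
    total = 0.0
    for i, z in enumerate(brain_embs):
        scores = [dot(z, y) for y in speech_embs]
        total -= log_softmax(scores)[i]
    return total / len(brain_embs)


def top1_accuracy(brain_embs, speech_embs):
    """Fraction of brain embeddings whose highest-scoring candidate
    is the speech segment that was actually heard."""
    hits = 0
    for i, z in enumerate(brain_embs):
        scores = [dot(z, y) for y in speech_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(brain_embs)


# Toy data (illustrative only): three candidate speech segments and the
# embeddings a hypothetical brain encoder predicted for each of them.
speech = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
brain = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.6, 0.3]]

print("loss:", contrastive_loss(brain, speech))
print("top-1 accuracy:", top1_accuracy(brain, speech))
```

In this toy setting, two of the three brain embeddings rank their own segment first, against a chance level of one in three. The figures reported in the paper are for the same kind of task with more than 1,000 candidate segments.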