Junbo Chen, Xupeng Chen, Ran Wang, Chenqian Le, Amirhossein Khalilian-Gourtani, Erika Jensen, Patricia Dugan, Werner Doyle, Orrin Devinsky, Daniel Friedman, Adeen Flinker, Yao Wang
Abstract: Objective: This study investigates speech decoding from neural signals captured by intracranial electrodes. Most prior work handles only electrodes on a 2D grid (i.e., an electrocorticographic or ECoG array) and data from a single patient. We aim to design a deep-learning model architecture that can accommodate both surface (ECoG) and depth (stereotactic EEG or sEEG) electrodes. The architecture should allow training on data from multiple participants with large variability in electrode placements, and the trained model should perform well on participants unseen during training.
Approach: We propose a novel transformer-based model architecture named SwinTW that can work with arbitrarily positioned electrodes, by leveraging their 3D locations on the cortex rather than their positions on a 2D grid. We train both subject-specific models using data from a single participant as well as multi-patient models exploiting data from multiple participants.
Main Results: The subject-specific models using only low-density 8x8 ECoG data achieved high decoding Pearson Correlation Coefficient with ground truth spectrogram (PCC=0.817), over N=43 participants, outperforming our prior convolutional ResNet model and the 3D Swin transformer model. Incorporating additional strip, depth, and grid electrodes available in each participant (N=39) led to further improvement (PCC=0.838). For participants with only sEEG electrodes (N=9), subject-specific models still enjoy comparable performance with an average PCC=0.798. The multi-subject models achieved high performance on unseen participants, with an average PCC=0.765 in leave-one-out cross-validation.
Significance: The proposed SwinTW decoder enables future speech neuroprostheses to utilize any electrode placement that is clinically optimal or feasible for a particular participant, including using only depth electrodes, which are more routinely implanted in chronic neurosurgical procedures. Importantly, the generalizability of the multi-patient models suggests the exciting possibility of developing speech neuroprostheses for people with speech disability without relying on their own neural data for training, which is not always feasible.
What problem does this paper attempt to address?
The paper aims to develop a deep-learning model architecture that can decode speech from surface and depth electrodes placed at arbitrary locations. Most existing work handles only electrodes on a two-dimensional grid (electrocorticography, or ECoG, arrays) and is usually limited to data from a single patient. The goal is an architecture that accommodates both surface (ECoG) and depth (stereoelectroencephalography, or sEEG) electrodes, can be trained on data from multiple participants whose electrode placements vary widely, and performs well on new participants not involved in training.
### Main Contributions
1. **Innovation in Model Architecture**:
- A new Transformer-based model architecture named SwinTW (Swin transformer with temporal windowing) is proposed. It processes signals from electrodes at any location by leveraging the electrodes' 3D positions on the cerebral cortex rather than their positions on a two-dimensional grid.
- The architecture supports both subject-specific models trained on data from a single participant and multi-patient models trained on data from multiple participants.
2. **Performance Improvement**:
- Subject-specific models using only low-density 8x8 ECoG data achieved a high Pearson correlation coefficient between the decoded and ground-truth spectrograms (PCC = 0.817) across 43 participants, outperforming the authors' prior convolutional ResNet model and the 3D Swin Transformer model.
- Incorporating the additional strip, depth, and grid electrodes available in each participant (39 participants) improved performance further (PCC = 0.838).
- For participants with only sEEG electrodes (9 participants), subject-specific models still achieved comparable performance (average PCC = 0.798).
- The multi-subject models also performed well on unseen participants in leave-one-out cross-validation (average PCC = 0.765).
3. **Potential for Clinical Applications**:
- The proposed SwinTW decoder enables future speech neuroprostheses to use any electrode placement that is clinically optimal or feasible, including depth electrodes alone, which are more routinely implanted in chronic neurosurgical procedures.
- The generalizability of the multi-patient models suggests that a trained model could be applied to new patients without paired acoustic and neural data from those patients, opening the possibility of speech neuroprostheses for people with speech disability when such training data cannot be collected.
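
The PCC figures above compare a decoded spectrogram with its ground-truth counterpart. As a minimal sketch of how such a score can be computed, the snippet below flattens both spectrograms and takes their Pearson correlation. The function name `spectrogram_pcc`, the flattening over all time-frequency bins, and the synthetic test data are assumptions made for illustration; the paper's exact evaluation protocol (for example, per-trial averaging) is not described in this summary.

```python
import numpy as np


def spectrogram_pcc(decoded, reference):
    """Pearson correlation between two spectrograms of identical shape.

    Assumption: the correlation is taken over all time-frequency bins
    after flattening; the paper may aggregate differently.
    """
    x = np.asarray(decoded, dtype=float).ravel()
    y = np.asarray(reference, dtype=float).ravel()
    if x.shape != y.shape:
        raise ValueError("spectrograms must have the same shape")
    # Center both signals, then normalize the inner product.
    x = x - x.mean()
    y = y - y.mean()
    denom = np.sqrt((x * x).sum() * (y * y).sum())
    if denom == 0.0:
        raise ValueError("PCC is undefined for a constant spectrogram")
    return float((x * y).sum() / denom)


# Synthetic stand-ins for a ground-truth and a decoded spectrogram
# (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
reference = rng.normal(size=(128, 256))
decoded = reference + 0.5 * rng.normal(size=reference.shape)
score = spectrogram_pcc(decoded, reference)
print(f"PCC = {score:.3f}")
```

A score of 1 means the decoded spectrogram is a perfect positive linear function of the reference; adding independent noise, as in the synthetic example, pulls the score below 1.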
### Method Overview
1. **Speech Decoding Framework**:
- A two-step training method is adopted. First, speech-to-speech training is carried out: a speech encoder extracts per-frame speech parameters from the input speech spectrogram, and a differentiable speech synthesizer reconstructs the spectrogram from those parameters. The speech encoder and the speech synthesizer are trained jointly so that the reconstructed spectrogram matches the original.
- Second, neural-to-speech training is carried out. The neural decoder is trained to predict time-varying speech parameters from neural signals. These predicted speech parameters are fed into the trained speech synthesizer to generate a predicted speech spectrogram, which is then converted into a predicted speech waveform.
2. **Neural Decoder Based on Temporal Swin Transformer**:
- **Time-slice Partitioning**: For a neural signal of shape \(T\times N\) (\(T\) is the number of frames, \(N\) is the number of electrodes), the time series of each electrode is partitioned into \(\frac{T}{W}\) time-slices, each of size \(W\). This yields \(\frac{T}{W}\times N\) time-slices, and each time-slice is mapped to a \(C\)-dimensional token through a linear embedding layer.
- **Temporal Window Attention**: In the Swin Transformer, tokens are partitioned into windows, each containing locally adjacent tokens, and attention is computed only between tokens within the same window. In SwinTW, the model partitions tokens into local windows only along the temporal dimension and allows spatial attention across all electrodes.
- **Time-slice Merging**: The Swin Transformer obtains a local inductive bias and hierarchical feature maps through patch merging. The electrodes here, however, are not arranged on a grid.
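
The time-slice partitioning and temporal window attention steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the random embedding weights, and the example sizes (`T = 16`, `N = 3`, `W = 4`, `C = 8`, a window of 2 time-slices) are all hypothetical, and the sketch leaves out the shifted windows of the original Swin Transformer as well as any use of the electrodes' 3D positions.

```python
import numpy as np


def partition_time_slices(signal, W):
    """Split a (T, N) signal into (T // W, N, W) time-slices, one per electrode."""
    T, N = signal.shape
    if T % W != 0:
        raise ValueError("T must be divisible by the slice size W")
    # (T, N) -> (T/W, W, N) -> (T/W, N, W)
    return signal.reshape(T // W, W, N).transpose(0, 2, 1)


def embed_slices(slices, weight, bias):
    """Linear embedding: map each length-W slice to a C-dimensional token."""
    return slices @ weight + bias  # (T/W, N, W) @ (W, C) -> (T/W, N, C)


def temporal_window_mask(num_slices, N, window):
    """Boolean (L, L) mask with L = num_slices * N; True where attention is allowed.

    Tokens are ordered slice-major (all electrodes of slice 0, then slice 1, ...).
    Two tokens may attend to each other when their time-slices fall in the same
    temporal window, regardless of which electrodes they come from.
    """
    slice_idx = np.repeat(np.arange(num_slices), N)  # time-slice index per token
    window_idx = slice_idx // window                 # temporal window per token
    return window_idx[:, None] == window_idx[None, :]


# Hypothetical sizes chosen only for the example.
T, N, W, C, window = 16, 3, 4, 8, 2
rng = np.random.default_rng(0)
signal = rng.normal(size=(T, N))

slices = partition_time_slices(signal, W)                    # shape (4, 3, 4)
weight, bias = rng.normal(size=(W, C)), np.zeros(C)          # stand-in for learned parameters
tokens = embed_slices(slices, weight, bias).reshape(-1, C)   # shape (12, 8)
mask = temporal_window_mask(T // W, N, window)               # shape (12, 12)
```

In an attention layer, scores at positions where `mask` is `False` would be set to a large negative value before the softmax, so each token attends to every electrode but only within its own temporal window.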