Abstract: Objective: Transformer-based neural networks have been applied to electroencephalography (EEG) decoding for motor imagery (MI). However, most networks apply the self-attention mechanism only to extract global temporal information and neglect cross-frequency coupling features. Additionally, effectively integrating different types of neural networks remains a challenge in the design of advanced decoding algorithms. Methods: This study proposes a novel end-to-end Multi-Scale Vision Transformer Neural Network (MSVTNet) for MI-EEG classification. MSVTNet first extracts local spatio-temporal features at different filtered scales through convolutional neural networks (CNNs). These features are then concatenated along the feature dimension to form local multi-scale spatio-temporal feature tokens. Finally, Transformers are used to capture cross-scale interaction information and global temporal correlations, providing more distinguishable feature embeddings for classification. Moreover, an auxiliary branch loss is used for intermediate supervision to ensure the effective integration of the CNNs and Transformers. Results: The performance of MSVTNet was assessed through subject-dependent (session-dependent and session-independent) and subject-independent experiments on three MI datasets: the BCI Competition IV 2a and 2b datasets and the OpenBMI dataset. The experimental results demonstrate that MSVTNet achieves state-of-the-art performance in all analyses. Conclusion: MSVTNet shows superiority and robustness in enhancing MI decoding performance. The source code for MSVTNet is available at https://github.com/SheepTAO/MSVTNet.
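
As a rough illustration of two steps named in the Methods summary, the sketch below shows (i) concatenating per-branch features along the feature dimension to form one token per time step, and (ii) adding branch-level losses to the main classification loss for intermediate supervision. It is not taken from the MSVTNet repository: the function names, the nested-list data layout, the plain sum over branches, and the `aux_weight` parameter are all assumptions made for illustration, and the CNN and Transformer stages themselves are omitted.

```python
import math


def concat_tokens(branch_feats):
    """Concatenate branch outputs along the feature dimension.

    branch_feats: list of B branches, each a list of T time steps,
    each time step a list of F features (hypothetical layout).
    Returns T tokens, each holding the features of all branches.
    """
    n_steps = len(branch_feats[0])
    if any(len(branch) != n_steps for branch in branch_feats):
        raise ValueError("all branches must share the same number of time steps")
    return [
        [value for branch in branch_feats for value in branch[t]]
        for t in range(n_steps)
    ]


def cross_entropy(logits, label):
    """Cross-entropy of one sample, using a numerically stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def total_loss(main_logits, branch_logits, label, aux_weight=1.0):
    """Main loss plus weighted auxiliary branch losses.

    The unweighted sum over branches and the aux_weight default are
    assumptions; the abstract does not specify how the losses are combined.
    """
    main = cross_entropy(main_logits, label)
    aux = sum(cross_entropy(logits, label) for logits in branch_logits)
    return main + aux_weight * aux


# Two branches, two time steps, two features per branch and step.
tokens = concat_tokens([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
loss = total_loss([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], label=0)
```

In this toy layout each token carries the features of every scale for one time step, which is what would let a subsequent attention layer relate scales within a token and time steps across tokens.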

MSVTNet: Multi-Scale Vision Transformer Neural Network for EEG-Based Motor Imagery Decoding