The Multiscale Surface Vision Transformer

Simon Dahan, Logan Z. J. Williams, Daniel Rueckert, Emma C. Robinson
2024-06-11
Abstract: Surface meshes are a favoured domain for representing structural and functional information on the human cortex, but their complex topology and geometry pose significant challenges for deep learning analysis. While Transformers have excelled as domain-agnostic architectures for sequence-to-sequence learning, the quadratic cost of the self-attention operation remains an obstacle for many dense prediction tasks. Inspired by some of the latest advances in hierarchical modelling with vision transformers, we introduce the Multiscale Surface Vision Transformer (MS-SiT) as a backbone architecture for surface deep learning. The self-attention mechanism is applied within local-mesh-windows to allow for high-resolution sampling of the underlying data, while a shifted-window strategy improves the sharing of information between windows. Neighbouring patches are successively merged, allowing the MS-SiT to learn hierarchical representations suitable for any prediction task. Results demonstrate that the MS-SiT outperforms existing surface deep learning methods for neonatal phenotyping prediction tasks using the Developing Human Connectome Project (dHCP) dataset. Furthermore, building the MS-SiT backbone into a U-shaped architecture for surface segmentation demonstrates competitive results on cortical parcellation using the UK Biobank (UKB) and manually-annotated MindBoggle datasets. Code and trained models are publicly available at https://github.com/metrics-lab/surface-vision-transformers.
Image and Video Processing, Computer Vision and Pattern Recognition, Neurons and Cognition
What problem does this paper attempt to address?
This paper addresses the difficulty of applying deep learning to cortical surface data, whose complex topology and geometry make it hard to analyse with standard architectures. The authors propose the Multiscale Surface Vision Transformer (MS-SiT), a backbone network for surface deep learning. The core problem is that the cost of global self-attention in a standard Transformer grows quadratically with sequence length, which limits how finely the surface can be sampled and makes the model hard to apply directly to dense prediction tasks. Inspired by the Swin Transformer, MS-SiT instead applies self-attention within local mesh windows, which reduces the computational cost and permits high-resolution sampling of the data, while a shifted-window strategy shares information between neighbouring windows so that dependencies beyond a single window can still be modelled. The architecture is hierarchical: adjacent patches are successively merged, so the network learns multiscale representations suitable for a range of prediction tasks. Experimental results show that MS-SiT outperforms existing surface deep learning methods on neonatal phenotype prediction tasks using the Developing Human Connectome Project (dHCP) dataset. When built into a U-shaped architecture for cortical parcellation, it achieves results competitive with existing methods on the UK Biobank (UKB) and manually annotated MindBoggle datasets. In summary, the paper targets the efficiency limits that global self-attention imposes on cortical surface analysis, and shows that a windowed, hierarchical Transformer can model structural and functional surface data at high resolution while remaining accurate.
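The windowing idea summarised above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the surface patches are already ordered so that each local mesh window is a contiguous block of the token sequence, uses a single attention head with identity query/key/value projections, models the window shift as a simple roll of the sequence, and omits the masking and learned projections a real model would use. All function names and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_attention(tokens, window_size, shift=0):
    """Single-head self-attention restricted to non-overlapping windows.

    tokens: array of shape (N, d), with N divisible by window_size.
    shift:  number of positions to roll the sequence before windowing
            (assumption: a stand-in for the shifted-window strategy).
    """
    n, d = tokens.shape
    assert n % window_size == 0, "N must be divisible by window_size"
    # Roll the sequence so that window boundaries fall in different places.
    x = np.roll(tokens, -shift, axis=0)
    # Group tokens into windows: (num_windows, window_size, d).
    w = x.reshape(n // window_size, window_size, d)
    # Identity projections for Q, K, V (simplification for brevity).
    scores = w @ w.transpose(0, 2, 1) / np.sqrt(d)
    out = (softmax(scores, axis=-1) @ w).reshape(n, d)
    # Undo the roll so that output row i corresponds to input token i.
    return np.roll(out, shift, axis=0)


def merge_patches(tokens, factor=4):
    """Concatenate each group of `factor` consecutive tokens into one token.

    A real model would follow this with a learned linear projection.
    """
    n, d = tokens.shape
    assert n % factor == 0, "N must be divisible by factor"
    return tokens.reshape(n // factor, factor * d)


def num_attention_scores(n, window_size=None):
    """Number of pairwise attention scores computed per layer."""
    if window_size is None:
        return n * n  # global attention: quadratic in N
    return (n // window_size) * window_size ** 2  # windowed: linear in N


rng = np.random.default_rng(0)
tokens = rng.normal(size=(1024, 8))             # hypothetical: 1024 patches, 8 features
local = window_attention(tokens, window_size=64)
shifted = window_attention(local, window_size=64, shift=32)
coarse = merge_patches(shifted, factor=4)       # 256 tokens with 32 features each
print(num_attention_scores(1024), num_attention_scores(1024, 64))
```

With these hypothetical sizes, windowing reduces the number of attention scores per layer by a factor of N / window_size = 16, and alternating unshifted and shifted layers lets information cross window boundaries without ever computing global attention.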