Abstract: Vision transformers (ViTs) are increasingly utilized for hyperspectral image (HSI) classification due to their outstanding performance. However, ViTs encounter challenges in capturing global dependencies among objects of varying sizes, and fail to effectively exploit the spatial–spectral information inherent in HSI. To address these limitations, we propose the multi-scale spatial–spectral transformer (MSST). Within the MSST framework, we introduce a spatial–spectral token generator (SSTG) and a token fusion self-attention (TFSA) module. Serving as the feature extractor for the MSST, the SSTG incorporates a dual-branch multi-dimensional convolutional structure that extracts semantic characteristics encompassing spatial–spectral information from HSI and then tokenizes them. TFSA is a multi-head attention module that can encode attention to features across various scales. We integrate TFSA with cross-covariance attention (CCA) to construct the transformer encoder (TE) for the MSST. By using this TE to perform attention modeling on the tokens produced by the SSTG, the network models global dependencies among multi-scale features in the data while also exploiting the spatial–spectral information in HSI. Finally, the output of the TE is fed into a linear mapping layer to obtain the classification results. Experiments conducted on three popular public datasets demonstrate that the MSST method achieves higher classification accuracy than state-of-the-art (SOTA) methods.

What problem does this paper attempt to address?

This paper addresses two weaknesses of existing hyperspectral image (HSI) classification methods: they struggle to capture global dependencies between targets of different scales, and they do not fully exploit the spatial–spectral information in hyperspectral images. Specifically:

1. **Global dependencies among multi-scale features**: Conventional convolutional neural networks (CNNs) are constrained by a fixed convolution kernel size, which makes their classification of multi-scale targets less reliable. In addition, their sliding-window mechanism makes it difficult to model global dependencies between different image elements.
2. **Effective use of spatial–spectral information**: Existing Transformer-based methods can capture high-level semantic features, but they usually generate the tokens for the attention model from fixed-scale patches, which largely ignores the multi-scale nature of targets in the image. Moreover, Transformer networks cannot directly exploit the rich spatial and spectral information in hyperspectral images, which further limits their classification accuracy.

To solve these problems, the authors propose the multi-scale spatial–spectral Transformer (MSST), which aims to strengthen the Transformer's ability to model global dependencies among multi-scale features and to fully mine the spatial–spectral features of hyperspectral images. The main contributions are:

1. **Redesigned feature extractor (SSTG)**: The spatial–spectral token generator (SSTG) uses a dense multi-dimensional convolution structure to extract spatial–spectral features from hyperspectral images. It also adds a branch that extracts the spectral features of the query pixel, compensating for spectral information degraded during convolution. In this way, SSTG can express the spatial–spectral semantic features of the image during classification.
2. **Multi-scale attention mechanism (TFSA)**: The token fusion self-attention (TFSA) module generates keys and values of different sizes by fusing tokens at different scales. Different attention heads then model features at different scales within the attention layer. This lets the mechanism model global dependencies among multi-scale features and improves the classification of multi-scale targets.
3. **MSST network combined with CCA**: The authors combine SSTG and TFSA with cross-covariance attention (CCA) to build the MSST classification network. This hybrid network integrates global and local modeling, accounts for the multi-scale characteristics of targets in hyperspectral images, and makes effective use of spatial–spectral features.

With these components, the reported experiments on three public datasets show MSST achieving higher classification accuracy than existing state-of-the-art methods.
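The TFSA idea described above (keys and values built from tokens fused at different scales, with each attention head working at its own scale) can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify the fusion operator, the projections, or how heads are assigned, so the sketch assumes average pooling over consecutive tokens as the fusion step, identity projections in place of learned ones, and an even split of channels across heads. The function names `fuse_tokens`, `attend`, and `multi_scale_attention` are hypothetical.

```python
import math


def fuse_tokens(tokens, scale):
    """Average every `scale` consecutive tokens into one coarser token.

    Assumption: average pooling stands in for the paper's token fusion.
    """
    fused = []
    for start in range(0, len(tokens), scale):
        group = tokens[start:start + scale]
        dim = len(group[0])
        fused.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return fused


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention for a single head."""
    factor = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [factor * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


def multi_scale_attention(tokens, scales):
    """One head per entry in `scales`; head h sees keys/values fused at scales[h].

    Assumption: identity projections and an even channel split across heads.
    """
    head_dim = len(tokens[0]) // len(scales)
    outputs = [[] for _ in tokens]
    for h, s in enumerate(scales):
        lo, hi = h * head_dim, (h + 1) * head_dim
        queries = [t[lo:hi] for t in tokens]                   # full resolution
        kv = [t[lo:hi] for t in fuse_tokens(tokens, s)]        # coarser for s > 1
        for row, head_out in zip(outputs, attend(queries, kv, kv)):
            row.extend(head_out)                               # concatenate heads
    return outputs


# Toy input: 6 tokens with 4 channels each, two heads at scales 1 and 2.
tokens = [[float(i + j) for j in range(4)] for i in range(6)]
out = multi_scale_attention(tokens, scales=[1, 2])
print(len(out), len(out[0]))
```

In this toy setup the first head attends over all 6 tokens while the second attends over 3 pooled tokens, so the output keeps one row per query token while mixing fine and coarse context. A real model would add learned query, key, and value projections and an output projection.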

A Spatial–Spectral Transformer for Hyperspectral Image Classification Based on Global Dependencies of Multi-Scale Features

A Novel Transformer Network with a CNN-Enhanced Cross-Attention Mechanism for Hyperspectral Image Classification

3D-Convolution Guided Spectral-Spatial Transformer for Hyperspectral Image Classification

Hyperspectral Image Classification Using Spectral–Spatial Token Enhanced Transformer with Hash-Based Positional Embedding

CS2DT: Cross Spatial–Spectral Dense Transformer for Hyperspectral Image Classification

MASSFormer: Memory-Augmented Spectral-Spatial Transformer for Hyperspectral Image Classification

Multilevel Class Token Transformer With Cross TokenMixer for Hyperspectral Images Classification

MultiScale Spectral-Spatial Convolutional Transformer for Hyperspectral Image Classification

Two‐branch global spatial–spectral fusion transformer network for hyperspectral image classification

Adaptive Learnable Spectral–Spatial Fusion Transformer for Hyperspectral Image Classification

Hyperspectral Image Classification Using Groupwise Separable Convolutional Vision Transformer Network

Multimodal Fusion Transformer for Remote Sensing Image Classification

Global–Local 3-D Convolutional Transformer Network for Hyperspectral Image Classification

Vision Transformer with Super Token Sampling

Hierarchical Spectral–Spatial Transformer for Hyperspectral and Multispectral Image Fusion

Hierarchical Attention Transformer for Hyperspectral Image Classification

Hyperspectral Image Classification via Spectral Pooling and Hybrid Transformer

MSMT-LCL: Multiscale Spatial-Spectral Masked Transformer With Local Contrastive Learning for Hyperspectral Image Classification

MHST: Multiscale Head Selection Transformer for Hyperspectral and LiDAR Classification

A Dual-Branch Multiscale Transformer Network for Hyperspectral Image Classification