Abstract: Over the past three years, vision Transformers have attracted significant interest for hyperspectral imagery (HSI) classification in the analysis of remotely sensed data. Previous research has predominantly focused on empirically integrating convolutional neural networks (CNNs) to strengthen the network's ability to extract local feature information. Yet the theoretical justification for vision Transformers outperforming CNN architectures in HSI classification remains an open question. To address this issue, a unified hierarchical spectral vision Transformer architecture, specifically tailored for HSI classification, is investigated. In this streamlined yet effective architecture, several mixer modules are each integrated separately: the CNN mixer, which executes convolutional operations; the spatial self-attention (SSA) mixer and the channel self-attention (CSA) mixer, both of which are adaptations of the classical self-attention block; and hybrid mixers, such as the SSA + CNN mixer and the CSA + CNN mixer, which merge convolution with self-attention operations. This integration yields a broad spectrum of vision Transformer-based models tailored for HSI classification. For the training process, a comprehensive analysis contrasts classical CNN models with their vision Transformer-based counterparts, with particular attention to disturbance robustness and the distribution of the largest eigenvalue of the Hessian. From the evaluations conducted on the various mixer models built on the unified architecture, it is concluded that the unique strength of vision Transformers can be attributed to their overarching architecture rather than being exclusively reliant on individual multihead self-attention (MSA) components. Extensive experiments demonstrate that the derived vision Transformer models, based on the unified architecture, surpass the classical methods on multiple hyperspectral benchmark datasets.
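
The abstract names the SSA and CSA mixers without giving their formulas, so the following is only a generic sketch of how spatial and channel self-attention commonly differ, not the authors' implementation. It assumes a patch flattened into N tokens with C channels each; the function names, the single-head form, the projection matrices `wq`, `wk`, `wv`, and the scaling factors are all illustrative assumptions. The point it shows is that spatial attention builds an N x N map over tokens, while channel attention builds a C x C map over feature channels.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_self_attention(tokens, wq, wk, wv):
    """Single-head attention across tokens (hypothetical SSA-style mixer).

    tokens: (N, C) array. Returns the mixed tokens (N, C) and the (N, N) map.
    """
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    # Each token attends to every other token: the map is N x N.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn


def channel_self_attention(tokens, wq, wk, wv):
    """Single-head attention across channels (hypothetical CSA-style mixer).

    tokens: (N, C) array. Returns the mixed tokens (N, C) and the (C, C) map.
    The 1/sqrt(N) scaling is an assumption; some designs learn it instead.
    """
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    # Each channel attends to every other channel: the map is C x C.
    attn = softmax(q.T @ k / np.sqrt(q.shape[0]), axis=-1)
    # Output channel i is a weighted sum of the value channels.
    return v @ attn.T, attn


# Illustrative sizes only: a 7 x 7 patch (49 tokens) with 16 feature channels.
rng = np.random.default_rng(0)
n_tokens, n_channels = 49, 16
tokens = rng.standard_normal((n_tokens, n_channels))
wq, wk, wv = (0.1 * rng.standard_normal((n_channels, n_channels)) for _ in range(3))

ssa_out, ssa_map = spatial_self_attention(tokens, wq, wk, wv)
csa_out, csa_map = channel_self_attention(tokens, wq, wk, wv)
print("SSA map:", ssa_map.shape, "CSA map:", csa_map.shape)
```

Both functions return an output with the same shape as the input, which is what would let either one (or a convolution) be dropped into the same slot of a shared block. The cost of the spatial map grows with the square of the number of tokens, while the channel map grows with the square of the number of channels.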

HTD-VIT: Spectral-Spatial Joint Hyperspectral Target Detection with Vision Transformer

Iterative Autoencoder Coupling with Constrained Energy Minimization for Hyperspectral Target Detection

Cross-Domain Hyperspectral Image Classification Based on Transformer

Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification

CS2DT: Cross Spatial–Spectral Dense Transformer for Hyperspectral Image Classification

Vision Transformer-Based Ensemble Learning for Hyperspectral Image Classification

Hyperspectral Image Classification Using Groupwise Separable Convolutional Vision Transformer Network

Hierarchical Attention Transformer for Hyperspectral Image Classification

Discriminative Vision Transformer for Heterogeneous Cross-Domain Hyperspectral Image Classification

Hybrid Conv-ViT Network for Hyperspectral Image Classification

3D-Convolution Guided Spectral-Spatial Transformer for Hyperspectral Image Classification

A Spatial–Spectral Transformer for Hyperspectral Image Classification Based on Global Dependencies of Multi-Scale Features

Transfer Learning of Spatial Features From High-Resolution RGB Images for Large-Scale and Robust Hyperspectral Remote Sensing Target Detection

Spectral–Spatial–Temporal Transformers for Hyperspectral Image Change Detection

Hyperspectral Video Target Tracking based on Pixel-wise Spectral Matching Reduction and Deep Spectral Cascading Texture Features

HTD-Mamba: Efficient Hyperspectral Target Detection with Pyramid State Space Model

Multiview Transformer: Rethinking Spatial Information in Hyperspectral Image Classification

Investigation of Hierarchical Spectral Vision Transformer Architecture for Classification of Hyperspectral Imagery

Transformer-Driven Inverse Problem Transform for Fast Blind Hyperspectral Image Dehazing

Wavelet Tree Transformer: Multihead Attention With Frequency-Selective Representation and Interaction for Remote Sensing Object Detection