Abstract: Vision transformer (ViT)-based deep learning (DL) models have recently achieved remarkable performance gains in hyperspectral image classification (HSIC) owing to their ability to model long-range dependencies and extract global spatial features. However, ViT is built from a stack of Transformer blocks and must learn a large number of parameters when processing hyperspectral data. Moreover, the global correlation modeling inherent to the Transformer neglects the effective representation of local spatial and spectral features. To address these issues, we propose a lightweight ViT network called the groupwise separable convolutional ViT (GSC-ViT). First, a groupwise separable convolution (GSC) module, which combines grouped pointwise convolution (GPWC) and group convolution, is designed to significantly reduce the number of convolutional kernel parameters and to effectively capture local spectral–spatial information in hyperspectral images (HSIs). Second, a groupwise separable multihead self-attention (GSSA) module replaces the conventional multihead self-attention (MSA) in ViT; within it, groupwise self-attention (GSA) provides local spatial feature extraction and pointwise self-attention (PWSA) provides global spatial feature extraction. Third, a simple pointwise layer with an enhanced skip connection mechanism replaces the multilayer perceptron (MLP) layer in all Transformer blocks of ViT, eliminating unnecessary nonlinear transformations and facilitating the fusion of features derived from the GSC and GSSA modules. Extensive experiments on four benchmark hyperspectral datasets show that GSC-ViT achieves strong classification performance with relatively few training samples compared with several existing HSIC approaches. The source code is available at https://github.com/flyzzie/TGRS-GSC-VIT.
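
The parameter saving claimed for the GSC module can be illustrated with a simple weight count. The sketch below is not taken from the paper: the channel widths, kernel size, and group counts are hypothetical values chosen only for illustration, and `conv_params` is a helper introduced here. It compares one standard convolution against a grouped pointwise (1×1) convolution followed by a grouped spatial convolution, ignoring bias terms.

```python
def conv_params(c_in, c_out, kernel_size, groups=1):
    """Weight count of a 2-D convolution layer, bias excluded.

    Each output channel only sees c_in / groups input channels,
    so grouping divides the weight count by `groups`.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out * kernel_size * kernel_size


# Hypothetical sizes (assumptions, not values from the paper).
C_IN, C_OUT, K = 64, 64, 3
POINTWISE_GROUPS = 4   # groups used by the 1x1 grouped pointwise conv
SPATIAL_GROUPS = 8     # groups used by the KxK group conv

# Baseline: a single dense KxK convolution.
standard = conv_params(C_IN, C_OUT, K)

# Separable variant: grouped 1x1 conv, then grouped KxK conv.
gpwc = conv_params(C_IN, C_OUT, 1, groups=POINTWISE_GROUPS)
group_conv = conv_params(C_OUT, C_OUT, K, groups=SPATIAL_GROUPS)
separable = gpwc + group_conv

print("standard conv weights: ", standard)
print("GPWC + group conv:     ", separable)
print("reduction factor: %.2f" % (standard / separable))
```

Under these assumed sizes the separable pair needs far fewer weights than the dense convolution; the actual ratio in GSC-ViT depends on the group counts and channel widths used by the authors.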

Hyperspectral Image Classification Using Groupwise Separable Convolutional Vision Transformer Network
