Abstract: Hyperspectral images (HSIs) contain abundant information in the spatial and spectral domains, allowing for a precise characterization of material categories. Convolutional neural networks (CNNs) have achieved great success in HSI classification, owing to their excellent ability to model local context. However, CNNs rely on fixed filter weights and deep stacks of convolutional layers, which lead to a limited receptive field and a high computational burden. The recent vision transformer (ViT) models long-range dependencies with a self-attention mechanism and has become an alternative to the CNN backbones traditionally used in HSI classification. However, such transformer-based architectures treat every input pixel in the receptive field as a feature token during feature embedding and self-attention, which inevitably limits their ability to learn multiscale features and increases the computational cost. To overcome this issue, we propose a local semantic feature aggregation-based transformer (LSFAT) architecture, which allows transformers to represent long-range dependencies of multiscale features more efficiently. We introduce the concept of the homogeneous region into the transformer through a pixel aggregation strategy and further propose neighborhood-aggregation-based embedding (NAE) and attention (NAA) modules, which adaptively form multiscale features and capture the local spatial semantics among them in a hierarchical transformer architecture. A reusable classification token is included together with the feature tokens in the attention calculation. In the last stage, a fully connected layer performs classification on the reusable token after transformer encoding. We verify the effectiveness of the NAE and NAA modules compared with the traditional ViT through extensive experiments. Our results demonstrate the excellent classification performance of the proposed method in comparison with other state-of-the-art approaches on several public HSI datasets.
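The two ideas named in the abstract, aggregating neighboring pixels into fewer tokens and attending with a classification token alongside the feature tokens, can be illustrated with a toy sketch. This is not the paper's NAE or NAA module: those are described as adaptive, whereas the sketch below assumes fixed, non-overlapping average pooling and identity query/key/value projections with no learned weights. The function names `aggregate_neighborhoods` and `attend_class_token` are hypothetical and chosen only for illustration.

```python
import math


def aggregate_neighborhoods(patch, k):
    """Average non-overlapping k x k pixel neighborhoods into tokens.

    patch is an H x W x C nested list (rows, columns, spectral bands).
    Assumption: fixed average pooling stands in for adaptive aggregation.
    """
    h, w, c = len(patch), len(patch[0]), len(patch[0][0])
    tokens = []
    for i in range(0, h - h % k, k):
        for j in range(0, w - w % k, k):
            acc = [0.0] * c
            for di in range(k):
                for dj in range(k):
                    for b in range(c):
                        acc[b] += patch[i + di][j + dj][b]
            tokens.append([v / (k * k) for v in acc])
    return tokens


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend_class_token(cls_token, tokens):
    """Update the class token by scaled dot-product attention.

    The class token is the query; keys and values are the class token
    plus all feature tokens. Assumption: identity projections are used
    instead of learned query/key/value matrices.
    """
    seq = [cls_token] + tokens
    d = len(cls_token)
    scores = [
        sum(a * b for a, b in zip(cls_token, t)) / math.sqrt(d) for t in seq
    ]
    weights = softmax(scores)
    updated = [
        sum(wt * t[b] for wt, t in zip(weights, seq)) for b in range(d)
    ]
    return updated, weights


# Toy 4 x 4 patch with 3 spectral bands; 2 x 2 aggregation yields 4 tokens
# instead of the 16 per-pixel tokens a plain ViT-style embedding would use.
patch = [[[float(r + c + b) for b in range(3)] for c in range(4)] for r in range(4)]
tokens = aggregate_neighborhoods(patch, 2)
cls_token = [0.0, 0.0, 0.0]
updated_cls, weights = attend_class_token(cls_token, tokens)
```

In this toy setting, aggregation cuts the token count from 16 to 4, so the attention step compares the class token against 5 entries rather than 17. A hierarchical design would repeat the aggregation at several stages and pass the same class token through each of them, with a final classifier applied to that token.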

Hybrid Vision Transformer Model for Hyperspectral Image Classification

A Novel Transformer Network with a CNN-Enhanced Cross-Attention Mechanism for Hyperspectral Image Classification

Hybrid Conv-ViT Network for Hyperspectral Image Classification

Tripartite‐structure transformer for hyperspectral image classification

Hyperspectral Image Classification Using Hierarchical Spatial-Spectral Transformer

Joint Multi-Scale CNN and Vision Transformer for Hyperspectral Image Classification

CNN and Transformer Hybrid Network for Hyperspectral Image Classification

Multiple Vision Architectures-Based Hybrid Network for Hyperspectral Image Classification

Hyperspectral Image Classification Using Group-Aware Hierarchical Transformer

Cross-Domain Hyperspectral Image Classification Based on Transformer

A Locally Enhanced Transformer Network for Hyperspectral Image Classification

LGFormer: Local-to-Global Transformer for Hyperspectral Image Classification

Convolution Transformer Fusion Splicing Network for Hyperspectral Image Classification

Hierarchical Attention Transformer for Hyperspectral Image Classification

CNN and Transformer interaction network for hyperspectral image classification

Multi-granularity Vision Transformer Via Semantic Token for Hyperspectral Image Classification

Hyperspectral Image Transformer Classification Networks

Local Semantic Feature Aggregation-Based Transformer for Hyperspectral Image Classification

Convolution-Transformer Adaptive Fusion Network for Hyperspectral Image Classification

Multiattention Joint Convolution Feature Representation with Lightweight Transformer for Hyperspectral Image Classification

A hybrid convolution transformer for hyperspectral image classification