CSiT: A Multiscale Vision Transformer for Hyperspectral Image Classification.

Wenxuan He, Weiliang Huang, Shuhong Liao, Zhen Xu, Jingwen Yan
DOI: https://doi.org/10.1109/jstars.2022.3216335
IF: 4.715
2022-01-01
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Abstract: The hyperspectral image (HSI) has nearly continuous spectral information; thus, the target of interest can be accurately identified by the subtle details of its spectral properties. Spectral resolution at different scales can capture different levels of spectral features: small-scale spectral bands are beneficial for extracting global details in vision transformers, while large-scale spectral bands are more effective for local features. Transformers show advantages in global information extraction through the self-attention module and even surpass convolutional neural networks (CNNs) in various tasks. Some works based on the vision transformer have performed surprisingly well in HSI classification. However, single-scale vision transformers are insufficient to balance the extraction of local details and redundancy at different scales. A recent multiscale vision transformer has provided a solution using spatial patch-wise features in image classification. Inspired by this, we propose the cross-spectral vision transformer (CSiT), which uses two branches to extract pixel-wise multiscale features, and further design a multiscale spectral embedding module to enhance local details between neighboring spectral bands. Moreover, based on the cross-attention operation, a single token from each branch is used as a query to exchange information with the other branch. We evaluate the classification performance of the proposed CSiT on three classic HSI datasets with extensive experiments, showing that the multiscale vision transformer architecture yields promising results for HSI classification with 1-D spectral bands.
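The two-branch idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the band group sizes (4 and 20), the embedding width, the random projection weights, and the use of a mean-pooled token as each branch's single summary token are all assumptions made for illustration. Learned query/key/value projections and multi-head attention are omitted for brevity.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed_spectrum(spectrum, band_size, dim, rng):
    """Split a 1-D spectrum into non-overlapping groups of `band_size`
    bands and linearly project each group to `dim` features.
    Hypothetical stand-in for a spectral embedding module."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(band_size)]
               for _ in range(dim)]
    tokens = []
    for start in range(0, len(spectrum) - band_size + 1, band_size):
        group = spectrum[start:start + band_size]
        tokens.append([sum(w * x for w, x in zip(row, group))
                       for row in weights])
    return tokens


def mean_token(tokens):
    """Assumed stand-in for a branch's single summary (class) token."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def cross_attention(query_token, other_tokens):
    """Single-head attention in which one query token attends over
    itself plus the tokens of the other branch (identity projections)."""
    dim = len(query_token)
    keys = [query_token] + other_tokens
    scores = [sum(q * k for q, k in zip(query_token, key)) / math.sqrt(dim)
              for key in keys]
    attn = softmax(scores)
    attended = [sum(a * key[i] for a, key in zip(attn, keys))
                for i in range(dim)]
    # Residual connection: the query token is updated, not replaced.
    fused = [q + v for q, v in zip(query_token, attended)]
    return fused, attn


rng = random.Random(0)
spectrum = [rng.random() for _ in range(200)]  # one pixel, 200 bands (assumed)
dim = 8

small_tokens = embed_spectrum(spectrum, band_size=4, dim=dim, rng=rng)
large_tokens = embed_spectrum(spectrum, band_size=20, dim=dim, rng=rng)

# Each branch's summary token queries the other branch's tokens.
small_fused, small_attn = cross_attention(mean_token(small_tokens), large_tokens)
large_fused, large_attn = cross_attention(mean_token(large_tokens), small_tokens)
```

Because only one token per branch acts as the query, the cost of each exchange grows linearly with the number of tokens in the other branch rather than quadratically.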