S4Former: A Spectral–Spatial Sparse Selection Transformer for Multispectral Remote Sensing Scene Classification

Nan Wu, Jiezhi Lv, Wei Jin
DOI: https://doi.org/10.1109/lgrs.2024.3365509
IF: 5.343
2024-02-23
IEEE Geoscience and Remote Sensing Letters
Abstract: Remote sensing scene classification (RSSC) with multispectral images (MSIs) is a fundamental task for numerous applications in remote sensing. Vision transformers (ViTs), aided by a strong multihead self-attention (MSA) mechanism, have achieved state-of-the-art results in capturing global information, and improved ViT-based models have further advanced performance in RSSC. However, existing methods often overlook the sparsity of, and the interconnections among, the spectral, spatial, and semantic aspects, leading to token over-smoothing and increased data and computational demands. As an efficient optimization tool, sparse representation (SR) emphasizes key features in high-dimensional data via a set of templates known as a dictionary. In this letter, we present the spectral–spatial sparse selection transformer (S4Former), an SR-based network for sparse token modeling with two key upgrades to ViT's MSA and feed-forward layers: multispectral subspace local token selection (MSLS) and sparse-iterative global token optimization (SGTO). Specifically, MSLS maps multispectral data into different subspaces and compresses information through intraspectral self-attention, thereby increasing token diversity and discriminability. SGTO employs deep unfolding of an iterative shrinkage-thresholding algorithm (ISTA)-like optimization to model the SR optimization process, progressively highlighting key features and semantically relevant sparse responses to avoid token over-smoothing. Validation on two multispectral datasets with limited training samples demonstrates the efficacy of S4Former. With 5% and 10% training data, S4Former achieved 91.46% and 92.84% accuracy on EuroSAT, and 91.55% and 94.41% on Nasc-TG2, respectively. Interestingly, by visualizing the top-layer attention maps of S4Former for different spectral combinations, we demonstrate that S4Former exhibits intuitive interpretability in capturing complementary and semantically consistent spectral information.
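For readers unfamiliar with ISTA, the following is a minimal sketch of the classical algorithm that the abstract says SGTO unfolds. It is not the S4Former implementation: the dictionary `D`, the penalty `lam`, the step size `step`, and all function names are illustrative assumptions, and the abstract does not specify how SGTO parameterizes them. Deep unfolding, in general, treats a fixed number of such iterations as network layers whose parameters are learned.

```python
def soft_threshold(v, tau):
    """Elementwise soft-thresholding: shrink each entry toward zero by tau."""
    return [max(abs(a) - tau, 0.0) * (1.0 if a > 0 else -1.0) for a in v]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * a for m, a in zip(row, v)) for row in M]


def transpose(M):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def ista(D, x, lam=0.1, step=0.5, n_iters=100):
    """Approximately minimise 0.5 * ||x - D z||^2 + lam * ||z||_1 over z.

    Assumption: `step` is at most 1 / (largest eigenvalue of D^T D),
    which is the usual condition for the iteration to converge.
    """
    Dt = transpose(D)
    z = [0.0] * len(D[0])  # start from the all-zero sparse code
    for _ in range(n_iters):
        # Gradient step on the quadratic data-fit term.
        residual = [xi - ri for xi, ri in zip(x, matvec(D, z))]
        grad_step = [zi + step * gi for zi, gi in zip(z, matvec(Dt, residual))]
        # Shrinkage step: small responses are set exactly to zero.
        z = soft_threshold(grad_step, step * lam)
    return z
```

With an identity dictionary and `step=1.0`, the result reduces to soft-thresholding of `x` itself, which makes the sparsifying effect easy to inspect: entries with magnitude below `lam` become exactly zero, and the rest shrink by `lam`.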
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics