SCViT: A Spatial-Channel Feature Preserving Vision Transformer for Remote Sensing Image Scene Classification
Pengyuan Lv, Wenjun Wu, Yanfei Zhong, Fang Du, Liangpei Zhang
DOI: https://doi.org/10.1109/tgrs.2022.3157671
IF: 8.2
2022-01-01
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Convolutional neural network (CNN)-based methods are widely used in remote sensing image scene classification and can achieve excellent performance. However, the stacked receptive fields in CNN-based methods are limited in modeling the long-range dependencies of local features. The vision transformer (ViT) model provides a good solution, as it directly considers the global interactions of local patches through the self-attention mechanism. The vanilla ViT model, however, simply splits images into fixed-size patches treated as tokens and mainly considers the global information in the spatial domain. In this article, a spatial-channel feature preserving ViT (SCViT) model is proposed, which considers both the detailed geometric information of high-spatial-resolution (HSR) imagery and the contribution of the different channels contained in the classification token. First, in the proposed method, tokens are generated by progressively aggregating neighboring overlapping patches to extract the local structural features of the imagery. Second, a multihead self-attention (MSA) mechanism is used to model the global interactions of the tokens in the encoder. A lightweight channel attention (LCA) module is then introduced to consider the importance of the different channels in the classification token. Finally, a multilayer perceptron (MLP) is used to produce the final results. Experimental comparisons with state-of-the-art scene classification methods confirm the potential of using ViT models in remote sensing image scene classification.
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics
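
The processing steps named in the abstract (overlapping-patch tokens, self-attention over the tokens, channel weighting of the classification token) can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no layer sizes or formulas, so the sketch assumes a single aggregation stage instead of a progressive one, a single attention head with identity projections instead of learned multihead projections, and a per-channel sigmoid gate as a hypothetical stand-in for the LCA module. The final MLP is omitted, and all function names are illustrative.

```python
import math


def overlapping_patches(image, size, stride):
    """Slide a size x size window over a 2D grid of pixel values.

    A stride smaller than size makes neighboring patches overlap.
    Each patch is flattened row by row into one token.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patches.append([image[top + r][left + c]
                            for r in range(size) for c in range(size)])
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a real encoder learns these projections.
    """
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        attended.append([sum(w * value[j] for w, value in zip(weights, tokens))
                         for j in range(dim)])
    return attended


def channel_gate(token, scale, bias):
    """Hypothetical channel weighting: multiply each channel by a sigmoid gate."""
    gates = [1.0 / (1.0 + math.exp(-(s * x + b)))
             for x, s, b in zip(token, scale, bias)]
    return [g * x for g, x in zip(gates, token)]


# Toy 6 x 6 single-band "image" with values in [0, 1).
image = [[(r * 6 + c) / 36.0 for c in range(6)] for r in range(6)]

# 4 x 4 windows with stride 2 overlap by two pixels in each direction.
patches = overlapping_patches(image, size=4, stride=2)

# Prepend a classification token (all zeros here; learned in practice).
tokens = [[0.0] * 16] + patches
attended = self_attention(tokens)

# Reweight the channels of the classification token only.
class_token = attended[0]
gated = channel_gate(class_token, scale=[1.0] * 16, bias=[0.0] * 16)
```

With a window of 4 and a stride of 2 on a 6 x 6 grid, the sketch yields four overlapping 16-value tokens. Because the classification token starts at zero and no projections are learned, its attention weights are uniform, so its output is simply the mean of all tokens; a trained model would weight the tokens unevenly.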