Remote Sensing Scene Classification Based on Local Selection Vision Transformer

Yang Kai, Lu Xiaoqiang
DOI: https://doi.org/10.3788/lop230539
2023-01-01
Laser & Optoelectronics Progress
Abstract: Remote sensing scene classification aims to assign specific semantic labels to aerial images and is a fundamental task in remote sensing image interpretation. Existing studies have used convolutional neural networks (CNNs) to learn global and local features and improve the discriminative representation of networks. However, the limited receptive field of CNN-based approaches restricts their ability to model long-range dependencies among local features. In recent years, the Vision Transformer (ViT) has shown strong performance in traditional classification tasks. Its self-attention mechanism connects each patch with a classification token and captures the contextual relationships between image pixels by considering global information in the spatial domain. In this paper, we propose a remote sensing scene classification network based on a local selection ViT. An input image is first split into small patches, which are flattened and converted into sequences with position encoding; the resulting sequences are then fed into an encoder. In addition, a local selection module is added before the input of the last layer to learn local discriminative features: tokens with discriminative properties are selected as the input to that layer to obtain the final classification output. Experimental results show that the proposed method achieves good results on two large remote sensing scene classification datasets (AID and NWPU).
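The two steps the abstract describes, splitting an image into patches and selecting discriminative tokens before the last layer, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state the selection criterion, so the per-token score (`scores`), the function names `patchify` and `select_local_tokens`, and the top-`k` rule are all assumptions made for illustration. Position encoding and the encoder itself are omitted.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> (gh * gw, p * p * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def select_local_tokens(tokens, scores, k):
    """Keep the class token plus the k highest-scoring patch tokens.

    tokens: (1 + N, D) array whose row 0 is the class token.
    scores: (N,) array, one hypothetical importance score per patch token
            (ASSUMPTION: the paper's actual criterion is not given in the
            abstract; attention from the class token is one common choice).
    Returns the reduced token sequence and the kept patch indices.
    """
    top = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    top = np.sort(top)                  # restore the original patch order
    selected = np.concatenate([tokens[:1], tokens[1 + top]], axis=0)
    return selected, top
```

For example, a 224 x 224 RGB image with `patch_size=16` yields 196 patches of length 768, and calling `select_local_tokens` with `k=12` would pass 13 tokens (the class token plus 12 patch tokens) to the final layer.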