Abstract: Remote sensing image classification (RSIC) is a classical and fundamental task in the intelligent interpretation of remote sensing imagery, providing a unique label for each acquired image. Thanks to the strong global context extraction ability of the multi-head self-attention (MSA) mechanism, vision transformer (ViT)-based architectures have shown excellent capability in natural scene image classification. However, capturing global spatial information alone is insufficient for strong RSIC performance. Specifically, for fine-grained target recognition tasks with high inter-class similarity, discriminative and effective local feature representations are key to correct classification. In addition, because ViT lacks inductive biases, its global spatial context representation capability requires lengthy training procedures and large-scale pre-training data. To address these problems, a hybrid architecture of convolutional neural network (CNN) and ViT, called P2FEViT, is proposed to improve RSIC performance by integrating plug-and-play CNN features with ViT. In this paper, the feature representation capabilities of CNN and ViT as applied to RSIC are first analyzed. Second, to combine the advantages of CNN and ViT, a novel approach that embeds CNN features into the ViT architecture is proposed, which enables the model to simultaneously capture and fuse global context and local multimodal information, further improving the classification capability of ViT. Third, based on the hybrid structure, only a simple cross-entropy loss is employed for model training, and the model converges rapidly and smoothly with relatively less training data than the original ViT. Finally, extensive experiments are conducted on the public and challenging remote sensing scene classification dataset NWPU-RESISC45 (NWPU-R45) and on a self-built fine-grained target classification dataset called BIT-AFGR50. The experimental results demonstrate that the proposed P2FEViT effectively improves feature description capability and achieves outstanding image classification performance, while significantly reducing ViT's dependence on large-scale pre-training data and accelerating convergence. The code and self-built dataset will be released on our webpage.
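The abstract states that training uses only a plain cross-entropy loss. As a minimal sketch of what that loss computes, assuming a softmax over per-class logits (the function names and the example logits below are illustrative and not taken from the paper or its code):

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max logit for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def batch_loss(batch_logits, targets):
    """Mean cross-entropy over a batch of (logits, target) pairs."""
    losses = [softmax_cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# Hypothetical example: an untrained classifier that outputs equal logits
# for all 45 scene classes of NWPU-RESISC45 has loss log(45).
uniform_loss = softmax_cross_entropy([0.0] * 45, target=3)
```

A model that raises the logit of the correct class relative to the others drives this value toward zero, which is the only training signal the abstract describes.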

CapViT: Cross-context capsule vision transformers for land cover classification with airborne multispectral LiDAR data

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

Extended Vision Transformer (ExViT) for Land Use and Land Cover Classification: A Multimodal Deep Learning Framework

Land Cover Classification of Multispectral LiDAR Data With an Efficient Self-Attention Capsule Network

CVTNet: A Cross-View Transformer Network for Place Recognition Using LiDAR Data

SCViT: A Spatial-Channel Feature Preserving Vision Transformer for Remote Sensing Image Scene Classification

Converging Channel Attention Mechanisms with Multilayer Perceptron Parallel Networks for Land Cover Classification

Multimodal Fusion Transformer for Remote Sensing Image Classification

P2FEViT: Plug-and-Play CNN Feature Embedded Hybrid Vision Transformer for Remote Sensing Image Classification

Coupling video vision transformer (ViVit) into land change simulation: a comparison with three-dimensional convolutional neural network (3DCNN)

Cross Hyperspectral and LiDAR Attention Transformer: An Extended Self-Attention for Land Use and Land Cover Classification

Joint Classification of Hyperspectral Images and LiDAR Data Based on Dual-Branch Transformer

Vision Transformer for Multispectral Satellite Imagery: Advancing Landcover Classification

Cross-Resolution Land Cover Classification Using Outdated Products and Transformers

Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification

Classification of hyperspectral and LiDAR data by transformer-based enhancement

Cross-scale Vision Transformer for crowd localization

CViTF-Net: A Convolutional and Visual Transformer Fusion Network for Small Ship Target Detection in Synthetic Aperture Radar Images

A Joint Convolutional Cross ViT Network for Hyperspectral and Light Detection and Ranging Fusion Classification

Extracting Building Footprint From Remote Sensing Images by an Enhanced Vision Transformer Network