Abstract: In recent years, convolutional neural network (CNN)-based methods have been widely used for remote sensing (RS) scene classification and have achieved excellent results. However, CNNs are not good at exploring contextual information, which is essential for fully understanding RS scenes. The transformer, a newer model that is adept at mining the latent contextual information in RS scenes, has attracted researchers' attention as a way to address this problem. Nevertheless, because the contents of RS scenes are diverse in type and varied in scale, the performance of the original transformer in RS scene classification falls short of expectations. In addition, because of its self-attention mechanism, the transformer has high time costs, which hinders its practicality in the RS community. To overcome these limitations, we propose in this article a new model named efficient multiscale transformer and cross-level attention learning (EMTCAL) for RS scene classification. EMTCAL combines the advantages of the CNN and the transformer to fully mine the information within RS scenes. First, it uses a multilayer feature extraction module (MFEM) to acquire global visual features and multilevel convolutional features from RS scenes. Second, a contextual information extraction module (CIEM) is proposed to capture rich contextual information from the multilevel features. Within CIEM, taking the characteristics of RS scenes and the computational complexity into account, we propose an efficient multiscale transformer (EMST). EMST can mine the abundant multiscale knowledge hidden in RS scenes and model its inherent relations at a low time cost. Third, a cross-level attention module (CLAM) is developed to aggregate the multilevel features and explore their correlations. Finally, a class score fusion module (CSFM) is designed to integrate the contributions of the global features and the aggregated multilevel features for discriminative scene representations.
Extensive experiments are conducted on three public RS scene datasets. The positive results demonstrate that EMTCAL achieves superior classification performance and outperforms many state-of-the-art methods. Our source code is available at https://github.com/TangXu-Group/Remote-Sensing-Images-Classification/tree/main/EMTCAL.
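To make the idea of class score fusion more concrete, the following is a minimal sketch of one way two branches' class scores could be combined. It is not the authors' CSFM: the abstract does not specify the fusion rule, so the convex combination of softmax scores, the function names `softmax` and `fuse_class_scores`, and the weight `alpha` are all assumptions chosen purely for illustration.

```python
import math


def softmax(logits):
    """Convert raw class logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_class_scores(global_logits, multilevel_logits, alpha=0.5):
    """Hypothetical fusion rule: weighted average of two branches' class scores.

    global_logits     -- logits from a global-feature branch (assumed input)
    multilevel_logits -- logits from an aggregated multilevel branch (assumed input)
    alpha             -- weight of the global branch, in [0, 1] (assumed parameter)
    """
    if len(global_logits) != len(multilevel_logits):
        raise ValueError("both branches must score the same set of classes")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p_global = softmax(global_logits)
    p_multi = softmax(multilevel_logits)
    # A convex combination keeps the fused scores a valid probability distribution.
    return [alpha * g + (1.0 - alpha) * m for g, m in zip(p_global, p_multi)]


# Toy example with three scene classes; the logits are made up.
scores = fuse_class_scores([2.0, 0.5, 0.1], [1.0, 1.5, 0.2], alpha=0.5)
predicted = max(range(len(scores)), key=scores.__getitem__)
```

In this toy example the two branches disagree on the top class, and the fused scores decide the prediction. A learned or per-class weighting would be an equally plausible design; the fixed `alpha` here only keeps the sketch short.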

FCT: Fusing CNN and Transformer for Scene Classification

A Novel Transformer Network with a CNN-Enhanced Cross-Attention Mechanism for Hyperspectral Image Classification

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

CTMFNet: CNN and Transformer Multiscale Fusion Network of Remote Sensing Urban Scene Imagery

CCTSS: the Combination of CNN and Transformer with Shared Sublayer for Detection and Classification

CNN and Transformer Fusion for Remote Sensing Image Semantic Segmentation

TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

SwinHCST: a deep learning network architecture for scene classification of remote sensing images based on improved CNN and Transformer

Transformer based on channel-spatial attention for accurate classification of scenes in remote sensing image

IFTSDNet: An Interact-Feature Transformer Network With Spatial Detail Enhancement Module for Change Detection

DctViT: Discrete Cosine Transform Meet Vision Transformers

CMT: Convolutional Neural Networks Meet Vision Transformers

EMTCAL: Efficient Multiscale Transformer and Cross-Level Attention Learning for Remote Sensing Scene Classification

A multimodal hyper-fusion transformer for remote sensing image classification

Relating CNN-Transformer Fusion Network for Change Detection

Bridging CNN and Transformer With Cross-Attention Fusion Network for Hyperspectral Image Classification

CTFuseNet: A Multi-Scale CNN-Transformer Feature Fused Network for Crop Type Segmentation on UAV Remote Sensing Imagery

CITNet: Convolution Interaction Transformer Network for Hyperspectral and LiDAR Image Classification

3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification

When CNN meet with ViT: decision-level feature fusion for camouflaged object detection