Abstract: In recent years, convolutional neural network (CNN)-based methods have been widely used for remote sensing (RS) scene classification and have achieved excellent results. However, CNNs are not good at exploring contextual information, which is essential for fully understanding RS scenes. The transformer, a newer model that is skilled at mining the latent contextual information in RS scenes, has attracted researchers' attention as a way to address this problem. Nevertheless, because the contents of RS scenes are diverse in type and vary in scale, the original transformer does not reach the expected performance in RS scene classification. In addition, because of its self-attention mechanism, the transformer has high time costs, which hinders its practicality in the RS community. To overcome these limitations, we propose a new model named efficient multiscale transformer and cross-level attention learning (EMTCAL) for RS scene classification in this article. EMTCAL combines the advantages of the CNN and the transformer to mine the information within RS scenes fully. First, it uses a multilayer feature extraction module (MFEM) to acquire global visual features and multilevel convolutional features from RS scenes. Second, a contextual information extraction module (CIEM) is proposed to capture rich contextual information from the multilevel features. In CIEM, taking the characteristics of RS scenes and the computational complexity into account, we propose an efficient multiscale transformer (EMST). EMST can mine the abundant multiscale knowledge hidden in RS scenes and model its inherent relations at a small time cost. Third, a cross-level attention module (CLAM) is developed to aggregate the multilevel features and explore their correlations. Finally, a class score fusion module (CSFM) is designed to integrate the contributions of the global features and the aggregated multilevel features into discriminative scene representations. Extensive experiments are conducted on three public RS scene datasets. The positive results demonstrate that EMTCAL achieves superior classification performance and outperforms many state-of-the-art methods. Our source code is available at https://github.com/TangXu-Group/Remote-Sensing-Images-Classification/tree/main/EMTCAL.
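
To make the cost argument in the abstract concrete, the following is a minimal sketch of why self-attention is expensive and how a multiscale variant could reduce that cost. It is not the authors' EMST. It assumes one common strategy: queries stay at full resolution, while keys and values are average-pooled at several window sizes and concatenated. The function names (`attention`, `avg_pool_tokens`, `multiscale_attention`), the 1-D pooling, and the window sizes are all hypothetical choices made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        # One score per (query, key) pair: this is the quadratic part.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def avg_pool_tokens(tokens, window):
    """Average consecutive groups of `window` tokens (1-D pooling, an assumption)."""
    pooled = []
    for i in range(0, len(tokens), window):
        group = tokens[i:i + window]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def multiscale_attention(tokens, windows=(2, 4)):
    """Full-resolution queries attend to keys/values pooled at several scales.

    Returns the attended tokens and the number of query-key scores computed.
    """
    pooled = []
    for w in windows:
        pooled.extend(avg_pool_tokens(tokens, w))
    return attention(tokens, pooled, pooled), len(tokens) * len(pooled)
```

For 16 input tokens, plain self-attention computes 16 × 16 = 256 query-key scores. With the hypothetical windows of 2 and 4, the pooled sequence has 8 + 4 = 12 tokens, so the sketch computes 16 × 12 = 192 scores, and each query still sees context at two scales.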

A Multiscale Grouping Transformer With CLIP Latents for Remote Sensing Image Captioning

A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning

TypeFormer: Multiscale Transformer With Type Controller for Remote Sensing Image Caption

Cooperative Connection Transformer for Remote Sensing Image Captioning

Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer

A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning

EMTCAL: Efficient Multiscale Transformer and Cross-Level Attention Learning for Remote Sensing Scene Classification

Enhanced Window-Based Self-Attention with Global and Multi-Scale Representations for Remote Sensing Image Super-Resolution

A multimodal hyper-fusion transformer for remote sensing image classification

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Dual-level Collaborative Transformer for Image Captioning

Entangled Transformer for Image Captioning

Transformer with multi-level grid features and depth pooling for image captioning

Multiscale Global Context Network for Semantic Segmentation of High-Resolution Remote Sensing Images

Spatial–Channel Attention Transformer With Pseudo Regions for Remote Sensing Image-Text Retrieval

Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion

Multihead Global Attention and Spatial Spectral Information Fusion for Remote Sensing Image Compression

Co-Training Transformer for Remote Sensing Image Classification, Segmentation, and Detection

Context-Aware Transformer for image captioning

Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset

Image Captioning In the Transformer Age