Abstract: Semantic segmentation of remote sensing images is an efficient method for agricultural crop classification. Recent crop segmentation solutions are mainly deep-learning-based and follow two mainstream architectures: Convolutional Neural Networks (CNNs) and Transformers. However, neither architecture is well suited to the crop segmentation task, for three reasons. First, ultra-high-resolution images must be cut into small patches before processing, which leaves the edges of different categories structurally incomplete. Second, because global information is lacking, categories inside a crop field may be misclassified. Third, to restore complete images, the patches must be spliced back together, which causes edge artifacts, small misclassified objects, and holes. Therefore, we propose a novel architecture named the Coupled CNN and Transformer Network (CCTNet), which combines the local details (e.g., edge and texture) captured by the CNN with the global context captured by the Transformer to cope with these problems. In particular, two modules, the Light Adaptive Fusion Module (LAFM) and the Coupled Attention Fusion Module (CAFM), are designed to fuse these advantages efficiently. In addition, three methods, Overlapping Sliding Window (OSW), Testing Time Augmentation (TTA), and Post-Processing (PP), are proposed for the inference stage to remove small objects and holes and to restore complete images. Experimental results on the Barley Remote Sensing Dataset show that CCTNet outperforms the single CNN or Transformer methods, achieving a mean Intersection over Union (mIoU) of 72.97%. We therefore believe that the proposed CCTNet can be a competitive method for crop segmentation of remote sensing images.
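
The abstract names Overlapping Sliding Window (OSW) inference but does not describe its details. The following is a minimal sketch of one common way to realize the idea, not the authors' implementation: overlapping patches are predicted independently, their per-class scores are accumulated on a full-size canvas, and each pixel takes the class with the highest averaged score. The function name `sliding_window_predict`, the callback `predict_fn`, and the `patch` and `stride` values are all assumptions made for illustration.

```python
import numpy as np


def sliding_window_predict(image, predict_fn, num_classes, patch=256, stride=128):
    """Average per-class scores from overlapping patches over a full image.

    `predict_fn` is a hypothetical stand-in for a segmentation model: it takes
    a (patch, patch, ...) tile and returns scores shaped (num_classes, patch, patch).
    """
    h, w = image.shape[:2]
    if h < patch or w < patch:
        raise ValueError("image must be at least one patch in each dimension")
    if not 0 < stride <= patch:
        raise ValueError("stride must be in (0, patch] so every pixel is covered")

    scores = np.zeros((num_classes, h, w), dtype=np.float64)
    counts = np.zeros((h, w), dtype=np.float64)

    def starts(size):
        # Regular grid of window origins, plus a final window flush with the border.
        origins = list(range(0, size - patch + 1, stride))
        if origins[-1] != size - patch:
            origins.append(size - patch)
        return origins

    for top in starts(h):
        for left in starts(w):
            tile = image[top:top + patch, left:left + patch]
            scores[:, top:top + patch, left:left + patch] += predict_fn(tile)
            counts[top:top + patch, left:left + patch] += 1.0

    # Pixels covered by several windows get the mean of their scores.
    return np.argmax(scores / counts, axis=0)
```

Averaging in the overlap regions is what softens seams between neighboring patches; a smaller `stride` gives more overlap at the cost of more forward passes.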

Cooperative Connection Transformer for Remote Sensing Image Captioning

Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer

A Multiscale Grouping Transformer With CLIP Latents for Remote Sensing Image Captioning

A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning

Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion

Dual-level Collaborative Transformer for Image Captioning

TypeFormer: Multiscale Transformer With Type Controller for Remote Sensing Image Caption

TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning

HCNet: Hierarchical Feature Aggregation and Cross-Modal Feature Alignment for Remote Sensing Image Captioning

Encoding Contextual Information by Interlacing Transformer and Convolution for Remote Sensing Imagery Semantic Segmentation

Exploring refined dual visual features cross-combination for image captioning

Region-Focused Network for Dense Captioning

A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning

Spatial–Channel Attention Transformer With Pseudo Regions for Remote Sensing Image-Text Retrieval

Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

Exploring and Distilling Cross-Modal Information for Image Captioning

TSFNet: Triple-Steam Image Captioning

Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset

Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation

CNN and Transformer Fusion for Remote Sensing Image Semantic Segmentation

CCTNet: Coupled CNN and Transformer Network for Crop Segmentation of Remote Sensing Images.