Abstract: We introduce VistaFormer, a lightweight Transformer-based model architecture for the semantic segmentation of remote-sensing images. This model uses a multi-scale Transformer-based encoder with a lightweight decoder that aggregates global and local attention captured in the encoder blocks. VistaFormer uses position-free self-attention layers, which simplifies the model architecture and removes the need to interpolate temporal and spatial codes, which can reduce model performance when training and testing image resolutions differ. We investigate simple techniques for filtering noisy input signals like clouds and demonstrate that improved model scalability can be achieved by substituting Multi-Head Self-Attention (MHSA) with Neighbourhood Attention (NA). Experiments on the PASTIS and MTLCC crop-type segmentation benchmarks show that VistaFormer achieves better performance than comparable models and requires only 8% of the floating point operations using MHSA and 11% using NA while also using fewer trainable parameters. VistaFormer with MHSA improves on state-of-the-art mIoU scores by 0.1% on the PASTIS benchmark and 3% on the MTLCC benchmark, while VistaFormer with NA improves on the MTLCC benchmark by 3.7%.
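The mIoU (mean intersection-over-union) scores quoted in the abstract average a per-class overlap ratio between predicted and reference labels. The following is a minimal sketch of that metric on flat label lists, not the benchmarks' evaluation code: the function name `mean_iou` is hypothetical, and skipping classes that appear in neither prediction nor reference is an assumption (PASTIS and MTLCC may handle ignored or absent classes differently).

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes.

    pred and target are equal-length sequences of integer class labels,
    one per pixel. Classes absent from both are skipped (an assumption).
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and reference say class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or reference says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes and four pixels.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(round(score, 4))
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.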

What problem does this paper attempt to address?

This paper addresses several key problems in the semantic segmentation of satellite image time series (SITS):

1. **Improving the scalability and efficiency of the model**: Existing deep-learning models for SITS data are usually computationally costly and have a large number of parameters. VistaFormer pairs a multi-scale Transformer encoder with a lightweight decoder, which reduces both the number of floating-point operations (FLOPs) and the number of trainable parameters.
2. **Dealing with input noise (such as cloud occlusion)**: Since more than 60% of the Earth's surface is covered by clouds, many inputs in a time series may be partially or completely occluded. VistaFormer improves robustness by filtering out these noisy signals with techniques such as gated convolutions.
3. **Adapting to data of different resolutions**: When training and test data differ in resolution, methods based on position encoding must interpolate their temporal and spatial codes, which can degrade performance. VistaFormer avoids this problem by using position-free self-attention layers, so no interpolation is needed.
4. **Improving the spatio-temporal attention mechanism**: To capture information across the time and space dimensions, VistaFormer can substitute Neighbourhood Attention (NA) for Multi-Head Self-Attention (MHSA). Restricting attention to a local neighbourhood lowers the cost of attention as inputs grow, and the paper reports that this substitution improves the scalability of the model.

### Main contributions

- **Performance improvement**: Experimental results show that VistaFormer outperforms comparable models on the PASTIS and MTLCC benchmarks. With MHSA, it improves on state-of-the-art mIoU scores by 0.1% on PASTIS and 3% on MTLCC; with NA, it improves on the MTLCC benchmark by 3.7%.
- **Efficiency**: VistaFormer requires only 8% of the FLOPs of comparable models when using MHSA and 11% when using NA, and it also has fewer trainable parameters, which significantly reduces computational resource requirements.
- **Simplifying the model architecture**: By removing the need for position-encoding interpolation, VistaFormer simplifies the model architecture, making the model easier to implement and deploy.

In conclusion, VistaFormer aims to provide a lightweight and efficient solution for the semantic segmentation of satellite image time series, with applications in fields such as agricultural monitoring and climate change adaptation.
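The scalability argument for Neighbourhood Attention can be illustrated by counting query-key pairs, which dominate the cost of an attention layer. The sketch below is a back-of-the-envelope comparison, not the paper's FLOP accounting: it assumes a single 2D feature map, a square `kernel_size` neighbourhood that keeps its full size at image borders, and hypothetical helper names `mhsa_pairs` and `na_pairs`.

```python
def mhsa_pairs(height, width):
    """Query-key pairs for global self-attention: every token attends to every token."""
    n = height * width
    return n * n


def na_pairs(height, width, kernel_size):
    """Query-key pairs for neighbourhood attention: each token attends only
    to a kernel_size x kernel_size window (clamped to the map size)."""
    n = height * width
    window = min(kernel_size, height) * min(kernel_size, width)
    return n * window


for side in (32, 64):
    print(side, mhsa_pairs(side, side), na_pairs(side, side, kernel_size=7))
```

Under these assumptions, doubling the side length of the feature map multiplies the global-attention pair count by 16 but the neighbourhood-attention pair count by only 4, since the latter grows linearly with the number of tokens.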

VistaFormer: Scalable Vision Transformers for Satellite Image Time Series Segmentation

Representation Separation for Semantic Segmentation with Vision Transformers

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

A convolutional vision transformer for semantic segmentation of side-scan sonar data

Vision Transformer with Sparse Scan Prior

SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation

Semantic segmentation using cross-stage feature reweighting and efficient self-attention

A Novel Mamba Architecture with a Semantic Transformer for Efficient Real-Time Remote Sensing Semantic Segmentation

UNeXt: An Efficient Network for the Semantic Segmentation of High-Resolution Remote Sensing Images

Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

Transformer-Based Semantic Segmentation for Extraction of Building Footprints from Very-High-Resolution Images

DSCAFormer: Lightweight Vision Transformer With Dual-Branch Spatial Channel Aggregation

AerialFormer: Multi-resolution Transformer for Aerial Image Segmentation

HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation

MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo

Locality-Enhanced Transformer for Semantic Segmentation of High-Resolution Remote Sensing Images

AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery

A Bio-Inspired Visual Perception Transformer for Cross-Domain Semantic Segmentation of High-Resolution Remote Sensing Images

Efficient Transformer for Remote Sensing Image Segmentation

LSRFormer: Efficient Transformer Supply Convolutional Neural Networks With Global Information for Aerial Image Segmentation