Le-BEiT: A Local-Enhanced Self-Supervised Transformer for Semantic Segmentation of High Resolution Remote Sensing Images.

Yifei Huang, Zideng Feng, Junli Yang, Bin Wang, Jiaying Wang, Zhenglin Xian
DOI: https://doi.org/10.1109/icip46576.2022.9897710
2022-01-01
Abstract: Semantic segmentation of remote sensing images (RSI) has long been an active research topic. Existing supervised learning methods usually require a large amount of labeled data. Meanwhile, the large image size, variation in object scales, and intricate details of RSI make it essential to capture both long-range context and local information. To address these problems, we propose Le-BEiT, a self-supervised Transformer with an improved positional encoding, Local-Enhanced Positional Encoding (LePE). Self-supervised learning relieves the need for a large amount of labeled data, and the self-attention mechanism in the Transformer has a remarkable capability for capturing long-range context. We use LePE as a substitute for Relative Positional Encoding (RPE) to represent local information more effectively. Moreover, considering the domain difference between natural images and RSI, we pre-train Le-BEiT on a very small high-resolution RSI dataset, GID, instead of ImageNet-22K. To investigate the influence of pre-training dataset size on segmentation accuracy, we further conduct experiments on a larger pre-training dataset called GID-DOTA, which is 1/100 the size of ImageNet-22K, and observe considerable accuracy improvements. Our method, which relies on a much smaller pre-training dataset, achieves accuracy competitive with its counterpart pre-trained on ImageNet-22K.
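The abstract does not spell out how LePE is computed. As a rough illustration only, the sketch below assumes the formulation commonly associated with LePE, in which a depthwise convolution over the value tensor is added to the self-attention output, so that each token also receives a learned mix of its spatial neighbors. The function names, the single-head full attention, and the 3x3 kernel size are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depthwise_conv3x3(v_map, kernel):
    # v_map: (H, W, C) value tokens laid out on the patch grid.
    # kernel: (3, 3, C), one 3x3 filter per channel (depthwise).
    h, w, _ = v_map.shape
    padded = np.pad(v_map, ((1, 1), (1, 1), (0, 0)))  # zero padding
    out = np.zeros_like(v_map)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernel[dy, dx, :]
    return out

def attention_with_lepe(q, k, v, kernel, h, w):
    # q, k, v: (H*W, C) token matrices for a single head (assumed).
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))   # long-range context
    context = attn @ v
    # Local enhancement term: depthwise convolution applied to V.
    local = depthwise_conv3x3(v.reshape(h, w, d), kernel).reshape(h * w, d)
    return context + local
```

With an all-zero kernel the local term vanishes and the function reduces to plain scaled dot-product attention, which makes the role of the added term easy to isolate.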