Locality-Enhanced Transformer for Semantic Segmentation of High-Resolution Remote Sensing Images.

Xin Li, Feng Xu, Runliang Xia, Nan Xu, Fan Liu, Chi Yuan, Qian Huang, Xin Lyu
DOI: https://doi.org/10.1109/ICASSP48485.2024.10446525
2024-01-01
Abstract: Transformers have emerged as a transformative tool in various computer vision tasks, excelling at capturing long-range dependencies. Their potential applicability and scalability in the interpretation of high-resolution remote sensing images (HRRSIs) have thus garnered substantial interest. However, unlike natural images, HRRSIs present intricate scenes characterized by scale variations and diverse appearances. These challenges underscore the importance of enabling networks to effectively assimilate both local intricacies and global context. In this letter, we introduce LETFormer, a semantic segmentation transformer. LETFormer balances capturing long-range dependencies with preserving local details through its unique LETFormer block, featuring an anchor token. This token aggregates localized contextual information within a designated window and promotes meaningful interactions among anchor tokens. With a mask transformer decoder, LETFormer gains ample contextual cues for precise semantic mask prediction. Empirical findings based on evaluations using the ISPRS Potsdam and LoveDA benchmarks unequivocally establish LETFormer’s superiority over state-of-the-art models. Additionally, we analyze the parameter size and floating-point operations (FLOPs) of LETFormer.
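The anchor-token idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how anchors are computed or how they interact, so the mean pooling, the 1-D non-overlapping windows, the single attention head with identity query/key/value projections, and the function names `window_anchors` and `anchor_attention` are all assumptions made purely for illustration.

```python
import math

def window_anchors(tokens, window):
    """Mean-pool each non-overlapping window of tokens into one anchor.

    Assumption: mean pooling stands in for whatever aggregation the
    paper actually uses; windows here are 1-D for simplicity.
    """
    anchors = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        dim = len(chunk[0])
        anchors.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return anchors

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def anchor_attention(anchors):
    """Scaled dot-product self-attention among anchors.

    Assumption: one head, identity Q/K/V projections (no learned weights).
    """
    dim = len(anchors[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for q in anchors:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in anchors]
        weights = softmax(scores)
        mixed.append([sum(w * v[d] for w, v in zip(weights, anchors)) for d in range(dim)])
    return mixed

# Four 2-D tokens, grouped into two windows of two tokens each.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
anchors = window_anchors(tokens, window=2)  # local context per window
mixed = anchor_attention(anchors)           # global interaction among anchors
```

In this sketch each anchor summarizes only its own window, and attention over the much shorter anchor sequence is what lets information travel between windows. That is one plausible reading of how local detail and long-range context could be combined at reduced cost.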