Exploring Vision Transformers for Polarimetric SAR Image Classification

Hongwei Dong, Lamei Zhang, Bin Zou
DOI: https://doi.org/10.1109/tgrs.2021.3137383
IF: 8.2
2021-01-01
IEEE Transactions on Geoscience and Remote Sensing
Abstract: As one of the most popular topics in the polarimetric synthetic aperture radar (PolSAR) community, PolSAR image classification has always been important for PolSAR applications. Constructing representations is the most critical part of PolSAR image classification. With the maturation of deep learning techniques, many data-driven PolSAR representation methods have been proposed, most of which are based on convolutional neural networks (CNNs). Despite some achievements, the bottleneck of CNN-based methods may be related to the locality induced by their inductive biases. To address this problem, the transformer, the state-of-the-art method in natural language processing, is introduced into PolSAR image classification for the first time. Specifically, a vision transformer (ViT)-based representation learning framework is proposed in this article, which covers both supervised learning and unsupervised learning. For supervised learning, we use self-attention to replace convolution, which shifts the focus from the information in local neighborhoods to the long-range interactions between pixels. Beyond supervised learning, we introduce an improved contrastive-based strategy to implement simple unsupervised representation learning. Compared with CNNs and their variants, ViT constructs more global representations by explicitly modeling the relationships between pixels, thereby improving classification performance. Experimental results on four widely used PolSAR image datasets indicate that the representations obtained by the ViT-based methods are better for PolSAR image classification, whether supervised (up to about 5% accuracy improvement) or unsupervised (up to about 4%). In addition, we also demonstrate the robustness of ViT to the initial input form. These findings may prompt a rethinking of the dominance of CNNs in PolSAR image classification.
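The contrast the abstract draws between local convolution and long-range self-attention can be illustrated with a minimal sketch. It is not the authors' architecture: it assumes a single attention head, identity query/key/value projections, and a hypothetical 2x2 patch whose four pixels each carry three made-up feature values. The function names softmax and self_attention are illustrative only.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Single-head scaled dot-product self-attention.
    # Assumption: identity projections, so queries = keys = values = tokens.
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Each pixel is scored against every pixel, not just its neighbors.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted average of all pixel features.
        outputs.append([sum(w[j] * tokens[j][c] for j in range(len(tokens)))
                        for c in range(d)])
    return outputs, weights

# Hypothetical 2x2 patch flattened to 4 pixel tokens with 3 features each.
patch = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4], [0.0, 1.0, 0.2], [0.1, 0.9, 0.3]]
outputs, weights = self_attention(patch)
```

Every row of weights is strictly positive and sums to one, so each output pixel mixes information from all pixels in the patch, whereas a convolution would only mix pixels inside its kernel window.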