DSViT: Dynamically Scalable Vision Transformer for Remote Sensing Image Segmentation and Classification.

Falin Wang, Jian Ji, Yuan Wang
DOI: https://doi.org/10.1109/jstars.2023.3285259
IF: 4.715
2023-01-01
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Abstract: The relationship between foreground targets and the background in remote sensing images is highly complex, and vision tasks on such images must cope with complex targets and imbalanced categories. These problems leave room for improvement in existing modeling methods. This article therefore proposes a dynamically scalable attention model that combines convolutional features and Transformer features. The model dynamically selects its depth according to the size of the input image, which alleviates both the insufficient global information extraction of convolution-only models and the computational overhead of pure Transformer models. We validated the model on two public remote sensing image classification datasets and two remote sensing image segmentation datasets. On the University of California (UC) Merced classification dataset, the accuracy and mean pixel accuracy (mPA) of the proposed method reached 96.16% and 93.44%, respectively. Compared with recent work, this is a net improvement of 5.0% and 4.82% over the pyramid vision transformer (PVT) model. On the Potsdam segmentation dataset, the accuracy and F1 of the transformer and CNN hybrid neural network (TCHNN) model are 91.5% and 92.86%, respectively, and the proposed method improves on these by 0.64% and 1.0%. The method also achieves the best results on the other two datasets.
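The abstract reports accuracy, mean pixel accuracy (mPA), and F1 without defining them. As a rough illustration only, the sketch below computes these metrics from a confusion matrix using their common textbook definitions. The function names are hypothetical, and the choice of macro-averaged F1 is an assumption; the paper's exact evaluation protocol is not stated in the abstract and may differ.

```python
# Illustrative sketch, not the authors' evaluation code.
# Labels are assumed to be integers in [0, num_classes).

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts samples (or pixels) with true class t predicted as p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    """Fraction of all samples that were classified correctly."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def mean_pixel_accuracy(cm):
    """Average of per-class recall, skipping classes absent from the ground truth."""
    per_class = [cm[i][i] / sum(row) for i, row in enumerate(cm) if sum(row)]
    return sum(per_class) / len(per_class) if per_class else 0.0


def macro_f1(cm):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    k = len(cm)
    scores = []
    for i in range(k):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(k)) - tp
        fn = sum(cm[i]) - tp
        denom = 2 * tp + fp + fn
        if denom:
            scores.append(2 * tp / denom)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy labels, not data from the paper.
    cm = confusion_matrix([0, 0, 0, 1], [0, 0, 1, 1], num_classes=2)
    print(overall_accuracy(cm), mean_pixel_accuracy(cm), macro_f1(cm))
```

Because mPA averages per-class accuracy rather than pooling all pixels, it is less dominated by large classes than overall accuracy, which matters for the imbalanced categories the abstract mentions.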
What problem does this paper attempt to address?