A Joint-Training Two-Stage Method for Remote Sensing Image Captioning

Xiutiao Ye, Shuang Wang, Yu Gu, Jihui Wang, Ruixuan Wang, Biao Hou, Fausto Giunchiglia, Licheng Jiao
DOI: https://doi.org/10.1109/TGRS.2022.3224244
2022-01-01
Abstract: Compared with remote sensing image (RSI) captioning methods based on the traditional encoder-decoder model, two-stage RSI captioning methods include an auxiliary remote sensing task to provide prior information, which enables them to generate more accurate descriptions. In previous two-stage RSI captioning methods, however, the image captioning and the auxiliary remote sensing tasks are handled separately, which is time-consuming and ignores mutual interference between tasks. To solve this problem, we propose a novel joint-training two-stage (JTTS) RSI captioning method. We use multilabel classification to provide prior information, and we design a differentiable sampling operator to replace the traditional nondifferentiable sampling operation to index the multilabel classification result. In contrast to previous two-stage RSI captioning methods, our method can implement joint training, and the joint loss allows the error of the generated description to flow into the optimization of the multilabel classification via backpropagation. Specifically, we approximate the Heaviside step function with the steep logistic function to implement a differentiable sampling operator for the multilabel classification. We propose a dynamic contrast loss function for multilabel classification tasks to ensure that a certain margin is maintained between the probabilities of the positive label and the negative label during sampling. We design an attribute-guided decoder to filter the multilabel prior information obtained by the sampling operator to generate more accurate image captions. The results of extensive experiments show that the JTTS method achieves state-of-the-art performance on the RSI captioning dataset (RSICD), the University of California, Merced (UCM)-captions, and the Sydney-captions datasets.
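The idea of replacing a hard threshold with a steep logistic function can be illustrated with a small sketch. This is not the authors' implementation: the threshold of 0.5, the steepness k = 50, and the function names (`heaviside`, `steep_logistic`, `soft_sample`) are all assumptions chosen for illustration. The sketch only shows why the smooth surrogate stays close to the hard step while still having a nonzero gradient, which is what lets a captioning loss propagate back into the classifier.

```python
import math

def heaviside(x):
    """Hard step: 1.0 if x > 0, else 0.0. Its gradient is zero almost everywhere."""
    return 1.0 if x > 0 else 0.0

def steep_logistic(x, k=50.0):
    """Smooth surrogate for the step: 1 / (1 + exp(-k * x)).

    k is a hypothetical steepness; larger k hugs the step more tightly.
    """
    z = k * x
    # Numerically stable evaluation to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def steep_logistic_grad(x, k=50.0):
    """Analytic derivative of the surrogate: k * s * (1 - s)."""
    s = steep_logistic(x, k)
    return k * s * (1.0 - s)

def soft_sample(probs, threshold=0.5, k=50.0):
    """Soft label selection: values near 1 for p > threshold, near 0 otherwise.

    The threshold of 0.5 is an assumption, not a value taken from the paper.
    """
    return [steep_logistic(p - threshold, k) for p in probs]

# Example multilabel probabilities (made-up values for illustration).
probs = [0.9, 0.55, 0.45, 0.1]
hard = [heaviside(p - 0.5) for p in probs]
soft = soft_sample(probs)
grads = [steep_logistic_grad(p - 0.5) for p in probs]
```

Here `hard` selects the first two labels, `soft` gives nearly the same selection, and every entry of `grads` is positive, whereas the derivative of the hard step is zero away from the threshold. Probabilities close to the threshold get the least decisive soft values, which is consistent with the abstract's motivation for keeping a margin between positive and negative label probabilities.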