Jointing Cross-Modality Retrieval to Reweight Attributes for Image Caption Generation

Yuxuan Ding, Wei Wang, Mengmeng Jiang, Heng Liu, Donghu Deng, Wei Wei, Chunna Tian
DOI: https://doi.org/10.1007/978-3-030-31726-3_6
2019-01-01
Abstract: Automatic natural language description of images is one of the key problems in image understanding. In this paper, we propose an image captioning framework that combines specific semantics with general semantics. For the specific semantics, we retrieve captions for the given image in a visual-semantic embedding space. For the general semantics, we first extract the common attributes of the image with Multiple Instance Learning (MIL) detectors. We then use the specific semantics to re-rank the semantic attributes extracted by MIL, and map the re-ranked attributes into the visual feature layer of a CNN to extract a joint visual feature. Finally, we feed this visual feature to an LSTM and generate the image caption under the guidance of BLEU_4 similarity, which incorporates the sentence-making priors of the reference captions. We evaluate our algorithm on the standard metrics BLEU, CIDEr, ROUGE_L, and METEOR. Experimental results show that our approach outperforms state-of-the-art methods.
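The retrieve-then-re-rank idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the re-weighting formula, so the linear blend below, the `alpha` parameter, the function names, and the toy 2-D embeddings are all assumptions made for illustration. Captions are retrieved by cosine similarity in a shared embedding space, and each attribute probability (standing in for MIL detector output) is then blended with how often that attribute word appears in the retrieved captions.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_captions(image_vec, caption_bank, k=2):
    """Return the k captions whose embeddings are closest to the image.

    caption_bank is a list of (caption_text, embedding) pairs that are
    assumed to live in the same embedding space as image_vec.
    """
    ranked = sorted(caption_bank,
                    key=lambda item: cosine(image_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def reweight_attributes(attr_probs, retrieved, alpha=0.5):
    """Blend detector probabilities with retrieval evidence.

    Hypothetical rule (an assumption, not taken from the paper):
    new_score = (1 - alpha) * p + alpha * (fraction of retrieved
    captions that mention the attribute word).
    """
    counts = Counter(word
                     for caption in retrieved
                     for word in set(caption.lower().split()))
    n = max(len(retrieved), 1)
    return {attr: (1 - alpha) * p + alpha * counts[attr] / n
            for attr, p in attr_probs.items()}


# Toy data: every value below is made up for illustration.
image_vec = [1.0, 0.0]
caption_bank = [
    ("a dog runs on grass", [0.9, 0.1]),
    ("a dog catches a frisbee", [0.8, 0.2]),
    ("a plate of food", [0.0, 1.0]),
]
attr_probs = {"dog": 0.6, "cat": 0.7, "grass": 0.3}

retrieved = retrieve_captions(image_vec, caption_bank, k=2)
reweighted = reweight_attributes(attr_probs, retrieved, alpha=0.5)
ranking = sorted(reweighted, key=reweighted.get, reverse=True)
print(ranking)
```

In this toy run the detector alone would rank "cat" first, but neither retrieved caption mentions it, so the blended score moves "dog" and "grass" ahead of it. That is the intended effect of using image-specific retrieval evidence to correct generic attribute detections.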