VSAM-Based Visual Keyword Generation for Image Caption

Suya Zhang, Yana Zhang, Zeyu Chen, Zhaohui Li
DOI: https://doi.org/10.1109/ACCESS.2021.3058425
IF: 3.9
2021-01-01
IEEE Access
Abstract: Image captioning aims to understand and describe visual content, and is expected to find application in automatic news reporting. In recent years, there has been increasing interest in the Encoder-Decoder framework for image captioning: the encoder is responsible for visual semantic comprehension and the decoder for sentence generation. In this framework, translation relies on the correspondence between image feature vectors and caption vectors, and an attention mechanism makes that correspondence more accurate. However, the attention model works with the decoder, so the focused content changes dynamically with each generated word. As a result, in many cases the salient content is not described in the caption, or the objects described are not the salient ones. To improve the precision of image captioning, to bridge the gap between image understanding and sentence generation in the Encoder-Decoder framework, and to better align visual and semantic information, we propose the concept of a visual keyword as a gangplank between seeing and saying. This paper presents an image dataset derived from MSCOCO as the first collection of visual keywords: the Image Visual Keyword Dataset (IVKD). A Visual Semantic Attention Model (VSAM) is also proposed to obtain visual keywords for generating the annotation. In VSAM, object-level visual features are extracted by an object detector pre-trained on IVKD. The object features are then fed into an Optimized Pointer Network (OPN) to generate visual keywords. Experiments show that the proposed VSAM reaches a precision of 91.7% in visual keyword generation.
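The step from object features to visual keywords can be illustrated with a generic pointer-network selection loop. This is a minimal sketch only: it uses the standard additive pointer attention, not the paper's Optimized Pointer Network, whose details the abstract does not give. The function names, the toy labels and features, the identity weight matrices, and the greedy feedback of the chosen feature as the next state are all assumptions made for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax; -inf scores receive probability 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_step(features, state, w_enc, w_dec, v, selected):
    """One pointer-attention step: a distribution over the input objects.

    score_i = v . tanh(w_enc @ feature_i + w_dec @ state)
    Objects already selected are masked out with -inf.
    """
    dec_part = matvec(w_dec, state)
    scores = []
    for i, feat in enumerate(features):
        if i in selected:
            scores.append(float("-inf"))
            continue
        enc_part = matvec(w_enc, feat)
        hidden = [math.tanh(a + b) for a, b in zip(enc_part, dec_part)]
        scores.append(sum(vi * h for vi, h in zip(v, hidden)))
    return softmax(scores)


def select_keywords(labels, features, init_state, w_enc, w_dec, v, k):
    """Greedily point at up to k distinct objects and return their labels."""
    selected, state = [], init_state
    for _ in range(min(k, len(features))):
        probs = pointer_step(features, state, w_enc, w_dec, v, set(selected))
        best = max(range(len(probs)), key=probs.__getitem__)
        selected.append(best)
        state = features[best]  # assumption: chosen feature becomes next state
    return [labels[i] for i in selected]


# Hypothetical toy inputs: three detected objects with 2-D features.
labels = ["person", "dog", "frisbee"]
features = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weights
v = [1.0, 1.0]

keywords = select_keywords(labels, features, [0.0, 0.0], identity, identity, v, k=2)
```

Because the output at each step is a distribution over the detected objects themselves, the selected keywords are always drawn from what the detector found in the image, which is the property that makes a pointer-style decoder a natural fit for picking salient objects.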