Image Captioning with a Constraint of Image-to-Text Transformation

Jing Wen, Han Liu, Xiaolin Hu, Jianmin Li, Zhaoxiang Zhang
2020-01-01
Abstract: The encoder-decoder framework is widely adopted in image captioning: the encoder extracts image features, and the decoder takes those features and generates captions. However, this framework does little to reduce the gap between image and text representations, which leads to poor generation results. One solution is to embed the two modalities in the same space, so that the representation of an image region (e.g., a person) is close to the representation of the corresponding word (e.g., “person”) in that space. To achieve this, we propose adding a constraint to the encoder-decoder framework so that the image features can be transformed into the text embedding space and represent the captions. By minimizing an auxiliary loss function that encourages the transformed image representation to be close to the caption representation, we explicitly bridge the gap between the two modalities. The decoder learns this image-to-text transformation and generates better captions for given images. Experiments on the MSCOCO captioning dataset demonstrate the effectiveness of the proposed method.
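To make the idea of the auxiliary constraint concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the form of the transformation, the pooling, or the distance, so everything here is an assumption for illustration: a linear projection (`weights`) maps each image region feature into the text embedding space, regions and caption words are each mean-pooled, the auxiliary loss is a squared Euclidean distance, and `aux_weight` is a hypothetical trade-off coefficient added to the usual captioning loss.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty list of equal-length vectors into one vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def project(weights: Sequence[Sequence[float]], feature: Sequence[float]) -> Vector:
    """Apply a linear map (one row per output dimension) to an image feature.

    Assumption: the image-to-text transformation is modeled as a plain linear map.
    """
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def auxiliary_loss(
    region_features: Sequence[Sequence[float]],
    caption_word_embeddings: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
) -> float:
    """Squared distance between the pooled, projected image regions and the pooled caption.

    Assumption: both modalities are summarized by mean pooling.
    """
    image_in_text_space = mean_pool([project(weights, f) for f in region_features])
    caption_repr = mean_pool(caption_word_embeddings)
    return sum((a - b) ** 2 for a, b in zip(image_in_text_space, caption_repr))


def total_loss(caption_loss: float, aux: float, aux_weight: float = 1.0) -> float:
    """Combine the captioning loss with the weighted auxiliary constraint."""
    return caption_loss + aux_weight * aux


# Toy example: two 3-d region features projected into a 2-d text embedding space.
regions = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
matching_caption = [[2.0, 1.0], [2.0, 1.0]]
mismatched_caption = [[0.0, 0.0], [2.0, 0.0]]

aux_match = auxiliary_loss(regions, matching_caption, weights)
aux_mismatch = auxiliary_loss(regions, mismatched_caption, weights)
```

In this toy setting the auxiliary term is zero when the pooled, projected image representation coincides with the pooled caption representation and grows as the two drift apart. In a trained model the projection would be learned jointly with the encoder and decoder rather than fixed by hand.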