Image Caption Method from Coarse to Fine Based on Dual Encoder-Decoder Framework

Zefeng Li, Yuehu Liu, Yonghong Song
DOI: https://doi.org/10.1109/ijcnn60899.2024.10650584
2024-01-01
Abstract: Encoders are widely used in image captioning, but the captions generated by current methods may miss target objects and do not describe the image content accurately enough. To address these problems, we propose a coarse-to-fine image captioning method based on a dual encoder-decoder framework, which provides a mechanism for discovering and correcting omissions and enables the model to generate a complete image description. First, we design an image feature extractor that captures both the global and the local information of an image, yielding a richer image representation. Second, we design a dual encoder-decoder framework consisting of a coarse-grained encoder-decoder and a fine-grained encoder-decoder. The coarse-grained encoder-decoder takes only the original image features as input and processes them with a transformer to produce a coarse text description. In addition, we propose an image feature auto-enhancement module that detects objects missing from the coarse text and enhances their feature representation. Finally, the fine-grained encoder-decoder takes both the image features and the coarse caption as input and generates the final fine-grained caption after multi-modal information fusion. Experimental results on the MSCOCO dataset show that the proposed method outperforms previous image captioning methods, achieving a BLEU-4 score of 39.7 and a CIDEr-A score of 121.6.
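The data flow described in the abstract (coarse caption, detection of missing objects, feature enhancement, fine caption) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how missing objects are detected or how features are enhanced, so the word-matching rule, the `gain` scaling, the label-keyed feature dictionary, and all function names below are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical representation: region label -> feature vector.
Features = Dict[str, List[float]]


def find_missing_objects(labels: Sequence[str], coarse_caption: str) -> List[str]:
    """Return region labels that do not appear as words in the coarse caption.

    Assumption: simple word matching stands in for the paper's detection step.
    """
    words = set(coarse_caption.lower().split())
    return [label for label in labels if label.lower() not in words]


def enhance_features(
    features: Features, missing: Sequence[str], gain: float = 2.0
) -> Features:
    """Scale the feature vectors of missing objects by `gain`.

    Assumption: uniform scaling stands in for the auto-enhancement module.
    """
    missing_set = set(missing)
    return {
        label: [x * gain for x in vec] if label in missing_set else list(vec)
        for label, vec in features.items()
    }


def coarse_to_fine_caption(
    features: Features,
    coarse_model: Callable[[Features], str],
    fine_model: Callable[[Features, str], str],
    gain: float = 2.0,
) -> str:
    """Run the two stages: coarse caption, enhancement, then fine caption."""
    coarse = coarse_model(features)                # image features only
    missing = find_missing_objects(list(features), coarse)
    enhanced = enhance_features(features, missing, gain)
    return fine_model(enhanced, coarse)            # features plus coarse text
```

In this sketch `coarse_model` and `fine_model` are placeholders for the two encoder-decoders; any callables with the same signatures can be plugged in to trace how a coarse caption that omits an object leads to boosted features for that object in the second stage.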