Controllable image caption with an encoder-decoder optimization structure

Jie Shao, Runxia Yang
DOI: https://doi.org/10.1007/s10489-021-02988-x
IF: 5.3
2022-01-25
Applied Intelligence
Abstract: Controllable image captioning, which lies at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), is an important part of applying artificial intelligence to many real-life scenarios. We adopt an encoder-decoder structure in which a visual model serves as the encoder and a language model serves as the decoder. In this work, we introduce a new feature extraction model, FVC R-CNN, to learn both salient features and visual commonsense features. Furthermore, we propose a novel MT-LSTM neural network for sentence generation, which is activated by m-tanh and outperforms the traditional Long Short-Term Memory (LSTM) network by a significant margin. Finally, we put forward a multi-branch decision strategy to optimize the output. Experiments on the widely used COCO Entities dataset demonstrate that the proposed method outperforms the baseline and surpasses state-of-the-art methods under a wide range of evaluation metrics. In particular, it achieves CIDEr and SPICE scores of 206.3 and 47.6, respectively, yielding state-of-the-art (SOTA) performance.
computer science, artificial intelligence