Attention Based Sequence-to-sequence Framework for Auto Image Caption Generation
Rashid Khan, M. Shujah Islam, Khadija Kanwal, Mansoor Iqbal, Md Imran Hossain, Zhongfu Ye
DOI: https://doi.org/10.3233/jifs-211907
2022-01-01
Journal of Intelligent & Fuzzy Systems
Abstract: Caption generation using an encoder-decoder approach has recently been studied extensively and applied in various domains, including image captioning and code captioning. In this article, we propose an approach to the caption generation task that uses an attention-based sequence-to-sequence framework, combining an attention mechanism with a conventional encoder-decoder model. The encoder is ResNet-152, a Convolutional Neural Network (CNN) that produces a comprehensive representation of the input image and embeds it into a fixed-length vector. To predict the next word of the caption, the decoder uses an LSTM, a type of Recurrent Neural Network (RNN), together with an attention mechanism that selectively focuses on certain regions of the image. We set the number of training epochs to 69, which is enough for the model to generate informative descriptions; by that point the validation loss has reached its minimum and no longer decreases. We present the datasets, the evaluation metrics, and both quantitative and qualitative analyses. Experiments on the MSCOCO and Flickr8k benchmark datasets illustrate the model’s efficacy in comparison with baseline techniques. On MSCOCO, the model scores BLEU-1 0.81, BLEU-2 0.61, BLEU-3 0.47, and METEOR 0.33. On Flickr8k, it scores BLEU-1 0.68, BLEU-2 0.49, BLEU-3 0.41, METEOR 0.23, and SPICE 0.86. The proposed approach is comparable with several state-of-the-art methods on standard evaluation metrics, demonstrating that it can produce more accurate and richer captions.
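
The attention step described in the abstract can be illustrated with a minimal sketch of one soft-attention computation. This is not the authors' implementation: the abstract does not state the scoring function, so dot-product scoring is assumed here, and the names `attend`, `region_features`, and `hidden_state` are hypothetical. The toy two-dimensional vectors stand in for CNN region features and the LSTM hidden state.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, hidden_state):
    """One soft-attention step over image regions.

    region_features: list of per-region feature vectors (hypothetical
        stand-in for the encoder output).
    hidden_state: current decoder state, same dimension as each region.
    Returns (weights, context), where context is the weighted sum of regions.
    """
    # Assumption: dot-product scoring; the paper may use a different function.
    scores = [
        sum(f * h for f, h in zip(region, hidden_state))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(hidden_state)
    context = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three image regions, two feature dimensions.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]
weights, context = attend(regions, hidden)
print(weights, context)
```

In a full decoder, the context vector would typically be combined with the previous word embedding as input to the LSTM at each time step, so that each predicted word can draw on a different part of the image.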