An efficient automated image caption generation by the encoder decoder model
Khustar Ansari, Priyanka Srivastava
DOI: https://doi.org/10.1007/s11042-024-18150-x
IF: 2.577
2024-01-23
Multimedia Tools and Applications
Abstract: Image caption generation has become an active research topic that attracts many researchers. It is a complex task because it draws on both natural language processing (NLP) and computer vision to generate descriptions. A range of image captioning strategies connect visual content with everyday language by describing images in text. Pre-trained classification networks such as CNNs and RNN-based neural network models are used in the literature to encode visual data. Although many studies have examined image captioning techniques, they still fall short of delivering strong performance across diverse datasets. To overcome these issues, this work presents an automated, optimization-based deep learning model for image caption generation. The input image is first pre-processed, and an encoder-decoder structure is then used to extract visual features and generate the caption. On the encoder side, a pre-trained ResNet-101 (residual network) extracts the visual features; on the decoder side, an SA-BiLSTM (self-attention with bi-directional Long Short-Term Memory) generates the caption. In addition, the Chimp algorithm (CA), an optimization method, is used to improve performance in caption generation. The proposed encoder-decoder model is tested on the benchmark datasets Flickr8k, Flickr30k and COCO. The model attained improved BLEU and RIBES scores of 0.8595 and 0.3531, respectively, on the Flickr8k dataset. Thus, the proposed SA-BiLSTM model achieved strong performance in image caption generation.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
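
The abstract reports a BLEU score as its main evaluation figure. As a rough illustration of what that metric measures, the following is a minimal sketch of sentence-level BLEU (clipped n-gram precision combined with a brevity penalty) in plain Python. It is not the authors' evaluation code: the function names `ngrams`, `modified_precision` and `bleu` are hypothetical, no smoothing is applied, and published scores are normally computed at corpus level with an established toolkit, so numbers from this sketch are not directly comparable to those in the abstract.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    c = len(candidate)
    # Reference length closest to the candidate length (ties: the shorter one).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


if __name__ == "__main__":
    reference = "a dog runs on the grass".split()
    candidate = "a dog runs on".split()
    print(round(bleu(candidate, [reference]), 4))
```

In the usage example, every n-gram of the shortened candidate also occurs in the reference, so all precisions equal 1 and the score is reduced only by the brevity penalty for the missing words.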