Advancing image captioning with V16HP1365 encoder and dual self-attention network

Tarun Jaiswal, Manju Pandey, Priyanka Tripathi
DOI: https://doi.org/10.1007/s11042-024-18467-7
IF: 2.577
2024-03-08
Multimedia Tools and Applications
Abstract: Image captioning generates a textual description of an input image by combining computer vision and natural language processing. In recent years, deep learning approaches have shown promise for this task. This research introduces a novel image captioning architecture built on an encoder-decoder framework fused with dual self-attention. The VGG16 Hybrid Places 1365 (V16HP1365) encoder captures diverse visual features from images, enhancing the quality of the image representations. A Gated Recurrent Unit (GRU) serves as the decoder and performs word-level language modeling. The dual self-attention network embedded in the architecture captures contextual image information, supporting accurate content descriptions and an understanding of relationships. Experimental evaluations on the COCO dataset show performance that surpasses existing methods on captioning quality metrics. The approach holds potential for applications such as aiding the visually impaired and advancing content retrieval. Future work aims to extend the model to support multilingual captioning.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
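
The data flow described in the abstract (encoder features, dual self-attention, GRU decoder) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not say what the two attention branches are, so the sketch assumes one branch attends over spatial positions and the other over feature channels, fused by element-wise addition. The attention uses identity projections (no learned query/key/value weights), the feature grid stands in for V16HP1365 encoder output, and every name (`dual_self_attention`, `gru_step`, `params`) is hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (an assumption)."""
    d = len(x[0])
    scores = matmul(x, transpose(x))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, x), weights


def dual_self_attention(feats):
    """Assumed 'dual' arrangement: spatial branch plus channel branch, summed."""
    spatial, _ = self_attention(feats)               # attends over positions
    channel_t, _ = self_attention(transpose(feats))  # attends over channels
    channel = transpose(channel_t)
    return [[s + c for s, c in zip(rs, rc)] for rs, rc in zip(spatial, channel)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def gru_step(x, h, params):
    """One GRU update: h_new = (1 - z) * h + z * candidate."""
    def affine(w, u, state):
        wx = matmul([x], w)[0]
        us = matmul([state], u)[0]
        return [a + b for a, b in zip(wx, us)]

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    z = [sigmoid(v) for v in affine(params["Wz"], params["Uz"], h)]  # update gate
    r = [sigmoid(v) for v in affine(params["Wr"], params["Ur"], h)]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in affine(params["Wh"], params["Uh"], gated)]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


rng = random.Random(0)
num_positions, num_channels, hidden = 4, 6, 5

# Random stand-in for the encoder's feature grid (positions x channels).
features = rand_matrix(num_positions, num_channels, rng)
attended = dual_self_attention(features)

# Mean-pool the attended features into one context vector for the decoder.
context = [sum(col) / num_positions for col in zip(*attended)]

params = {
    "Wz": rand_matrix(num_channels, hidden, rng), "Uz": rand_matrix(hidden, hidden, rng),
    "Wr": rand_matrix(num_channels, hidden, rng), "Ur": rand_matrix(hidden, hidden, rng),
    "Wh": rand_matrix(num_channels, hidden, rng), "Uh": rand_matrix(hidden, hidden, rng),
}
h0 = [0.0] * hidden
h1 = gru_step(context, h0, params)  # decoder state after one step
```

In a full captioner, the hidden state `h1` would be projected onto the vocabulary to score the next word, and the step would repeat with the embedding of the chosen word as additional input. Those parts are omitted here because the abstract gives no details about them.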