Neuraltalk+: neural image captioning with visual assistance capabilities
Himanshu Sharma, Devanand Padha
DOI: https://doi.org/10.1007/s11042-024-19259-9
IF: 2.577
2024-04-24
Multimedia Tools and Applications
Abstract: Image captioning is a technique that generates concise and meaningful descriptions of the visual contents present in an image. Image captioning frameworks generally employ an encoder-decoder-based pipeline to generate image descriptions. Multimodal meaning space, visual and semantic fusion, and influential recurrent decoding are some of the highlights of these frameworks. However, the lack of cutting-edge implementation schemes, such as ensemble feature extraction, context-aware fusion, and real-time captioning, limits their integration in the vision assistance domain. In this research work, we introduce Neuraltalk+, which comprises various structural and functional enhancements and feature-based extensions, making it lightweight, robust, effective, and automated. Neuraltalk+ uses ensemble feature extraction to extract visual and spatial image features for efficient image comprehension. We then map these feature vectors with multimodal semantic knowledge using dual context-aware feature fusion followed by self-attention-assisted decoding. Lastly, we introduce two new features, real-time captioning and visual similarity comparison, which enable vision assistance and sight comprehension capabilities. Experimental analysis on the Flickr 8K and Flickr 30K datasets demonstrates that our model trains faster and generates improved quantitative (BLEU (72.08), METEOR (33.65), and CIDEr (143.5)) and qualitative results. Neuraltalk+ also demonstrates high performance in real-time captioning for both familiar and unfamiliar contexts. We also offer potential suggestions for extending our work in the future.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering