Advancements in Deep Learning-Based Image Captioning

Siqi Wang, Jihong Zhuang
DOI: https://doi.org/10.62051/d1jtjx50
2024-08-12
Abstract: At the confluence of natural language processing and machine vision, the field of image captioning has grown rapidly since IBM introduced the BLEU evaluation algorithm in 2002. Image captioning bridges the "semantic gap" between human and machine perception by translating visual information into natural-language descriptions. The technology is widely applied in human-computer interaction, video subtitling, quiz generation, and image-based search. This paper analyzes two primary methodologies in image captioning: template-based approaches and encoder-decoder-based structures. Template-based approaches rely on pre-set templates, which ensure syntactic accuracy but limit flexibility in caption generation. Innovations within this methodology, including paraphrase back-translation and the integration of psycholinguistics, have improved caption diversity and descriptiveness. The encoder-decoder framework, particularly the CNN-RNN model, instead uses deep neural networks to learn directly from image-caption pairs, making it a more dynamic and adaptable approach to caption generation. Within this framework, combining Convolutional Neural Networks (CNN) with Long Short-Term Memory (LSTM) networks has notably improved the descriptive quality of captions and allows complex image contexts to be handled effectively.
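To make the trade-off of the template-based approach concrete, the following is a minimal sketch, not the authors' method. It assumes a hypothetical detector has already produced an attribute, object, action, and scene for an image. The names `TEMPLATE`, `Detection`, and `template_caption` are illustrative assumptions. The output is always grammatical because the sentence frame is fixed, but every caption shares the same structure, which is the flexibility limit described in the abstract.

```python
from dataclasses import dataclass

# Hypothetical pre-set template; the slot names are illustrative assumptions.
TEMPLATE = "A {attribute} {obj} is {action} {scene}."


@dataclass
class Detection:
    """Stand-in for the output of a visual detector (assumed, not from the paper)."""
    attribute: str
    obj: str
    action: str
    scene: str


def template_caption(d: Detection) -> str:
    """Fill the fixed sentence frame with the detected words."""
    return TEMPLATE.format(
        attribute=d.attribute, obj=d.obj, action=d.action, scene=d.scene
    )


caption = template_caption(Detection("brown", "dog", "running", "on the beach"))
print(caption)  # prints "A brown dog is running on the beach."
```

An encoder-decoder model would replace the fixed frame with a learned decoder that generates words one at a time from image features, so sentence structure can vary with the image content.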