Abstract: Image captioning uses computers to interpret the information in photographs and generate written text. Using deep learning to interpret image content and produce descriptive text has become a widely researched topic. Nevertheless, existing strategies do not identify all samples that depict conceptual ideas; in practice, the vast majority of instances are irrelevant to the matching task, and the degree of similarity is determined by only a few relevant semantic instances. The redundant instances can be regarded as noise: they obstruct the matching of the few meaningful instances and add to the model's computational cost. Existing schemes rely on traditional convolutional neural networks (CNNs), whose structure limits captioning effectiveness. Furthermore, present approaches frequently require additional target recognition algorithms or costly human labeling to extract information. This research presents a multimodal feature fusion-based deep learning model for image captioning. The encoding layer uses mask recurrent neural networks (Faster RCNN), a long short-term memory (LSTM) network performs the decoding, and the descriptive text is then constructed. The model parameters are optimized by gradient-based optimization. In the decoding layer, dense attention mechanisms help minimize interference from non-salient data and preferentially feed the relevant data to the decoding stage. The model is trained on input images to produce captions that closely describe them. Various datasets are used to evaluate the model's precision and the fluency of the language it learns by analyzing picture descriptions. Results from these tests indicate that the model consistently provides correct descriptions of input images.
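The role of attention in the decoding layer can be illustrated with a minimal sketch. It assumes plain dot-product soft attention over a handful of region feature vectors, which is a generic stand-in and not necessarily the dense attention mechanism used in this work; the names `attend`, `region_features`, and `query` are hypothetical, and the query is assumed to be a decoder state already projected to the feature dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, query):
    """Weight region features by their dot-product relevance to the query.

    region_features: list of equal-length feature vectors (one per image region).
    query: decoder state, assumed to have the same length as each feature vector.
    Returns (context_vector, attention_weights).
    """
    # One relevance score per region.
    scores = [sum(f * q for f, q in zip(region, query)) for region in region_features]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the region features.
    dim = len(query)
    context = [
        sum(w * region[i] for w, region in zip(weights, region_features))
        for i in range(dim)
    ]
    return context, weights


# Toy input: three regions with two-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attend(regions, query=[2.0, 0.0])
```

Regions whose features align with the query receive larger weights, so less salient regions contribute less to the context vector passed to the decoder at each step.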
The model is trained to provide captions or words describing an input picture. To measure its effectiveness, the system is evaluated with classification scores. With a batch size of 512 and 100 training epochs, the proposed system shows a 95% increase in performance. The experimental data validate the model's capacity to comprehend images and generate text in the domain of generic images. The work is implemented using Python frameworks and evaluated using performance metrics such as PSNR, RMSE, SSIM, accuracy, precision, recall, and F1-score.
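For reference, several of the listed metrics can be computed as in the following sketch. It uses the standard textbook definitions of precision, recall, F1-score, RMSE, and PSNR; how the labels and signals are derived from captions is not specified here, so the binary labels and pixel lists below are purely illustrative, and SSIM is omitted because it needs windowed image statistics.

```python
import math


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rmse(reference, estimate):
    """Root mean squared error between two equal-length sequences."""
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the inputs are identical."""
    err = rmse(reference, estimate)
    if err == 0:
        return math.inf
    return 20 * math.log10(max_value / err)


# Illustrative values only, not results from the paper.
p, r, f1 = precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
error = rmse([0, 0], [3, 4])
```

Precision, recall, and F1 summarize classification quality, while RMSE and PSNR measure signal-level differences, so the two groups answer different questions about a model's output.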

Image2Text: A multimodal caption generator

MAT: A Multimodal Attentive Translator for Image Captioning

Mmt: A Multimodal Translator For Image Captioning

A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation

Intelligent image captioning

From Captions to Visual Concepts and Back

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

Image Captioning with Object Detection and Localization.

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

Automatic Caption Generation for News Images

Text-to-image Generation Based on Spatial-Channel Attention and Semantic Redescription

Image Caption Generator Using Deep Learning

LCM-Captioner: A lightweight text-based image captioning method with collaborative mechanism between vision and text

Caption Anything: Interactive Image Description with Diverse Multimodal Controls

Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models

Enhanced Image Captioning with Color Recognition Using Deep Learning Methods

Accurate and Complete Captions for Question-controlled Text-aware Image Captioning

An Ensemble of Generation- and Retrieval-Based Image Captioning With Dual Generator Generative Adversarial Network

A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models

Image Captioning with Multi-Context Synthetic Data

Visuals to Text: A Comprehensive Review on Automatic Image Captioning