Abstract: This article presents a comprehensive study of automatic image captioning and introduces a deep neural network‐based encoder–decoder framework. The proposed model uses a convolutional neural network (CNN) as the encoder, which recognizes objects and retains spatial information, and a long short‐term memory (LSTM) network as the decoder, which predicts words and constructs sentences. Specifically, the VGG‐19 model serves as the image feature extractor, and the LSTM network acts as a sequence processor that produces a fixed‐length output vector for the final predictions. Diverse images from open‐access datasets, including Flickr8k, Flickr30k, and MS COCO, are used for both training and testing. The implementation is in Python, using Keras with TensorFlow as the backend. Experimental results, assessed with the bilingual evaluation understudy (BLEU) metric, demonstrate the effectiveness of the proposed method for automatic image captioning. By addressing spatial relationships in images and producing coherent, contextually relevant captions, the paper advances image captioning technology. The difficulties encountered during experimentation are discussed and suggest directions for future research. By establishing a robust neural network architecture for automatic image captioning, this study creates opportunities for further advancement in the area.
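To make the evaluation metric named in the abstract concrete, the following is a minimal pure-Python sketch of sentence-level BLEU: clipped n-gram precision combined by a geometric mean and scaled by a brevity penalty. It is an illustration only, not the authors' evaluation code. The function names `ngrams` and `sentence_bleu`, the `max_n` parameter, the absence of smoothing, and the example captions are all assumptions, since the abstract does not state the n-gram order or smoothing used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU for one candidate caption.

    candidate: list of tokens; references: list of token lists.
    max_n is an assumption; the abstract does not state the order used.
    """
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU = 0
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty uses the reference length closest to the candidate length.
    c_len = len(candidate)
    r_len = min((abs(len(r) - c_len), len(r)) for r in references)[1]
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(log_precision_sum)


# Hypothetical captions, used only to exercise the function.
reference = "a dog runs across the grass".split()
candidate = "a dog runs on the grass".split()
score = sentence_bleu(candidate, [reference], max_n=2)
print(round(score, 4))  # 0.7071
```

In this example the unigram precision is 5/6 and the bigram precision is 3/5, so the score is their geometric mean, sqrt(0.5), with no brevity penalty because both captions have six tokens. Results reported on Flickr8k, Flickr30k, or MS COCO are typically corpus-level scores over several reference captions per image, which this sentence-level sketch does not reproduce.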
