TransEffiVisNet – an image captioning architecture for auditory assistance for the visually impaired

Harshitha R, Lakshmipriya B, Vallidevi Krishnamurthy
DOI: https://doi.org/10.1007/s11042-024-20036-x
IF: 2.577
2024-08-26
Multimedia Tools and Applications
Abstract: Insights derived from image captioning systems have potential real-world applications, including auditory assistance for the visually impaired. This paper proposes TransEffiVisNet, a novel image captioning system with integrated text-to-speech generation that can be deployed to aid individuals with visual impairments. The proposed model combines a convolutional neural network (CNN) with a transformer-based encoder and decoder to generate descriptive captions for an input image. A comparative study assessed the performance of the model when built on five prominent CNN models (VGG16, ResNet50, EfficientNetB0, EfficientNetB1, and Inception V3) while keeping the transformer architecture fixed. The variant using EfficientNetB1 outperformed the other four and was therefore chosen for TransEffiVisNet. Integrating Google Text-to-Speech technology to convert captions into speech further extends the model's functionality, so that it can be deployed in the future to assist visually impaired users.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
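
The pipeline outlined in the abstract (CNN feature extraction, transformer-based caption decoding, then speech output) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `greedy_caption` and `describe_image`, the `<start>`/`<end>` tokens, the greedy decoding strategy, and the toy stand-ins for the CNN and the decoder are all assumptions made for illustration. The abstract does not state how captions are decoded.

```python
from typing import Callable, List, Sequence

# Hypothetical special tokens; the paper's actual vocabulary is not given.
START, END = "<start>", "<end>"


def greedy_caption(
    features: Sequence[float],
    next_token: Callable[[Sequence[float], List[str]], str],
    max_len: int = 20,
) -> str:
    """Build a caption one token at a time until END or max_len tokens."""
    tokens = [START]
    for _ in range(max_len):
        token = next_token(features, tokens)
        if token == END:
            break
        tokens.append(token)
    return " ".join(tokens[1:])  # drop the START marker


def describe_image(image, extract_features, next_token, speak, max_len=20):
    """Image -> features -> caption -> speech, with pluggable stages."""
    features = extract_features(image)   # stands in for the CNN backbone
    caption = greedy_caption(features, next_token, max_len)
    speak(caption)                       # stands in for text-to-speech
    return caption


# --- Toy stand-ins so the sketch runs without any ML library ---------------

def toy_extract_features(image):
    """Assumed placeholder: mean pixel intensity as a 1-element feature."""
    flat = [pixel for row in image for pixel in row]
    return [sum(flat) / len(flat)]


def toy_next_token(features, tokens):
    """Assumed placeholder for the decoder: replays a fixed script."""
    adjective = "bright" if features[0] > 0.5 else "dark"
    script = ["a", adjective, "scene", END]
    return script[len(tokens) - 1]


spoken: List[str] = []  # collects what would be sent to the speech engine
caption = describe_image(
    [[0.9, 0.8], [0.7, 0.6]], toy_extract_features, toy_next_token, spoken.append
)
print(caption)
```

In a real deployment, `extract_features` would wrap a pretrained CNN such as EfficientNetB1, `next_token` would query the trained transformer decoder, and `speak` could wrap a text-to-speech library. Keeping the stages as plain callables mirrors the comparative study's setup, where the CNN backbone is swapped while the rest of the pipeline stays fixed.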