Insights into Object Semantics: Leveraging Transformer Networks for Advanced Image Captioning

Deema Abdal Hafeth, Stefanos Kollias
DOI: https://doi.org/10.3390/s24061796
IF: 3.9
2024-03-11
Sensors
Abstract: Image captioning is a technique used to generate descriptive captions for images. Typically, it involves employing a Convolutional Neural Network (CNN) as the encoder to extract visual features, and a decoder model, often based on Recurrent Neural Networks (RNNs), to generate the captions. Recently, the encoder–decoder architecture has witnessed the widespread adoption of the self-attention mechanism. However, this approach faces certain challenges that require further research. One such challenge is that the extracted visual features do not fully exploit the available image information, primarily due to the absence of semantic concepts. This limitation restricts the ability to fully comprehend the content depicted in the image. To address this issue, we present a new image-Transformer-based model boosted with image object semantic representation. Our model incorporates semantic representation in encoder attention, enhancing visual features by integrating instance-level concepts. Additionally, we employ Transformer as the decoder in the language generation module. By doing so, we achieve improved performance in generating accurate and diverse captions. We evaluated the performance of our model on the MS-COCO and novel MACE datasets. The results illustrate that our model aligns with state-of-the-art approaches in terms of caption generation.
engineering, electrical & electronic; chemistry, analytical; instruments & instrumentation
What problem does this paper attempt to address?
This paper addresses a weakness in current image captioning: the extracted visual features carry no explicit semantic concepts, which limits how fully a model can understand image content. Existing methods mainly rely on a convolutional neural network (CNN) to extract visual features and a recurrent neural network (RNN) to generate the caption text. This approach handles objects and the relationships between them poorly, because it lacks an in-depth semantic understanding of the image. To overcome these limitations, the authors propose a new Transformer-based image captioning model that enriches the visual features with a semantic representation of the objects in the encoder. Specifically, the model uses Faster R-CNN as an object detector to extract the visual features and class labels of the objects in an image. It then uses an external knowledge base (such as ConceptNet) to generate semantic word-embedding representations of the object classes. The visual features and the semantic word embeddings are passed jointly as input to the Transformer encoder, which lets the model attend more closely to relevant regions and capture meaningful relationships between image objects. The decoder also adopts the Transformer architecture, with the aim of improving the accuracy and diversity of the generated captions. In summary, the main goal of the paper is to improve caption quality by integrating semantic information, making captions more accurate, richer, and more diverse.
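The encoder input described above can be illustrated with a minimal sketch. It is not the authors' implementation: the random vectors stand in for Faster R-CNN region features and ConceptNet class embeddings, the fusion by concatenation is an assumption (the summary only says the two representations are passed jointly), and the names `fuse` and `self_attention`, the dimensions, and the labels are all hypothetical. A single attention head replaces the full Transformer encoder.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random projection matrix (stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def fuse(visual_feats, labels, label_embeddings):
    """Assumed fusion: concatenate each region feature with its label embedding."""
    return [feat + label_embeddings[label] for feat, label in zip(visual_feats, labels)]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the fused tokens."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        # Each output is the attention-weighted sum of the value vectors.
        outputs.append(
            [sum(a * v[j] for a, v in zip(attn, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


rng = random.Random(0)
visual_dim, embed_dim, model_dim = 6, 4, 5  # toy sizes, chosen for illustration

# Hypothetical detector output: one feature vector and one class label per region.
labels = ["dog", "frisbee", "person"]
visual_feats = [[rng.uniform(-1, 1) for _ in range(visual_dim)] for _ in labels]

# Hypothetical knowledge-base embeddings, one vector per object class.
label_embeddings = {
    label: [rng.uniform(-1, 1) for _ in range(embed_dim)] for label in labels
}

tokens = fuse(visual_feats, labels, label_embeddings)
w_q = random_matrix(model_dim, visual_dim + embed_dim, rng)
w_k = random_matrix(model_dim, visual_dim + embed_dim, rng)
w_v = random_matrix(model_dim, visual_dim + embed_dim, rng)
encoded, attention = self_attention(tokens, w_q, w_k, w_v)

print(len(tokens), len(tokens[0]))    # number of regions, fused token size
print(len(encoded), len(encoded[0]))  # number of regions, model dimension
```

Because each token carries both appearance and class semantics, the attention weights between regions can depend on what the objects are as well as how they look. Other fusion choices, such as adding projected embeddings or biasing the attention scores, would fit the same description.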