Insights into Object Semantics: Leveraging Transformer Networks for Advanced Image Captioning

Deema Abdal Hafeth, Stefanos Kollias

DOI: https://doi.org/10.3390/s24061796

IF: 3.9

2024-03-11

Sensors

Abstract: Image captioning is a technique used to generate descriptive captions for images. Typically, it involves employing a Convolutional Neural Network (CNN) as the encoder to extract visual features, and a decoder model, often based on Recurrent Neural Networks (RNNs), to generate the captions. Recently, the encoder–decoder architecture has witnessed the widespread adoption of the self-attention mechanism. However, this approach faces certain challenges that require further research. One such challenge is that the extracted visual features do not fully exploit the available image information, primarily due to the absence of semantic concepts. This limitation restricts the ability to fully comprehend the content depicted in the image. To address this issue, we present a new image-Transformer-based model boosted with image object semantic representation. Our model incorporates semantic representation in encoder attention, enhancing visual features by integrating instance-level concepts. Additionally, we employ Transformer as the decoder in the language generation module. By doing so, we achieve improved performance in generating accurate and diverse captions. We evaluated the performance of our model on the MS-COCO and novel MACE datasets. The results illustrate that our model aligns with state-of-the-art approaches in terms of caption generation.

engineering, electrical & electronic; chemistry, analytical; instruments & instrumentation

What problem does this paper attempt to address?

This paper addresses the insufficient extraction of visual features in current image captioning systems, in particular the absence of semantic concepts, which limits how fully a model can understand image content. Existing methods typically use a convolutional neural network (CNN) to extract visual features and a recurrent neural network (RNN) to generate the caption text. This approach handles objects and the relationships between them poorly and lacks an in-depth semantic understanding of the image. To overcome these limitations, the authors propose a Transformer-based image captioning model that enriches the visual features by introducing a semantic representation of objects in the encoder. The model uses Faster R-CNN as an object detector to extract the visual features and class labels of objects in the image, then uses an external knowledge base (such as ConceptNet) to produce semantic word-embedding representations of the object classes. The visual features and the semantic word embeddings are passed jointly to the Transformer encoder, which lets the model attend to the relevant regions and capture meaningful relationships between objects. The decoder is also a Transformer, which is intended to improve the accuracy and diversity of the generated captions. In summary, the paper aims to improve caption quality by integrating semantic information, making captions more accurate, richer, and more diverse.
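
The encoder input described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how the visual and semantic inputs are combined, so the additive fusion below is an assumption, as are the function name `fuse_and_attend`, the feature sizes (2048 for region features, 300 for class-label embeddings), the single attention head, and the randomly initialised weights that stand in for learned parameters. Random arrays also stand in for the Faster R-CNN outputs and the ConceptNet-derived embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_and_attend(visual, semantic, d_model=64, seed=0):
    """Fuse per-object visual and semantic vectors, then apply one
    self-attention step over the detected objects.

    visual:   (n_objects, d_v) region features (stand-in for detector output)
    semantic: (n_objects, d_s) class-label word embeddings
    Returns the attended features (n_objects, d_model) and the
    attention weights (n_objects, n_objects).
    """
    rng = np.random.default_rng(seed)
    d_v, d_s = visual.shape[1], semantic.shape[1]

    # Hypothetical projections; random here, learned in a real model.
    W_vis = rng.normal(scale=d_v ** -0.5, size=(d_v, d_model))
    W_sem = rng.normal(scale=d_s ** -0.5, size=(d_s, d_model))

    # Assumed fusion: project both inputs to d_model and add them.
    x = visual @ W_vis + semantic @ W_sem

    # Single-head scaled dot-product self-attention over the objects.
    W_q, W_k, W_val = (
        rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
        for _ in range(3)
    )
    q, k, v = x @ W_q, x @ W_k, x @ W_val
    weights = softmax(q @ k.T / np.sqrt(d_model), axis=-1)
    return weights @ v, weights


# Usage with random stand-ins for five detected objects.
rng = np.random.default_rng(1)
visual = rng.normal(size=(5, 2048))    # assumed region-feature size
semantic = rng.normal(size=(5, 300))   # assumed embedding size
out, attn = fuse_and_attend(visual, semantic)
print(out.shape, attn.shape)  # (5, 64) (5, 5)
```

Each row of `attn` shows how strongly one object attends to every other object, which is where the semantic embeddings can influence which regions are treated as related. Concatenating the two inputs before projection would be an equally plausible fusion choice.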

Entangled Transformer for Image Captioning

Image Captioning: Transforming Objects into Words

End-to-End Transformer Based Model for Image Captioning

Boosting convolutional image captioning with semantic content and visual relationship

Comprehending and Ordering Semantics for Image Captioning

BENet: bi-directional enhanced network for image captioning

Improving Image Captioning with Better Use of Captions

Tag‐inferring and tag‐guided Transformer for image captioning

An efficient automated image caption generation by the encoder decoder model

An image caption model based on attention mechanism and deep reinforcement learning

Object-Centric Unsupervised Image Captioning

Controllable image caption with an encoder-decoder optimization structure

Image Captioning In the Transformer Age

Enhancing Image Captioning Using Deep Convolutional Generative Adversarial Networks

Semantic association enhancement transformer with relative position for image captioning

Enhanced Modality Transition for Image Captioning

From Captions to Visual Concepts and Back

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network