Abstract: Fine-grained image captioning is a focal point of vision-to-language research and has attracted considerable attention for generating accurate and contextually relevant image captions. Effective prediction and utilization of attributes play a crucial role in enhancing image captioning performance. Prior attribute-related methods have made progress, but they either focus on predicting attributes related to the input image or concentrate on predicting linguistic context-related attributes at each time step in the language model. These approaches often overlook the importance of balancing visual and linguistic contexts, leading to ineffective exploitation of semantic information and a subsequent decline in performance. To address these issues, an Independent Attribute Predictor (IAP) is introduced to precisely predict attributes related to the input image by leveraging relationships between visual objects and attribute embeddings. Following this, an Enhanced Attribute Predictor (EAP) is proposed, which first predicts linguistic context-related attributes and then uses prior probabilities from the IAP module to rebalance image-related and linguistic context-related attributes, thereby generating more robust and enhanced attribute probabilities. These refined attributes are then integrated into the language LSTM layer to ensure accurate word prediction at each time step. The integration of the IAP and EAP modules in our proposed image captioning with the enhanced attribute predictor (ICEAP) model effectively incorporates high-level semantic details, enhancing overall model performance. ICEAP outperforms contemporary models, yielding significant average improvements in CIDEr-D scores of 10.62% on MS-COCO, 9.63% on Flickr30K and 7.74% on Flickr8K using cross-entropy optimization, with qualitative analysis confirming its ability to generate fine-grained captions.
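
The abstract does not state how the EAP combines the IAP prior probabilities with the linguistic context-related attribute predictions. As a rough illustration only, the sketch below assumes a simple convex combination of the two; the function name `rebalance`, the mixing weight `alpha`, and the use of a sigmoid over per-attribute logits are all hypothetical choices made for this example and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def rebalance(prior_probs, context_logits, alpha=0.5):
    """Blend image-level attribute priors with per-step context predictions.

    prior_probs    -- per-attribute probabilities from an image-level
                      predictor (playing the role of the IAP output)
    context_logits -- per-attribute logits predicted from the linguistic
                      context at the current time step
    alpha          -- hypothetical mixing weight in [0, 1]; 1.0 trusts the
                      image prior only, 0.0 trusts the context only
    """
    if len(prior_probs) != len(context_logits):
        raise ValueError("prior_probs and context_logits must have equal length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    context_probs = [sigmoid(z) for z in context_logits]
    # Assumed rebalancing rule: a convex combination of the two sources.
    return [alpha * p + (1.0 - alpha) * c
            for p, c in zip(prior_probs, context_probs)]


if __name__ == "__main__":
    priors = [0.9, 0.1]   # illustrative values, not from the paper
    logits = [0.0, 0.0]   # an uninformative context: sigmoid(0) = 0.5
    print(rebalance(priors, logits, alpha=0.5))
```

With an uninformative context, the blended probabilities move halfway from 0.5 toward the image priors. In a model such as the one described, the resulting vector would be the input passed on to the language LSTM at each time step; a learned gate could replace the fixed `alpha`, but that is likewise an assumption.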

Attribute Guided Fusion Network for Obtaining Fine-Grained Image Captions

Attribute-Driven Filtering: A New Attributes Predicting Approach for Fine-Grained Image Captioning

Cascade Attention Fusion for Fine-Grained Image Captioning Based on Multi-Layer LSTM

ICEAP: An advanced fine-grained image captioning network with enhanced attribute predictor

GVA: guided visual attention approach for automatic image caption generation

Dynamic-balanced Double-Attention Fusion for Image Captioning

Scene captioning with deep fusion of images and point clouds

Feature Fusion Based on Neural Image Captioning with Spatial Attention

Delving Into Precise Attention In Image Captioning

Adaptive semantic guidance network for video captioning

Image Captioning with End-to-End Attribute Detection and Subsequent Attributes Prediction

Incorporating retrieval-based method for feature enhanced image captioning

Image Captioning with Attribute Refinement

M-FFN: multi-scale feature fusion network for image captioning

Auxiliary feature extractor and dual attention-based image captioning

Avtmnet: Adaptive Visual-Text Merging Network for Image Captioning

FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

Image Captioning using Facial Expression and Attention

Fine-grained image emotion captioning based on Generative Adversarial Networks

A Dual-Feature-Based Adaptive Shared Transformer Network for Image Captioning