Attribute Guided Fusion Network for Obtaining Fine-Grained Image Captions

Md. Bipul Hossen, Zhongfu Ye, Amr Abdussalam, Fazal E Wahab
DOI: https://doi.org/10.1007/s11042-024-19410-6
IF: 2.577
2024-01-01
Multimedia Tools and Applications
Abstract: Fine-grained image captioning, a vision-to-language task, is gaining traction in multimedia research, and attribute selection is now recognized as pivotal to improving performance. While promising results have been achieved, several challenges remain: 1) many existing image captioning approaches focus solely on visual features, overlooking potentially significant textual features such as attributes; 2) attributes are used indiscriminately at every time step, even when many of them are irrelevant to the word currently being generated, which degrades performance. To tackle these challenges, we propose an Attribute Guided Fusion (AGF) image captioning network designed to generate visually and contextually rich captions. Our model incorporates a unified Attribute Selector Module (ASM), which dynamically selects the most relevant attributes based on the linguistic context, so that different attributes are used at different time steps. By excluding irrelevant attributes, this module exploits semantic information to help produce accurate and contextually fitting captions. Moreover, our model integrates a fusion mechanism that combines visual information from a guided visual attention module with the attribute information chosen by the ASM. This integration helps mitigate the visual-semantic disparity between attributes and images. Comprehensive experiments reveal the superiority of the AGF model over advanced approaches, attaining an average CIDEr-D score of 8.81
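The two ideas in the abstract, context-dependent attribute selection and fusion of the selected attributes with attended visual features, can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product scoring, the top-k cutoff, the scalar sigmoid gate, and every name below (`select_attributes`, `fuse`, `attr_embs`, `context`, `visual`, `k`) are assumptions chosen only to make the idea concrete.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_attributes(attr_embs, context, k):
    """Keep the k attributes best aligned with the linguistic context.

    Assumption: relevance is a plain dot product between each attribute
    embedding and the context vector of the current time step.
    """
    scores = [dot(e, context) for e in attr_embs]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])  # renormalize over kept attributes
    return top, weights


def fuse(visual, attr_embs, context, k=2):
    """Blend the attended visual vector with the selected-attribute vector.

    Assumption: a single scalar gate decides how much weight the visual
    side receives; the paper's fusion mechanism may differ.
    """
    top, weights = select_attributes(attr_embs, context, k)
    dim = len(visual)
    attr_ctx = [sum(w * attr_embs[i][d] for i, w in zip(top, weights))
                for d in range(dim)]
    gate = 1.0 / (1.0 + math.exp(-(dot(context, visual) - dot(context, attr_ctx))))
    fused = [gate * v + (1.0 - gate) * a for v, a in zip(visual, attr_ctx)]
    return top, fused


# Toy example: three hypothetical attribute embeddings ("dog", "grass", "red").
attr_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
context = [2.0, 0.5, -1.0]   # hypothetical decoder state at one time step
visual = [0.5, 0.5, 0.5]     # hypothetical attended visual feature
top, fused = fuse(visual, attr_embs, context, k=2)
print(top, fused)
```

Because the context vector changes at every decoding step, the selected indices change with it, which is the behavior the abstract attributes to the ASM; the third attribute here is dropped because it scores lowest against this particular context.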