Abstract: Generating captions that are more diverse and more human-like is of paramount importance in image captioning. Recent research has achieved significant advancements, with the majority adopting end-to-end encoder-decoder architectures that integrate specific feature-text processing. However, the homogeneity of their model structures, the simplicity or complexity of feature-text fusion, and the uniformity of training objectives have all, to some extent, affected the diversity and effectiveness of caption generation, thus limiting the potential applications of this task. Therefore, in this paper, we propose the Regular Constrained Multimodal Fusion (RCMF) method for image captioning to better integrate information across and within modalities, while also approaching human-like fine-grained semantic perception and relationship reasoning capabilities. RCMF first preprocesses images with a Swin Transformer, followed by an extended encoder with a new intra-modal fusion module that uses window-focused linear attention to capture features and leverages refined grid and global visual features. Combining these with text features, RCMF employs a cross-modal fusion module and decoder to deeply model the interaction between text and image. Additionally, RCMF is the first to introduce an additional regulatory modal fusion reasoning (MFR) branch, which goes beyond the above architectures. Its MFR loss, combined with the cross-entropy loss, forms a new training objective that effectively mines fine-grained relationships between images and text and perceives the semantic information of images and their corresponding captions, thereby regulating the generated captions to be more diverse and human-like. Experimental results on the MS COCO 2014 dataset, particularly under the same experimental conditions, demonstrate the outstanding performance of our method, especially in terms of the METEOR, ROUGE-L, CIDEr, and SPICE metrics. Visualization results further intuitively confirm the effectiveness of our RCMF method. Source code is available at https://github.com/200084/RCMF-for-image-caption.
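To make the combined training objective concrete, the following is a minimal sketch of how an auxiliary loss can be added to the caption cross-entropy loss. The abstract does not specify the form of the MFR loss or how the two terms are weighted, so everything here is an assumption for illustration: the stand-in auxiliary term is a binary image-caption matching loss, and the names `matching_loss`, `total_loss`, and `mfr_weight` are hypothetical rather than taken from the RCMF code.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-likelihood of the ground-truth caption tokens.

    token_probs: probabilities the decoder assigned to each correct token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def matching_loss(match_prob, is_matched):
    """Binary cross-entropy on an image-caption match score.

    ASSUMPTION: this is only a stand-in for the MFR loss, whose actual
    definition is not given in the abstract.
    """
    target_prob = match_prob if is_matched else 1.0 - match_prob
    return -math.log(target_prob)


def total_loss(token_probs, match_prob, is_matched, mfr_weight=1.0):
    """Caption cross-entropy plus a weighted auxiliary term.

    ASSUMPTION: a simple weighted sum; mfr_weight is a hypothetical
    hyperparameter, not a value reported by the authors.
    """
    return cross_entropy(token_probs) + mfr_weight * matching_loss(match_prob, is_matched)


if __name__ == "__main__":
    # Decoder probabilities for three ground-truth tokens, plus a match score
    # for a correctly paired image and caption.
    print(total_loss([0.9, 0.7, 0.8], match_prob=0.85, is_matched=True, mfr_weight=0.5))
```

In this kind of setup the auxiliary term acts as a regularizer on the caption loss: raising `mfr_weight` pushes the model harder toward image-text consistency, at the possible expense of token-level likelihood.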

OFCap:Object-aware Fusion for Image Captioning

CapsFusion: Rethinking Image-Text Data at Scale

Fine-Grained Features for Image Captioning

CA-Captioner: A Novel Concentrated Attention for Image Captioning

FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

Face-Cap: Image Captioning using Facial Expression Analysis

Intention Oriented Image Captions With Guiding Objects

Object-Centric Unsupervised Image Captioning

CapOnImage: Context-driven Dense-Captioning on Image

ViTOC: Vision Transformer and Object-aware Captioner

Bounding and Filling: A Fast and Flexible Framework for Image Captioning

LCM-Captioner: A lightweight text-based image captioning method with collaborative mechanism between vision and text

Combining Object-Based Attention And Attributes For Image Captioning

IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning

Auxiliary feature extractor and dual attention-based image captioning

Regular Constrained Multimodal Fusion for Image Captioning

FineFormer: Fine-Grained Adaptive Object Transformer for Image Captioning

Nocaps: Novel object captioning at scale

Image Caption Method from Coarse to Fine Based on Dual Encoder-Decoder Framework

OSIC: A New One-Stage Image Captioner Coined

Entrocap: Zero-Shot Image Captioning with Entropy-Based Retrieval