Abstract: Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. Conversely, recent attempts to employ image-text models like CLIP as reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a CLIP-based reward. To this end, our discriminator directly incorporates negative samples from a frozen captioner, which not only significantly improves the quality and richness of the generated captions but also reduces the fine-tuning time in comparison to using the CIDEr score as the sole metric for optimization. Experimental results demonstrate the effectiveness of our training strategy on both standard and zero-shot image captioning datasets.

What problem does this paper attempt to address?

This paper addresses two key flaws in how existing image captioning models are fine-tuned:

1. **Limitations of Traditional Reward Mechanisms**: Fine-tuning with hand-crafted rewards (such as the CIDEr metric) can improve caption quality at the sequence level, but it often restricts the richness and semantic depth of descriptions and pushes the model to imitate the style of the ground-truth sentences, thus losing detail and specificity.
2. **Problems with CLIP-Based Rewards**: Recent attempts to use image-text models like CLIP as rewards can produce semantically rich sentences, but they often lead to grammatical errors and repetition and ignore correct word order.

To solve these problems, the authors propose a new method named Self-Cap. Self-Cap relies on a learnable reward model that is trained with self-generated negative samples and can judge how consistent a caption is with its corresponding image. Specifically, the discriminator is a fine-tuned contrastive image-text model that aims to promote caption correctness while avoiding the anomalies common in CLIP-based reward training. In addition, directly integrating negative samples from the frozen caption generator significantly improves the quality and richness of the generated captions and reduces the fine-tuning time compared with optimizing the CIDEr score alone.

### Main Contributions

- **Proposing a New Self-Supervised Reward Mechanism**: Introducing self-generated negative samples improves the evaluation of caption-image consistency and avoids the limitations of traditional reward mechanisms.
- **Improving the Quality and Richness of Caption Generation**: Experimental results show that Self-Cap performs well on both standard and zero-shot image captioning datasets, and that the generated captions are both more in line with human judgment and more grammatically accurate.
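The summary does not spell out the discriminator's training objective, so the following is only a minimal sketch of one plausible reading: an InfoNCE-style contrastive loss in which captions produced by the frozen captioner are appended as extra negatives for each image. The function names, the temperature value of 0.07, and the random embeddings are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def discriminator_loss(image_emb, positive_emb, negative_emb, temperature=0.07):
    """Contrastive loss with self-generated negatives (hypothetical sketch).

    image_emb:    (B, D) image embeddings
    positive_emb: (B, D) embeddings of ground-truth captions
    negative_emb: (B, D) embeddings of captions from a frozen captioner
    """
    image_emb = l2_normalize(image_emb)
    # Candidate texts: B positives followed by B generated negatives.
    text_emb = l2_normalize(np.concatenate([positive_emb, negative_emb], axis=0))
    logits = image_emb @ text_emb.T / temperature            # (B, 2B)
    # Numerically stable log-softmax over all candidate captions.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(image_emb.shape[0])
    # The correct caption for image i sits in column i.
    return float(-log_probs[idx, idx].mean())


def caption_reward(image_emb, caption_emb):
    """Reward for one image-caption pair: cosine similarity under the discriminator."""
    return float(l2_normalize(image_emb) @ l2_normalize(caption_emb))


# Toy data: random vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
images = rng.normal(size=(4, 8))
loss_aligned = discriminator_loss(images, images, -images)
loss_swapped = discriminator_loss(images, -images, images)
```

In this toy setup the loss is low when the ground-truth captions align with the images and high when the generated negatives align better, which is the behaviour a consistency-based reward would need.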
### Formula Representation

Some of the formulas used in the paper include:

- **Cross-Entropy Loss**:
  \[ L_{XE}(\theta)=-\sum_{t = 1}^{T}\log\left(P(w_t|w_{1:t - 1}, R)\right) \]
- **CLIP Similarity Calculation**:
  \[ S_{ij}=\text{sim}(T_i, V_j) \]
  where $\text{sim}(\cdot)$ represents cosine similarity.
- **SCST Expected Gradient**:
  \[ \nabla_\theta L_{SCST}(I_i, s'_i, \theta)=(D_r(I_i, s'_i)-b)\nabla_\theta\log f_\theta(s'_i) \]
  where $b$ is the baseline, used to reduce the variance of the gradient estimate.

Through these improvements, Self-Cap can generate richer and more natural image captions while maintaining grammatical accuracy.
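As a rough illustration of the three formulas above, the sketch below evaluates each one with NumPy on toy inputs. It follows the expressions as written in this summary; in particular, the SCST term is computed with respect to per-step logits of a softmax policy, which is an assumption about how $f_\theta$ is parameterized, and the summary does not state whether that term is ascended or descended. All function names are hypothetical.

```python
import numpy as np


def cross_entropy_loss(token_log_probs):
    """L_XE = -sum_t log P(w_t | w_{1:t-1}, R), given the per-token log-probabilities."""
    return float(-np.sum(token_log_probs))


def similarity_matrix(text_emb, image_emb, eps=1e-8):
    """S_ij = cosine similarity between text embedding T_i and visual embedding V_j."""
    t = text_emb / (np.linalg.norm(text_emb, axis=1, keepdims=True) + eps)
    v = image_emb / (np.linalg.norm(image_emb, axis=1, keepdims=True) + eps)
    return t @ v.T


def scst_logit_gradient(logits, sampled_tokens, reward, baseline):
    """(reward - baseline) * grad of log f_theta(s') w.r.t. the logits.

    logits:         (T, V) unnormalized scores at each decoding step
    sampled_tokens: (T,)   indices of the sampled caption s'
    """
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    one_hot = np.zeros_like(probs)
    one_hot[np.arange(len(sampled_tokens)), sampled_tokens] = 1.0
    # d/d(logits) of log softmax(logits)[token] equals one_hot - probs.
    return (reward - baseline) * (one_hot - probs)


# Toy example: a two-token caption over a three-word vocabulary.
xe = cross_entropy_loss(np.log(np.array([0.5, 0.5])))
emb = np.array([[1.0, 0.0], [0.0, 2.0]])
sim = similarity_matrix(emb, emb)
grad = scst_logit_gradient(np.zeros((2, 3)), np.array([0, 2]), reward=1.0, baseline=0.5)
```

When the reward equals the baseline the gradient vanishes, so only captions that score above or below the baseline move the policy; this is the variance-reduction role of $b$ mentioned above.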

Fluent and Accurate Image Captioning with a Self-Trained Reward Model
