Abstract:Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training that encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning has a tendency to reduce caption faithfulness and even hallucinate. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this extent, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training curriculum that more optimally leverages the contrastive nature of the self-retrieval reward. Jointly, they enable the captioner to describe fine-grained aspects in the image while preserving faithfulness to ground-truth captions. Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessi et al., 2023); and +7.6% on ImageCoDe. Additionally, existing metrics to evaluate captioning systems fail to reward diversity or evaluate a model's fine-grained understanding ability. Our third contribution addresses this by proposing self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g. +4.8% - 7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.

Partial Off-policy Learning: Balance Accuracy and Diversity for Human-Oriented Image Captioning

Diverse Image Captioning via Conditional Variational Autoencoder and Dual Contrastive Learning

Variational Transformer: A Framework Beyond the Trade-off Between Accuracy and Diversity for Image Captioning

Variational Transformer: A Framework Beyond the Tradeoff Between Accuracy and Diversity for Image Captioning

Diverse and Controllable Image Captioning with Part-of-Speech Guidance.

Generating Diverse and Accurate Visual Captions by Comparative Adversarial Learning

Image Captioning with Partially Rewarded Imitation Learning.

Multi-Level Policy and Reward Reinforcement Learning for Image Captioning

Towards Diverse and Accurate Image Captions via Reinforcing Determinantal Point Process

Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning

Image Captioning Based on Adaptive Balancing Loss.

Multi-Level Policy and Reward-Based Deep Reinforcement Learning Framework for Image Captioning

Self-attention Multi-view Representation Learning with Diversity-promoting Complementarity

Improve Diverse Text Generation by Self Labeling Conditional Variational Auto Encoder

Context-Aware Visual Policy Network for Sequence-Level Image Captioning

No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

Fast, Diverse and Accurate Image Captioning Guided by Part-of-Speech

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Quality Diversity through Human Feedback: Towards Open-Ended Diversity-Driven Optimization

Diverse Image Captioning Via GroupTalk

Iteratively Learn Diverse Strategies with State Distance Information