Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning

Antoine Chaffin, Ewa Kijak, Vincent Claveau
2024-02-22
Abstract: Training image captioning models using teacher forcing results in very generic samples, whereas more distinctive captions can be very useful in retrieval applications or to produce alternative texts describing images for accessibility. Reinforcement Learning (RL) makes it possible to use the cross-modal retrieval similarity score between the generated caption and the input image as a reward to guide the training, leading to more distinctive captions. Recent studies show that pre-trained cross-modal retrieval models can be used to provide this reward, completely eliminating the need for reference captions. However, we argue in this paper that Ground Truth (GT) captions can still be useful in this RL framework. We propose a new image captioning model training strategy that makes use of GT captions in different ways. Firstly, they can be used to train a simple MLP discriminator that serves as a regularization to prevent reward hacking and ensures the fluency of generated captions, resulting in a textual GAN setup extended for multimodal inputs. Secondly, they can serve as additional trajectories in the RL strategy, resulting in a teacher forcing loss weighted by the similarity of the GT to the image. This objective acts as an additional learning signal grounded in the distribution of the GT captions. Thirdly, they can serve as strong baselines when added to the pool of captions used to compute the proposed contrastive reward, reducing the variance of the gradient estimate. Experiments on MS-COCO demonstrate the value of the proposed training strategy for producing highly distinctive captions while maintaining high writing quality.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper addresses a key issue in image captioning: generated captions are too generic and lack distinctiveness. Models trained with teacher forcing tend to produce generic sentences that describe the most salient objects in the image. These sentences are correct, but they say little about the particular image. This causes two problems:

1. **Overly Generic**: For example, "a person standing there" is a correct description of many images showing someone, but it does not single out any particular photo.
2. **Lack of Informativeness**: Generic captions do not provide enough detail to distinguish similar images, which is a shortcoming for retrieval applications and for alternative texts that describe images for visually impaired readers.

To overcome these issues, the paper proposes a training strategy that leverages Ground Truth (GT) captions to improve the trade-off between the distinctiveness and the writing quality of generated captions. GT captions are used in three ways:

1. **Use of a Discriminator**: A simple Multi-Layer Perceptron (MLP) discriminator is trained to distinguish GT captions from generated ones. It acts as a regularizer that prevents reward hacking and keeps the generated captions fluent.
2. **Using GT as Additional Trajectories**: GT captions serve as additional trajectories in the reinforcement learning process, which yields a teacher forcing loss weighted by the similarity of each GT caption to its image.
3. **Contrastive Reward Mechanism**: GT captions are added to the pool of captions used to compute the contrastive reward, where they act as strong baselines that reduce the variance of the gradient estimate.

Experiments on MS-COCO show that this strategy produces highly distinctive captions while maintaining high writing quality.
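
To make the second and third points more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that image and caption embeddings come from some pre-trained cross-modal retrieval model such as CLIP, that the contrastive reward is the softmax probability of retrieving a caption for its image from a pool containing both generated and GT captions, and that the GT reward is used directly as the teacher forcing weight. The function names, the `temperature` value, and the exact form of the reward are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def contrastive_rewards(img_emb, gen_emb, gt_emb, temperature=0.07):
    """Hypothetical contrastive reward with GT captions added to the pool.

    img_emb, gen_emb, gt_emb: arrays of shape (B, D), where row i of each
    array belongs to the same image i.
    Returns (gen_reward, gt_reward), each of shape (B,), with values in (0, 1).
    """
    img = l2_normalize(img_emb)
    # Pool of candidate captions: B generated captions followed by B GT captions.
    pool = np.concatenate([l2_normalize(gen_emb), l2_normalize(gt_emb)], axis=0)

    logits = img @ pool.T / temperature           # (B, 2B) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # softmax over the pool

    batch = img.shape[0]
    idx = np.arange(batch)
    gen_reward = probs[idx, idx]                  # generated caption of image i
    gt_reward = probs[idx, idx + batch]           # GT caption of image i
    return gen_reward, gt_reward


def weighted_teacher_forcing_loss(gt_logprob, gt_reward):
    """Teacher forcing loss where each GT caption is weighted by its reward.

    gt_logprob: (B,) log-likelihood of each GT caption under the captioner.
    gt_reward:  (B,) non-negative weight, treated as a constant (no gradient).
    """
    return float(np.mean(-gt_reward * gt_logprob))
```

Under these assumptions, a generated caption only earns a high reward if it matches its image better than both the other generated captions and every GT caption in the batch, which is one way GT captions could act as strong competitors in the pool.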