Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Sara Sarto, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
2024-10-10
Abstract: Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for caption evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Multimedia
What problem does this paper attempt to address?
The paper addresses a limitation of existing image captioning evaluation metrics: they fail to capture the full quality and fine-grained details of captions. These metrics usually rely on non-specific human-written reference captions or on noisy pre-training data, so they perform poorly when judging the grammatical correctness, semantic relevance, and specificity of captions. They can also wrongly penalize accurate captions that describe elements not covered by the reference sentences, which makes the evaluation unreliable.

To address these problems, the authors propose PAC-S++ (Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training), a learnable evaluation metric that builds on the CLIP model and regularizes it with additional generated visual and textual positive samples. PAC-S++ relies on stronger, carefully curated pre-training data to improve the effectiveness of the metric. It is also used as the reward signal in the Self-Critical Sequence Training (SCST) stage when fine-tuning a captioning model, which increases the semantic richness of the generated captions and reduces repetitions and grammatical errors.

The key innovations of PAC-S++ are:

1. **Positive-augmented contrastive learning**: Additional generated visual and textual positive samples strengthen contrastive learning, so the model captures the relationship between images and text more accurately.
2. **Low-rank fine-tuning**: Low-Rank Adaptation (LoRA) keeps the pre-trained model weights frozen and injects trainable low-rank decomposition matrices. This reduces the number of trainable parameters, lowers the risk of overfitting, and improves training stability.
3. **Comprehensive evaluation**: PAC-S++ can score an image-caption pair on its own, or combine the pair with reference captions for a more detailed evaluation. The method is also extended to video caption evaluation, with two embedding-matching strategies: fine-grained and coarse-grained.

In summary, PAC-S++ aims to improve image and video captioning by providing an evaluation metric that aligns more closely with human judgment and that, when used as a training reward, reduces hallucinations and grammatical errors in the generated captions.

### Formula Summary

1. **InfoNCE Loss** (image-to-text term plus text-to-image term):
   \[ L_{V,T}=-\frac{1}{N}\sum_{i = 1}^{N}\log\frac{\exp(\text{sim}(v_i,t_i)/\tau)}{\sum_{j = 1}^{N}\exp(\text{sim}(v_i,t_j)/\tau)}-\frac{1}{N}\sum_{i = 1}^{N}\log\frac{\exp(\text{sim}(v_i,t_i)/\tau)}{\sum_{j = 1}^{N}\exp(\text{sim}(v_j,t_i)/\tau)} \]
2. **Cosine Similarity**:
   \[ \text{sim}(v,t)=\cos(\text{Norm}(E_v(v)),\text{Norm}(E_t(t))) \]
3. **Final Loss Function**:
   \[ L = L_{V,T}+\lambda_vL_{V',T}+\lambda_tL_{V,T'} \]
4. **Reference-free Evaluation Score**:
   \[ \text{Score}(v,t)=w\cdot\max(\text{sim}(v,t),0) \]
5. **Reference-based Evaluation Score**:
   \[ \text{Ref - Scor}
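
As an illustration of items 1 to 4 of the formula summary, the following NumPy sketch computes a symmetric InfoNCE loss, the positive-augmented combination, and the reference-free score on precomputed embeddings. It is not the authors' implementation. The function names, the choice to sum the image-to-text and text-to-image terms, the reading of `v_pos` and `t_pos` as embeddings of the generated positive images and captions (V' and T'), and the default values of `tau`, `lambda_v`, `lambda_t`, and `w` are all assumptions made for the example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(v, t, tau=0.07):
    """Symmetric InfoNCE over N matched (image, caption) embedding pairs.

    v, t: arrays of shape (N, D); row i of v is paired with row i of t.
    tau is a placeholder temperature, not a value taken from the paper.
    """
    v, t = l2_normalize(v), l2_normalize(t)
    logits = v @ t.T / tau  # logits[i, j] = sim(v_i, t_j) / tau
    # Image-to-text: normalize over captions j for each image i (rows).
    i2t = -np.mean(np.diag(log_softmax(logits, axis=1)))
    # Text-to-image: normalize over images j for each caption i (columns).
    t2i = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return i2t + t2i


def positive_augmented_loss(v, t, v_pos, t_pos,
                            lambda_v=0.05, lambda_t=0.1, tau=0.07):
    """L = L_{V,T} + lambda_v * L_{V',T} + lambda_t * L_{V,T'}.

    v_pos / t_pos are assumed to hold embeddings of generated positives.
    The lambda defaults are illustrative placeholders.
    """
    return (info_nce(v, t, tau)
            + lambda_v * info_nce(v_pos, t, tau)
            + lambda_t * info_nce(v, t_pos, tau))


def reference_free_score(v, t, w=2.0):
    """Score(v, t) = w * max(sim(v, t), 0); w is a placeholder scale."""
    sim = np.sum(l2_normalize(v) * l2_normalize(t), axis=-1)
    return w * np.maximum(sim, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = rng.normal(size=(8, 16))
    t = v + 0.1 * rng.normal(size=(8, 16))      # captions close to their images
    v_pos = v + 0.1 * rng.normal(size=(8, 16))  # stand-in for generated images
    t_pos = t + 0.1 * rng.normal(size=(8, 16))  # stand-in for generated captions
    print("loss:", positive_augmented_loss(v, t, v_pos, t_pos))
    print("scores:", reference_free_score(v, t))
```

In practice the embeddings would come from the CLIP visual and textual encoders rather than random vectors. Clamping the similarity at zero keeps the score non-negative, so a caption that is anti-correlated with the image scores no lower than an unrelated one.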