Abstract: Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for caption evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.

What problem does this paper attempt to address?

The paper addresses a shortcoming of existing image captioning evaluation metrics: they do not capture the full quality or the fine-grained details of captions. These metrics usually rely on non-specific human-written reference captions or on noisy pre-training data, so they perform poorly when judging the grammatical correctness, semantic relevance, and specificity of a caption. They also sometimes wrongly penalize accurate captions that describe elements not covered by the reference sentences.

To address these problems, the authors propose a new learnable evaluation metric, PAC-S++ (Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training). It builds on the CLIP model and regularizes training with additional generated visual and textual positive samples, relying on stronger and carefully curated pre-training data to make the metric more effective. PAC-S++ is also used as the reward signal in the Self-Critical Sequence Training (SCST) stage when fine-tuning a captioning model, which yields semantically richer captions with fewer repetitions and grammatical errors.

The key innovations of PAC-S++ are:

1. **Positive-sample-augmented contrastive learning**: Additional generated visual and textual positive samples strengthen contrastive learning, so the model better captures the relationship between images and texts.
2. **Low-rank fine-tuning**: Low-Rank Adaptation (LoRA) keeps the pre-trained model weights frozen and injects trainable low-rank decomposition matrices. This reduces the number of trainable parameters, lowers the risk of overfitting, and improves training stability.
3. **Comprehensive evaluation method**: PAC-S++ can score image-caption pairs on their own (reference-free) or in combination with reference captions (reference-based). The method is also extended to video caption evaluation, with two embedding-matching strategies: fine-grained and coarse-grained.

In summary, PAC-S++ aims to improve image and video captioning by improving the evaluation metric, bringing it closer to human judgment and reducing hallucinations and grammatical errors during generation.

### Formula Summary

1. **InfoNCE Loss** (image-to-text and text-to-image terms over a batch of \(N\) pairs, with temperature \(\tau\)):
\[ L_{V,T}=-\frac{1}{N}\sum_{i = 1}^{N}\log\frac{\exp(\text{sim}(v_i,t_i)/\tau)}{\sum_{j = 1}^{N}\exp(\text{sim}(v_i,t_j)/\tau)}-\frac{1}{N}\sum_{i = 1}^{N}\log\frac{\exp(\text{sim}(v_i,t_i)/\tau)}{\sum_{j = 1}^{N}\exp(\text{sim}(v_j,t_i)/\tau)} \]
2. **Cosine Similarity** (with \(E_v\) and \(E_t\) the visual and textual encoders):
\[ \text{sim}(v,t)=\cos(\text{Norm}(E_v(v)),\text{Norm}(E_t(t))) \]
3. **Final Loss Function** (with \(V'\) and \(T'\) the generated images and texts):
\[ L = L_{V,T}+\lambda_vL_{V',T}+\lambda_tL_{V,T'} \]
4. **Reference-free Evaluation Score**:
\[ \text{Score}(v,t)=w\cdot\max(\text{sim}(v,t),0) \]
5. **Reference-based Evaluation Score**: \( \text{Ref-Scor}\ldots \)
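As a rough illustration of how the loss and the reference-free score in the formula summary fit together, here is a minimal pure-Python sketch. It sums the image-to-text and text-to-image InfoNCE terms and operates on already-computed embedding vectors (plain lists of floats) rather than on a real CLIP model. The function names and the default values for `tau`, `lambda_v`, `lambda_t`, and `w` are placeholders chosen for this example; they are not taken from the summary above.

```python
import math


def normalize(x):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(c * c for c in x))
    return [c / norm for c in x]


def sim(v, t):
    """Cosine similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(normalize(v), normalize(t)))


def info_nce(V, T, tau=0.07):
    """Symmetric InfoNCE over N matched (image, text) embedding pairs.

    tau is an assumed placeholder temperature.
    """
    n = len(V)
    logits = [[sim(V[i], T[j]) / tau for j in range(n)] for i in range(n)]

    def mean_nll_of_diagonal(rows):
        # Mean over i of -log softmax(rows[i])[i], using a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            total += lse - row[i]
        return total / n

    # Rows give the image-to-text term; columns give the text-to-image term.
    columns = [list(c) for c in zip(*logits)]
    return mean_nll_of_diagonal(logits) + mean_nll_of_diagonal(columns)


def pac_loss(V, T, V_gen, T_gen, lambda_v=0.05, lambda_t=0.1, tau=0.07):
    """L = L_{V,T} + lambda_v * L_{V',T} + lambda_t * L_{V,T'}.

    V_gen and T_gen hold embeddings of the generated images and texts.
    The lambda defaults are assumed placeholders.
    """
    return (info_nce(V, T, tau)
            + lambda_v * info_nce(V_gen, T, tau)
            + lambda_t * info_nce(V, T_gen, tau))


def reference_free_score(v, t, w=2.0):
    """Score(v, t) = w * max(sim(v, t), 0); w is an assumed placeholder."""
    return w * max(sim(v, t), 0.0)
```

In this sketch a batch whose images and texts are correctly paired gets a lower loss than the same batch with the texts shuffled, and the score is clipped at zero for embeddings that point in opposite directions. A practical implementation would compute the embeddings with the CLIP encoders and use a tensor library instead of nested lists.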
