Abstract: Current metrics for video captioning are mostly based on text-level comparison between reference and candidate captions. However, they have some insuperable drawbacks: they cannot handle videos without references, and they may produce biased evaluations because of the one-to-many nature of video-to-text and the neglect of visual relevance. From a human evaluator's viewpoint, a high-quality caption should be consistent with the provided video, but it need not be literally or semantically similar to the reference. Inspired by human evaluation, we propose EMScore (Embedding Matching-based score), a novel reference-free metric for video captioning, which directly measures the similarity between the video and candidate captions. Benefiting from the recent development of large-scale pre-training models, we exploit a well pre-trained vision-language model to extract visual and linguistic embeddings for computing EMScore. Specifically, EMScore combines matching scores at both the coarse-grained (video and caption) and fine-grained (frames and words) levels, which takes both the overall understanding and the detailed characteristics of the video into account. Furthermore, considering the potential information gain, EMScore can be flexibly extended to settings where human-labeled references are available. Last but not least, we collect the VATEX-EVAL and ActivityNet-FOIL datasets to systematically evaluate existing metrics. Experiments on VATEX-EVAL demonstrate that EMScore has higher human correlation and lower reference dependency. The ActivityNet-FOIL experiment verifies that EMScore can effectively identify "hallucinating" captions. The datasets will be released to facilitate the development of video captioning metrics. The code is available at: https://github.com/ShiYaya/emscore.

What problem does this paper attempt to address?

This paper attempts to solve several key problems in video captioning evaluation:

1. **The problem of missing reference captions**: Most current evaluation metrics for video captioning rely on text-level comparison and therefore require reference captions. This limits them in practice, because they cannot be applied to videos that have no reference captions.
2. **The problem of evaluation bias**: Existing metrics may give biased results because of the one-to-many nature of video-to-text and the neglect of visual relevance. For example, a correct caption may be underrated because it does not match the reference captions literally or semantically.
3. **The problem of hallucinated descriptions**: Existing metrics may fail to identify "hallucinated" descriptions, that is, captions that do not match the video content but are similar to the reference captions. This makes the evaluation results inaccurate.

To solve these problems, the authors propose a new evaluation metric named EMScore. EMScore is a reference-free metric based on embedding matching that directly measures the similarity between the video and the candidate captions. Specifically, EMScore uses a large-scale pre-trained vision-language model to extract visual and linguistic embeddings, and combines coarse-grained (whole video and caption) and fine-grained (frames and words) matching scores to evaluate caption quality.

### Main contributions:

1. **Propose EMScore**: A reference-free video caption evaluation metric that directly measures the consistency between video content and candidate captions, using both coarse-grained and fine-grained embedding matching.
2. **Extend to the reference-available condition**: EMScore can be flexibly extended to the setting where reference captions are available, which further improves evaluation accuracy.
3. **Collect datasets**: The authors collect two datasets, VATEX-EVAL and ActivityNet-FOIL, which are used to study the correlation between evaluation metrics and human judgment and the sensitivity to "hallucinated" descriptions, respectively.

### Experimental results:

- **VATEX-EVAL dataset**: EMScore shows higher correlation with human judgment than existing automatic evaluation metrics and depends less on reference captions.
- **ActivityNet-FOIL dataset**: The experiment verifies that EMScore is effective at identifying "hallucinated" descriptions.

Through these contributions, EMScore provides a new and more accurate tool for evaluating video captioning.
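The combination of coarse-grained and fine-grained matching described above can be illustrated with a minimal sketch. The summary does not give the exact formulas, so everything specific here is an assumption made for illustration: the video-level embedding is taken as the mean of the frame embeddings, fine-grained matching is a greedy best-match between words and frames summarized as an F1 score, and the two levels are averaged with equal weight. The function names (`coarse_score`, `fine_score`, `emscore_sketch`) are hypothetical and are not taken from the authors' repository; in practice the embeddings would come from a pre-trained vision-language model rather than toy vectors.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def coarse_score(video_emb: np.ndarray, caption_emb: np.ndarray) -> float:
    """Cosine similarity between one video-level and one caption-level embedding."""
    return float(l2_normalize(video_emb) @ l2_normalize(caption_emb))


def fine_score(frame_embs: np.ndarray, word_embs: np.ndarray) -> float:
    """Greedy frame-word matching summarized as F1 (assumed form, not confirmed by the source)."""
    sim = l2_normalize(word_embs) @ l2_normalize(frame_embs).T  # shape: (words, frames)
    precision = sim.max(axis=1).mean()  # each word matched to its most similar frame
    recall = sim.max(axis=0).mean()     # each frame matched to its most similar word
    denom = precision + recall
    return 0.0 if denom == 0 else float(2 * precision * recall / denom)


def emscore_sketch(frame_embs: np.ndarray, word_embs: np.ndarray,
                   caption_emb: np.ndarray) -> float:
    """Equal-weight average of coarse and fine scores (the weighting is an assumption)."""
    video_emb = frame_embs.mean(axis=0)  # assumption: mean-pooled frames as the video embedding
    return 0.5 * (coarse_score(video_emb, caption_emb)
                  + fine_score(frame_embs, word_embs))


# Toy 3-dimensional embeddings standing in for real model outputs.
frames = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

# A caption whose words line up with the frames.
matching_words = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
matching_caption = matching_words.mean(axis=0)

# A caption whose only word is orthogonal to every frame.
unrelated_words = np.array([[0.0, 0.0, 1.0]])
unrelated_caption = unrelated_words.mean(axis=0)

score_match = emscore_sketch(frames, matching_words, matching_caption)
score_unrelated = emscore_sketch(frames, unrelated_words, unrelated_caption)
print(score_match, score_unrelated)
```

In this toy setup the caption that aligns with the frames scores higher than the unrelated one, which is the behavior a reference-free metric needs in order to flag captions that do not match the video. No reference caption appears anywhere in the computation.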

EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching

Emotional Video Captioning With Vision-Based Emotion Interpretation Network

HICEScore: A Hierarchical Metric for Image Captioning Evaluation

G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o

Learning to Evaluate Image Captioning

Cobra Effect in Reference-Free Image Captioning Metrics

EVQAScore: Efficient Video Question Answering Data Evaluation

VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

Contrastive Semantic Similarity Learning for Image Captioning Evaluation

EvCap: Element-Aware Video Captioning

VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding

Resource-Efficient Reference-Free Evaluation of Audio Captions

A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction

Edit As You Wish: Video Caption Editing with Multi-grained User Control

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

DARTScore: DuAl-Reconstruction Transformer for Video Captioning Evaluation

Enhanced Video Caption Generation Based on Multimodal Features

Learning Video-Text Aligned Representations for Video Captioning

Research on Video Captioning Based on Multifeature Fusion

Remote Sensing Image Captioning with Sequential Attention and Flexible Word Correlation