Abstract: Image captioning (IC) systems, which automatically generate a text description of the salient objects in an image (real or synthetic), have seen great progress over the past few years due to the development of deep neural networks. IC plays an indispensable role in human society, for example, in labeling massive numbers of photos for scientific studies and in assisting visually impaired people in perceiving the world. However, even top-notch IC systems, such as Microsoft Azure Cognitive Services and IBM Image Caption Generator, may return incorrect results, leading to the omission of important objects, deep misunderstanding, and threats to personal safety. To address this problem, we propose MetaIC, the first metamorphic testing approach to validate IC systems. Our core idea is that the object names in a caption should exhibit directional changes after an object is inserted into the image. Specifically, MetaIC (1) extracts objects from existing images to construct an object corpus; (2) inserts an object into an image via novel object resizing and location tuning algorithms; and (3) reports image pairs whose captions do not differ in the expected way. In our evaluation, we use MetaIC to test one widely adopted image captioning API and five state-of-the-art (SOTA) image captioning models. Using 1,000 seeds, MetaIC successfully reports 16,825 erroneous issues with high precision (84.9%-98.4%). The errors fall into three kinds: misclassification, omission, and incorrect quantity. We visualize the errors reported by MetaIC, which shows that a flexible overlapping setting facilitates IC testing by increasing and diversifying the reported errors. In addition, MetaIC can be generalized to detect label errors in training datasets: it detected 151 incorrect labels in MS COCO Caption, a standard dataset in image captioning.
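The check in step (3) can be illustrated with a minimal Python sketch. It is not MetaIC's implementation: the toy vocabulary `OBJECT_VOCAB`, the helpers `object_names` and `violates_relation`, and the naive plural handling are all assumptions made for illustration. The sketch assumes the expected "directional change" is that the caption of the synthesized image names every object from the original caption plus the inserted one. It compares sets of names only, so it would not catch incorrect-quantity errors.

```python
import re

# Hypothetical toy vocabulary; a real setup would need a much larger object list.
OBJECT_VOCAB = {"dog", "cat", "person", "bench", "car", "frisbee"}


def object_names(caption: str) -> set:
    """Return the vocabulary objects named in a caption (naive plural stripping)."""
    names = set()
    for token in re.findall(r"[a-z]+", caption.lower()):
        if token in OBJECT_VOCAB:
            names.add(token)
        elif token.endswith("s") and token[:-1] in OBJECT_VOCAB:
            names.add(token[:-1])  # e.g. "dogs" -> "dog"
    return names


def violates_relation(orig_caption: str, new_caption: str, inserted: str) -> bool:
    """Flag a caption pair whose object names did not change as assumed.

    Assumed relation: names(new) == names(orig) | {inserted}.
    """
    expected = object_names(orig_caption) | {inserted}
    return object_names(new_caption) != expected


if __name__ == "__main__":
    orig = "a dog sitting on a bench"
    # Inserted object is named and the original objects are kept: not reported.
    print(violates_relation(orig, "a dog and a cat sitting on a bench", "cat"))
    # Inserted object is missing from the new caption: reported.
    print(violates_relation(orig, "a dog sitting on a bench", "cat"))
```

A pair flagged by `violates_relation` is only a suspected error; the precision figures in the abstract indicate that some reported pairs turn out to be false positives.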

Microsoft COCO Captions: Data Collection and Evaluation Server

COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval

From Captions to Visual Concepts and Back

ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO

IC3: Image Captioning by Committee Consensus

COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images

Improving Multimodal Datasets with Image Captioning

Learning to Evaluate Image Captioning

NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning

Evaluating authenticity and quality of image captions via sentiment and semantic analyses

Visuals to Text: A Comprehensive Review on Automatic Image Captioning

Self-Distillation for Few-Shot Image Captioning (Supplementary Materials)

HICEScore: A Hierarchical Metric for Image Captioning Evaluation

Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method

Benchmarking and Improving Detail Image Caption

A Comprehensive Analysis of Real-World Image Captioning and Scene Identification

D-CNN: A New model for Generating Image Captions with Text Extraction Using Deep Learning for Visually Challenged Individuals

Automated Testing of Image Captioning Systems

Improving Image Captioning with Better Use of Caption

Mitigating Gender Bias in Captioning Systems

TextCaps: a Dataset for Image Captioning with Reading Comprehension