Abstract: The task of image difference captioning aims at locating changed objects in similar image pairs and describing the difference in natural language. The key challenges of this task are to comprehend the context of image pairs sufficiently and to locate the changed objects accurately in the presence of viewpoint change. Previous studies focus on pixel-level image features, neglecting the rich explicit features of objects in an image pair, which are beneficial for generating a fine-grained difference caption. Additionally, existing generative models struggle to locate the differences accurately under the interference of viewpoint change. To address these issues, we propose an Instance-Level Fine-Grained Difference Captioning (IFDC) model, which consists of a fine-grained feature extraction module, a multi-round feature fusion module, a similarity-based difference finding module, and a difference captioning module. To describe the changed objects comprehensively, we extract fine-grained features at the instance level, i.e., visual, semantic, and positional features, as the objects' representation. To make the model more robust to viewpoint change, we design a similarity-based difference finding module to locate the changed objects accurately. Extensive experiments show that our IFDC model achieves performance comparable to state-of-the-art models on the CLEVR-Change and Spot-the-Diff datasets, thus verifying the effectiveness of our proposed model. Our source code is available at https://github.com/VISLANG-Lab/IFDC.
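
The idea behind a similarity-based difference finding step can be illustrated with a minimal sketch. This is not the IFDC implementation: the function names `cosine` and `find_changed`, the use of cosine similarity, the best-match rule, and the threshold of 0.9 are all assumptions made for illustration. Each object instance is represented by a feature vector, and an instance is flagged as changed when no instance in the other image is sufficiently similar to it. Because the comparison is between instance features rather than pixel positions, a shift in viewpoint that moves an object in the frame need not mark it as changed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_changed(before, after, threshold=0.9):
    """Return (removed, added): indices of instances with no close match in the other image.

    `before` and `after` are lists of instance feature vectors.
    `threshold` is a hypothetical cut-off on the best cosine similarity.
    """
    def unmatched(src, dst):
        return [
            i for i, f in enumerate(src)
            if max((cosine(f, g) for g in dst), default=0.0) < threshold
        ]

    return unmatched(before, after), unmatched(after, before)


# Toy features: the first object is nearly unchanged, the second is replaced.
before = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
after = [[0.98, 0.05, 0.0], [0.0, 0.0, 1.0]]
removed, added = find_changed(before, after)
print(removed, added)
```

In this toy example the second instance of each image has no close counterpart in the other, so both are reported as candidate changes. A real model would learn the features and the matching rather than use a fixed threshold.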

Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

Audio Difference Learning for Audio Captioning

EDTC: enhance depth of text comprehension in automated audio captioning

Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval

CaptionNet: Automatic End-to-End Siamese Difference Captioning Model with Attention

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Context-aware Difference Distilling for Multi-change Captioning

Tell-the-difference: Fine-grained Visual Descriptor via a Discriminating Referee

Image Difference Captioning With Instance-Level Fine-Grained Feature Representation

Semantic Relation-aware Difference Representation Learning for Change Captioning

Local Information Assisted Attention-Free Decoder for Audio Captioning

Bidirectional difference locating and semantic consistency reasoning for change captioning

Image Difference Captioning with Pre-training and Contrastive Learning

OneDiff: A Generalist Model for Image Difference Captioning

Towards Diverse and Efficient Audio Captioning via Diffusion Models

Icnn-Transformer: an Improved CNN-Transformer with Channel-spatial Attention and Keyword Prediction for Automated Audio Captioning

Cacophony: An Improved Contrastive Audio-Text Model

Neighborhood Contrastive Transformer for Change Captioning

Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning

Efficient Audio Captioning with Encoder-Level Knowledge Distillation