Abstract: In computer vision, Image Difference Captioning (IDC) is crucial for accurately describing variations between closely related images. Traditional IDC methods often rely on specialist models, which restrict their applicability across varied contexts. This paper introduces the OneDiff model, a novel generalist approach that utilizes a robust vision-language model architecture, integrating a siamese image encoder with a Visual Delta Module. This innovative configuration allows for the precise detection and articulation of fine-grained differences between image pairs. OneDiff is trained through a dual-phase strategy, encompassing Coupled Sample Training and multi-task learning across a diverse array of data types, supported by our newly developed DiffCap Dataset. This dataset merges real-world and synthetic data, enhancing the training process and bolstering the model's robustness. Extensive testing on diverse IDC benchmarks, such as Spot-the-Diff, CLEVR-Change, and Birds-to-Words, shows that OneDiff consistently outperforms existing state-of-the-art models in accuracy and adaptability, achieving improvements of up to 85% CIDEr points on average. By setting a new benchmark in IDC, OneDiff paves the way for more versatile and effective applications in detecting and describing visual differences. The code, models, and data will be made publicly available.

What problem does this paper attempt to address?

This paper addresses several key problems in the Image Difference Captioning (IDC) task. Existing IDC methods usually rely on specialized models. Although these models perform well in specific scenarios, they lack flexibility and generalization across datasets and tasks, which limits their use in practical applications.

#### Main problems include:

1. **Lack of generality**:
   - Most existing IDC methods focus on specific scenarios or tasks and are difficult to adapt to diverse applications. Fields such as medical imaging, environmental monitoring, and manufacturing quality control require a general-purpose model that can handle various subtle changes.
2. **Insufficient detail capture**:
   - Accurately describing the subtle differences between image pairs (such as changes in bird feathers or minor alterations in manufactured parts) is a key challenge in the IDC task. Existing methods do not capture these fine-grained differences well enough.
3. **Scarcity of training data**:
   - High-quality IDC-labeled data is expensive and time-consuming to produce, so training data is scarce. This is an obstacle to model training and performance improvement.
4. **Difficulty in cross-modal alignment**:
   - Establishing an effective alignment between the visual and linguistic modalities is crucial for accurately describing image differences. Existing methods are not yet mature in this regard.

To solve these problems, the paper proposes a new general-purpose model, OneDiff. The model addresses the above challenges in the following ways:

- **Introducing the Visual Delta Module**: It is used to capture the fine-grained differences between image pairs.
- **Adopting a two-stage training strategy**: It combines Coupled Sample Training and multi-task learning to strengthen the model's cross-modal alignment and generalization.
- **Constructing the DiffCap dataset**: It integrates real-world and synthetic data to provide rich training samples and overcome the data scarcity problem.

Through these innovations, OneDiff not only significantly outperforms existing methods on multiple IDC benchmarks but also demonstrates strong adaptability and efficiency across different tasks.
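
The siamese arrangement mentioned in the abstract (one shared encoder applied to both images, followed by a module that compares the two feature sets) can be illustrated with a toy sketch. This is not the OneDiff implementation: the summary does not specify how the Visual Delta Module works, so the sketch assumes a plain element-wise feature subtraction in its place. The names `encode`, `visual_delta`, and `WEIGHTS`, as well as the weight values and the four-element "images", are hypothetical.

```python
from typing import List

Vector = List[float]

# Hypothetical fixed weights standing in for a trained image encoder.
WEIGHTS: List[Vector] = [
    [0.5, -0.5, 0.0, 1.0],
    [1.0, 0.0, -1.0, 0.5],
]


def encode(image: Vector) -> Vector:
    """Toy linear encoder: one dot product per output feature."""
    return [sum(w * x for w, x in zip(row, image)) for row in WEIGHTS]


def visual_delta(before: Vector, after: Vector) -> Vector:
    """Assumed stand-in for a delta module: element-wise feature difference."""
    # Siamese property: both inputs go through the same encoder weights.
    feat_before = encode(before)
    feat_after = encode(after)
    return [a - b for a, b in zip(feat_after, feat_before)]


image_a = [1.0, 2.0, 3.0, 4.0]
image_b = [1.0, 2.0, 3.0, 6.0]  # only the last value differs from image_a

delta_same = visual_delta(image_a, image_a)
delta_changed = visual_delta(image_a, image_b)
print(delta_same)     # [0.0, 0.0]
print(delta_changed)  # [2.0, 1.0]
```

Because the weights are shared, identical inputs always yield a zero delta, so any non-zero entry reflects a change in the image pair rather than a mismatch between two encoders. In a vision-language model, a learned comparison module and a language decoder would take the place of the subtraction shown here.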

OneDiff: A Generalist Model for Image Difference Captioning

CLIP4IDC: CLIP for Image Difference Captioning

Describing Differences in Image Sets with Natural Language

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Exploring Discrete Diffusion Models for Image Captioning

Revisiting image captioning via maximum discrepancy competition

Improving Reference-based Distinctive Image Captioning with Contrastive Rewards

Context-aware Difference Distilling for Multi-change Captioning

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

CaptionNet: Automatic End-to-End Siamese Difference Captioning Model with Attention

Rethinking the Reference-based Distinctive Image Captioning

Deconfounded Image Captioning: A Causal Retrospect

Bidirectional difference locating and semantic consistency reasoning for change captioning

One Diffusion to Generate Them All

RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment

Group-Based Distinctive Image Captioning with Memory Difference Encoding and Attention

Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset

CamDiff: Camouflage Image Augmentation via Diffusion Model