OneDiff: A Generalist Model for Image Difference Captioning

Erdong Hu, Longteng Guo, Tongtian Yue, Zijia Zhao, Shuning Xue, Jing Liu
2024-07-16
Abstract: In computer vision, Image Difference Captioning (IDC) is crucial for accurately describing variations between closely related images. Traditional IDC methods often rely on specialist models, which restrict their applicability across varied contexts. This paper introduces the OneDiff model, a novel generalist approach that utilizes a robust vision-language model architecture, integrating a siamese image encoder with a Visual Delta Module. This innovative configuration allows for the precise detection and articulation of fine-grained differences between image pairs. OneDiff is trained through a dual-phase strategy, encompassing Coupled Sample Training and multi-task learning across a diverse array of data types, supported by our newly developed DiffCap Dataset. This dataset merges real-world and synthetic data, enhancing the training process and bolstering the model's robustness. Extensive testing on diverse IDC benchmarks, such as Spot-the-Diff, CLEVR-Change, and Birds-to-Words, shows that OneDiff consistently outperforms existing state-of-the-art models in accuracy and adaptability, achieving improvements of up to 85% CIDEr points on average. By setting a new benchmark in IDC, OneDiff paves the way for more versatile and effective applications in detecting and describing visual differences. The code, models, and data will be made publicly available.
Computer Vision and Pattern Recognition, Multimedia
### What problem does this paper attempt to address?

This paper targets several key problems in the Image Difference Captioning (IDC) task. Existing IDC methods usually rely on specialist models. These models perform well in specific scenarios but lack flexibility and generalization across datasets and tasks, which limits their use in practical applications.

#### Main problems

1. **Lack of generality**: Most existing IDC methods focus on specific scenarios or tasks and are difficult to adapt to diverse applications. Fields such as medical imaging, environmental monitoring, and manufacturing quality control require a general-purpose model that can handle a variety of subtle changes.
2. **Insufficient detail capture**: Accurately describing subtle differences between image pairs (such as changes in bird feathers or minor alterations in manufactured parts) is a key challenge in IDC. Existing methods do not capture these fine-grained differences well.
3. **Scarcity of training data**: High-quality IDC annotations are expensive and time-consuming to produce, so training data is scarce. This is an obstacle to model training and performance improvement.
4. **Difficulty in cross-modal alignment**: Establishing an effective alignment between the visual and linguistic modalities is crucial for accurately describing image differences. Existing methods are not yet mature in this regard.

To address these problems, the paper proposes a general-purpose model, OneDiff, which responds to the challenges above as follows:

- **Introducing the Visual Delta Module**: captures the fine-grained differences between image pairs.
- **Adopting a two-stage training strategy**: combines Coupled Sample Training and multi-task learning to strengthen the model's cross-modal alignment and generalization.
- **Constructing the DiffCap dataset**: integrates real-world and synthetic data to provide rich training samples and overcome the data scarcity problem.

Through these innovations, OneDiff significantly outperforms existing methods on multiple IDC benchmarks and demonstrates strong adaptability and efficiency across different tasks.
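
The idea of a siamese image encoder feeding a difference module can be illustrated with a toy sketch. This is not the OneDiff implementation: the summary does not say how the Visual Delta Module computes its output, so the sketch assumes plain element-wise subtraction of the two feature vectors as a stand-in. The names `make_encoder` and `delta_features`, the linear encoder, and the tiny 4-value "images" are all hypothetical and exist only for illustration.

```python
import random


def make_encoder(in_dim, out_dim, seed=0):
    """Return a toy linear encoder with one fixed weight matrix (illustrative only)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def encode(pixels):
        # The same weights are used for every input; this weight sharing
        # is what makes the two branches "siamese".
        return [sum(w * x for w, x in zip(row, pixels)) for row in weights]

    return encode


def delta_features(encode, image_a, image_b):
    """Assumed stand-in for a delta module: element-wise feature difference."""
    feats_a = encode(image_a)
    feats_b = encode(image_b)
    return [b - a for a, b in zip(feats_a, feats_b)]


encode = make_encoder(in_dim=4, out_dim=3)
before = [0.1, 0.5, 0.9, 0.3]
after_same = list(before)             # identical image
after_changed = [0.1, 0.5, 0.2, 0.3]  # one "pixel" changed

delta_same = delta_features(encode, before, after_same)
delta_changed = delta_features(encode, before, after_changed)
print(delta_same)
print(delta_changed)
```

Because both images pass through the same weights, identical inputs yield a zero difference vector, while a changed input yields non-zero entries. In a real vision-language model, a learned module would take the place of the subtraction and its output would be passed to the language model together with the image features.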