Comparison Visual Instruction Tuning

Wei Lin, Muhammad Jehanzeb Mirza, Sivan Doveh, Rogerio Feris, Raja Giryes, Sepp Hochreiter, Leonid Karlinsky

2024-06-13

Abstract: Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually relevant descriptions, performing comparative analysis, novelty detection, and making informed decisions based on visual data. However, surprisingly, little attention has been given to these fundamental concepts in the best current mimic of human visual intelligence - Large Multimodal Models (LMMs). We develop and contribute a new two-phase approach CaD-VI for collecting synthetic visual instructions, together with an instruction-following dataset CaD-Inst containing 349K image pairs with CaD instructions collected using CaD-VI. Our approach significantly improves the CaD spotting capabilities in LMMs, advancing the SOTA on a diverse set of related tasks by up to 17.5%. It is also complementary to existing difference-only instruction datasets, allowing automatic targeted refinement of those resources increasing their effectiveness for CaD tuning by up to 10%. Additionally, we propose an evaluation benchmark with 7.5K open-ended QAs to assess the CaD understanding abilities of LMMs.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper aims to address the inadequacy of large multimodal models (LMMs) in understanding and generating commonalities and differences (CaD) between two images. Specifically, the paper proposes a novel two-stage method (CaD-VI) for collecting synthetic visual instructions and constructs a dataset containing 349K image pairs and their CaD instructions (CaD-Inst). Through this method, the paper significantly enhances the performance of LMMs on CaD tasks, achieving state-of-the-art levels on multiple related tasks.

### Main Contributions

1. **CaD-Inst Dataset**: Constructed a large-scale visual instruction tuning dataset to enhance the CaD reasoning capabilities of LMMs.
2. **CaD-QA Benchmark**: Proposed an open-ended question evaluation benchmark to assess the CaD understanding capabilities of LMMs.
3. **CaD-VI Method**: Proposed a two-stage method for collecting and enhancing CaD instruction tuning data.
4. **Performance Improvement**: Demonstrated significant improvements (up to 17.5%) on multiple existing closed-ended question evaluation benchmarks and strong relative improvements (over 20%) on open-ended question evaluation benchmarks using LMMs trained with CaD-Inst.

### Method Overview

- **Stage 1**: Utilize a language model (LLM) to generate CaD summaries of image pairs, forming an initial CaD instruction dataset (CaD-Inst V1).
- **Stage 2**: Use the model trained in the first stage (CaD-LLaVA V1) to generate more CaD instruction data (CaD-Inst V2), and combine the data from both stages to train the final model (CaD-LLaVA V2).

### Conclusion

Through this method, the paper not only significantly enhances the performance of LMMs on CaD tasks but also provides an automatic refinement method for existing difference instruction datasets, thereby improving the effectiveness of these datasets.
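The data flow of the two stages in the Method Overview can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: `CaDInstruction`, `collect`, `two_stage_pipeline`, `toy_llm`, and `toy_train` are hypothetical names, images are represented by file-name strings, and the "LLM" and "training" steps are toy stand-ins that only show which data feeds which model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

ImagePair = Tuple[str, str]
# Hypothetical interface: anything that maps an image pair to a CaD summary.
Summarizer = Callable[[ImagePair], str]


@dataclass(frozen=True)
class CaDInstruction:
    pair: ImagePair
    summary: str
    stage: int  # 1 = produced by the LLM, 2 = produced by the stage-1 model


def collect(pairs: List[ImagePair], summarize: Summarizer, stage: int) -> List[CaDInstruction]:
    """Build one CaD instruction per image pair using the given summarizer."""
    return [CaDInstruction(pair, summarize(pair), stage) for pair in pairs]


def two_stage_pipeline(
    stage1_pairs: List[ImagePair],
    stage2_pairs: List[ImagePair],
    llm: Summarizer,
    train: Callable[[List[CaDInstruction]], Summarizer],
) -> Tuple[List[CaDInstruction], Summarizer]:
    # Stage 1: an LLM writes CaD summaries; train a first model on them.
    inst_v1 = collect(stage1_pairs, llm, stage=1)
    model_v1 = train(inst_v1)
    # Stage 2: the stage-1 model writes summaries for additional pairs.
    inst_v2 = collect(stage2_pairs, model_v1, stage=2)
    # The final model is trained on the data from both stages combined.
    combined = inst_v1 + inst_v2
    model_v2 = train(combined)
    return combined, model_v2


# Toy stand-ins (assumptions for illustration only).
def toy_llm(pair: ImagePair) -> str:
    return f"LLM summary of {pair[0]} vs {pair[1]}"


def toy_train(data: List[CaDInstruction]) -> Summarizer:
    n = len(data)

    def model(pair: ImagePair) -> str:
        return f"model(trained on {n}) summary of {pair[0]} vs {pair[1]}"

    return model


combined, final_model = two_stage_pipeline(
    [("a.jpg", "b.jpg")],
    [("c.jpg", "d.jpg"), ("e.jpg", "f.jpg")],
    toy_llm,
    toy_train,
)
print(len(combined))
print(final_model(("x.jpg", "y.jpg")))
```

In this sketch the second-stage instructions are generated by a model that has only seen first-stage data, while the final model sees the union of both sets, which mirrors the ordering described in the summary.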

What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

Towards Open-ended Visual Quality Comparison

Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

Aligning Large Multi-Modal Model with Robust Instruction Tuning

CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

Vision-Language Instruction Tuning: A Review and Analysis

SVIT: Scaling up Visual Instruction Tuning

Instruction Makes a Difference

Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Video Instruction Tuning With Synthetic Data

Compositional Image Retrieval via Instruction-Aware Contrastive Learning

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning