Abstract: Large vision-language models (LVLMs) have achieved impressive results in various visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there is still significant room for improvement in the alignment between visual and language modalities. Previous methods to enhance this alignment typically require external models or data, heavily depending on their capabilities and quality, which inevitably sets an upper bound on performance. In this paper, we propose SIMA, a framework that enhances visual and language modality alignment through self-improvement, eliminating the need for external models or data. SIMA leverages prompts from existing vision instruction tuning datasets to self-generate responses and employs an in-context self-critic mechanism to select response pairs for preference tuning. The key innovation is the introduction of three vision metrics during the in-context self-critic process, which can guide the LVLM in selecting responses that enhance image comprehension. Through experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA not only improves model performance across all benchmarks but also achieves superior modality alignment, outperforming previous approaches.

What problem does this paper attempt to address?

This paper addresses the remaining gap in alignment between the vision and language modalities of large vision-language models (LVLMs). Existing methods improve this alignment with external models or data, so they depend on the capability and quality of those external sources, which caps the performance they can reach. Two further problems compound this: large-scale visual instruction datasets are expensive to create, which makes them hard to expand, and the datasets distilled from third-party AI models that are used to fine-tune LVLMs may cause the models to ignore image details and hallucinate.

To solve these problems, the authors propose SIMA (Self-Improvement Modality Alignment), a framework that improves vision-language alignment in LVLMs through a self-improvement mechanism. SIMA has three stages:

1. **Self-generated responses**: The model generates responses to prompts taken from existing vision instruction tuning datasets.
2. **In-context self-critic**: An in-context self-critic mechanism selects response pairs for preference tuning. At this stage, three vision metrics guide the LVLM toward responses that improve image comprehension.
3. **Preference tuning**: The current LVLM is updated using the self-rewarded response pairs.

Through experiments, the authors show that SIMA improves model performance on 14 hallucination and comprehensive benchmarks and significantly improves modality alignment, outperforming previous methods. SIMA is particularly effective at reducing hallucinations and improving comprehension, especially for behavioral hallucinations with multi-image inputs.
In conclusion, SIMA provides a method that does not rely on external models or data and effectively enhances the alignment of vision and language modalities in LVLMs through a self-improvement mechanism.
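The three stages summarized above can be sketched as a small data-flow skeleton. This is a minimal illustration and not the authors' implementation: the names `Sample`, `PreferencePair`, `build_preference_pairs`, and `self_improve_round` are hypothetical, the choice of two candidates per prompt is an assumption, and `toy_generate` and `toy_critic` are trivial stand-ins for the LVLM itself. In a real setup the critic would be the same LVLM prompted in context with the three vision metrics, which the summary does not name, and `preference_tune` would be whatever preference-optimization routine is used.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Sample:
    """One prompt from a vision instruction tuning dataset (hypothetical layout)."""
    image_id: str
    question: str


@dataclass
class PreferencePair:
    sample: Sample
    chosen: str
    rejected: str


def build_preference_pairs(
    samples: Sequence[Sample],
    generate: Callable[[Sample], List[str]],
    critic: Callable[[Sample, str, str], int],
) -> List[PreferencePair]:
    """Stages 1 and 2: self-generate candidates, then let the critic rank them.

    `critic` returns 0 if the first response is better and 1 if the second is.
    Any other return value is treated as "no decision" and the sample is skipped.
    """
    pairs: List[PreferencePair] = []
    for sample in samples:
        candidates = generate(sample)
        if len(candidates) < 2 or candidates[0] == candidates[1]:
            continue  # nothing meaningful to compare
        first, second = candidates[0], candidates[1]
        winner = critic(sample, first, second)
        if winner == 0:
            pairs.append(PreferencePair(sample, chosen=first, rejected=second))
        elif winner == 1:
            pairs.append(PreferencePair(sample, chosen=second, rejected=first))
    return pairs


def self_improve_round(
    samples: Sequence[Sample],
    generate: Callable[[Sample], List[str]],
    critic: Callable[[Sample, str, str], int],
    preference_tune: Callable[[List[PreferencePair]], None],
) -> int:
    """Stage 3: pass the self-rewarded pairs to a preference-tuning routine."""
    pairs = build_preference_pairs(samples, generate, critic)
    preference_tune(pairs)
    return len(pairs)


# Toy stand-ins so the sketch runs without a model (assumptions, not SIMA's logic).
def toy_generate(sample: Sample) -> List[str]:
    return ["A cup.", "A cup and a red book on a wooden table."]


def toy_critic(sample: Sample, first: str, second: str) -> int:
    # Placeholder rule: prefer the more detailed (longer) response.
    return 0 if len(first) >= len(second) else 1


if __name__ == "__main__":
    data = [Sample("img_0.jpg", "What is on the table?")]
    collected: List[PreferencePair] = []
    self_improve_round(data, toy_generate, toy_critic, collected.extend)
    print(collected[0].chosen)
```

Passing `generate`, `critic`, and `preference_tune` as callables keeps the sketch independent of any particular model or training library. The point it illustrates is that all three roles can be filled by the same LVLM, which is what removes the dependence on external models or data.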

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

Calibrated Self-Rewarding Vision Language Models

Tackling Vision Language Tasks Through Learning Inner Monologues

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment

Improving Visual Storytelling with Multimodal Large Language Models

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach

Visually-Augmented Language Modeling

Contrastive Vision-Language Alignment Makes Efficient Instruction Learner

X-VILA: Cross-Modality Alignment for Large Language Model

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

Towards Multimodal In-Context Learning for Vision & Language Models

SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models

VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

VIGC: Visual Instruction Generation and Correction

Self-Supervised Visual Preference Alignment