Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

Xiyao Wang, Jiuhai Chen, Zhaoyang Wang, Yuhang Zhou, Yiyang Zhou, Huaxiu Yao, Tianyi Zhou, Tom Goldstein, Parminder Bhatia, Furong Huang, Cao Xiao
2024-06-08
Abstract: Large vision-language models (LVLMs) have achieved impressive results in various visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there is still significant room for improvement in the alignment between visual and language modalities. Previous methods to enhance this alignment typically require external models or data, heavily depending on their capabilities and quality, which inevitably sets an upper bound on performance. In this paper, we propose SIMA, a framework that enhances visual and language modality alignment through self-improvement, eliminating the need for external models or data. SIMA leverages prompts from existing vision instruction tuning datasets to self-generate responses and employs an in-context self-critic mechanism to select response pairs for preference tuning. The key innovation is the introduction of three vision metrics during the in-context self-critic process, which can guide the LVLM in selecting responses that enhance image comprehension. Through experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA not only improves model performance across all benchmarks but also achieves superior modality alignment, outperforming previous approaches.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the remaining gap in alignment between the vision and language modalities in large vision-language models (LVLMs). Existing methods improve this alignment with external models or data, so their results depend on the capability and quality of those external resources, which caps achievable performance. Large-scale visual instruction datasets are also expensive to create, which makes them hard to scale, and datasets distilled from third-party AI models can lead a fine-tuned LVLM to ignore image details and hallucinate.

To address these problems, the authors propose SIMA (Self-Improvement Modality Alignment), a framework that improves vision-language alignment in LVLMs through self-improvement. It has three stages:

1. **Self-generated responses**: The current LVLM generates responses to prompts taken from existing vision instruction tuning datasets.
2. **In-context self-critic**: The LVLM judges its own responses in context and selects response pairs for preference tuning. Three vision metrics guide it toward responses that reflect better image comprehension.
3. **Preference tuning**: The current LVLM is updated on the self-rewarded response pairs.

Across 14 hallucination and comprehensive benchmarks, the authors report that SIMA improves model performance on every benchmark and achieves better modality alignment than previous approaches. SIMA is reported to be particularly effective at reducing hallucinations and improving comprehension, especially for behavior hallucinations on multi-image inputs.

In conclusion, SIMA enhances the alignment of vision and language modalities in LVLMs through self-improvement, without relying on external models or data.
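
The three stages described above can be pictured as a small data-collection loop followed by a pairwise objective. The sketch below is illustrative only and is not the authors' implementation: `generate` and `critic` are hypothetical callables standing in for the LVLM's response generation and its in-context self-critic, `PreferencePair` and `build_preference_pairs` are names invented here, and the DPO-style loss is an assumption, since the summary says only "preference tuning" without naming an objective.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], Tuple[str, str]],  # hypothetical: the LVLM proposes two candidates
    critic: Callable[[str, str, str], int],      # hypothetical: the same LVLM returns 0 or 1
) -> List[PreferencePair]:
    """Stages 1 and 2: self-generate two responses per prompt, then self-select."""
    pairs = []
    for prompt in prompts:
        first, second = generate(prompt)
        if first == second:
            continue  # identical candidates carry no preference signal
        preferred = critic(prompt, first, second)
        chosen, rejected = (first, second) if preferred == 0 else (second, first)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Stage 3 (assumed objective): DPO-style loss for one pair, -log(sigmoid(margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # Toy stand-ins; a real setup would call the LVLM for both roles.
    toy_generate = lambda p: (p + " short", p + " a longer answer")
    toy_critic = lambda p, a, b: 1 if len(b) > len(a) else 0
    for pair in build_preference_pairs(["Describe the image."], toy_generate, toy_critic):
        print(pair)
```

In a real pipeline the critic prompt would carry the in-context examples and the three vision metrics, and the log-probabilities passed to the loss would come from the current LVLM and a frozen reference copy of it.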