Abstract: Despite significant advancements in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. This capability is known to be a core step towards self-improving models in recent large language model studies. In this paper, we present Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher-quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves the VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.

What problem does this paper attempt to address?

The paper addresses how to improve the response quality of vision-language models (VLMs) at inference time. In particular, it aims to spend additional inference-time computation to reduce hallucination in generated descriptions (content that does not match what is actually in the image) and to provide richer visual detail. Although VLMs have made remarkable progress on multimodal tasks, they still hallucinate visual content and overlook inconspicuous image regions, which limits their real-world applications.

To address this, the paper proposes the Vision Value Model (VisVM), which guides the VLM's search process at inference time so that it generates responses with better visual understanding. VisVM not only evaluates the quality of the sentence generated in the current search step, but also predicts the long-term value of the sentences likely to follow it. This lets the search avoid response candidates that carry a high risk of future hallucination, and ultimately yields higher-quality image descriptions.

The paper verifies the effectiveness of VisVM through two main experiments:

1. Using VisVM as the guiding signal for VLM inference-time search when generating descriptive image captions. Hallucination is significantly reduced and the image descriptions are more detailed.

2. Using the descriptive captions generated under VisVM guidance as supervised fine-tuning (SFT) data to self-train the original VLM. Across eight standard benchmarks, VisVM-guided self-training improves performance by an average of 10.8%.

These contributions not only demonstrate the effectiveness of VisVM in improving the visual understanding of VLMs, but also provide a self-training pipeline that can further enhance VLM performance.
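
The difference between scoring only the current sentence and scoring its long-term value can be illustrated with a small toy search. This is a minimal sketch under stated assumptions, not the authors' implementation: the candidate sentences, the per-sentence scores, the discount factor `GAMMA`, and the helper names `propose`, `step_score`, `long_term_value`, and `guided_search` are all hypothetical. In particular, the long-term value here is computed by exhaustive lookahead over a tiny hand-written tree, whereas the summary describes VisVM as a model that predicts this value.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy data: candidate next sentences for each partial response.
CANDIDATES: Dict[Tuple[str, ...], List[str]] = {
    (): ["A dog sits on the grass.", "A dog lies next to a ball."],
    ("A dog sits on the grass.",): ["A cat watches from the fence."],
    ("A dog lies next to a ball.",): ["The ball is red and slightly deflated."],
}

# Hypothetical per-sentence quality scores (higher = better grounded in the image).
STEP_SCORE: Dict[str, float] = {
    "A dog sits on the grass.": 0.9,
    "A dog lies next to a ball.": 0.8,
    "A cat watches from the fence.": 0.2,  # stands in for a hallucinated sentence
    "The ball is red and slightly deflated.": 0.9,
}

GAMMA = 0.9  # assumed discount factor applied to future sentences


def propose(prefix: List[str]) -> List[str]:
    """Return the candidate next sentences for a partial response."""
    return CANDIDATES.get(tuple(prefix), [])


def step_score(prefix: List[str], sentence: str) -> float:
    """Score only the current sentence, ignoring what it leads to."""
    return STEP_SCORE[sentence]


def long_term_value(prefix: List[str], sentence: str) -> float:
    """Current score plus the discounted best score reachable afterwards.

    Computed by exhaustive lookahead on the toy tree; a learned value
    model would estimate this quantity instead of enumerating futures.
    """
    extended = prefix + [sentence]
    futures = [long_term_value(extended, s) for s in propose(extended)]
    return STEP_SCORE[sentence] + (GAMMA * max(futures) if futures else 0.0)


def guided_search(
    scorer: Callable[[List[str], str], float], max_sentences: int = 4
) -> List[str]:
    """Build a response sentence by sentence, keeping the top-scored candidate."""
    response: List[str] = []
    for _ in range(max_sentences):
        candidates = propose(response)
        if not candidates:
            break
        response.append(max(candidates, key=lambda s: scorer(response, s)))
    return response


if __name__ == "__main__":
    print("step score only:", guided_search(step_score))
    print("long-term value:", guided_search(long_term_value))
```

In this toy setup, the search guided by `step_score` picks the opening sentence with the highest immediate score and is then forced into the low-scoring continuation, while the search guided by `long_term_value` accepts a slightly weaker opening sentence because it leads to a better-grounded follow-up.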

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

Calibrated Self-Rewarding Vision Language Models

Towards Better Vision-Inspired Vision-Language Models

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Visually-Augmented Language Modeling

EVLM: An Efficient Vision-Language Model for Visual Understanding

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Vision-Language Models for Vision Tasks: A Survey

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

HumanVLM: Foundation for Human-Scene Vision-Language Model

VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models

Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization

Vision Language Models are In-Context Value Learners