Abstract: Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise due to the lack of effective visual grounding in current LVLMs. This issue extends to vision-language benchmarks, where it is difficult to make the image indispensable for accurate answer generation, particularly in vision question-answering tasks. In this work, we introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding and to evaluate their effectiveness in achieving it. These datasets can be utilized for both training and assessing an LVLM's ability to use image content as substantive evidence rather than relying solely on linguistic priors, providing insights into the model's reliance on visual information. To demonstrate the utility of our dataset, we introduce an innovative training task that outperforms baselines alongside a validation method and application for explainability. The code is available at https://github.com/IntelLabs/fivl.

### What problem does this paper attempt to solve?

This paper addresses the failure of Large Vision Language Models (LVLMs) to use visual information effectively in multimodal reasoning. Specifically, the authors point out that current LVLMs often rely on linguistic priors when generating answers and do not fully use the image content, which leads to "hallucination" (generated text that does not match the image). The problem is particularly evident in visual question-answering tasks, which require combining visual and textual information to obtain an accurate answer.

To address this, the authors propose FiVL (Framework for Improved Vision-Language Alignment), a method for constructing datasets that strengthen the visual grounding of LVLMs and evaluate how effectively they achieve it. FiVL does this in the following ways:

1. **Dataset augmentation**: Existing datasets are extended with key expressions and their corresponding segmentation masks, so that visual and textual information are closely aligned.
2. **New training task**: A task that jointly trains text and visual tokens is introduced to generate more accurate visually grounded masks.
3. **Evaluation framework**: A perturbation-based method measures how much the model depends on visual input, checking that the model actually relies on the image content when generating answers.
4. **Explainability applications**: The augmented datasets are used to improve model interpretability and to help understand how the model uses visual information.

Through these improvements, FiVL aims to improve the performance of LVLMs on multimodal tasks, especially those that require precise visual understanding.
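The perturbation-based evaluation mentioned above can be illustrated with a minimal sketch. It assumes the image is an `H x W x 3` NumPy array and the key region is given as a binary `H x W` segmentation mask; the function name `occlude_region` and the constant fill value are hypothetical choices for illustration and may differ from the perturbation actually used by FiVL.

```python
import numpy as np


def occlude_region(image: np.ndarray, mask: np.ndarray, fill: int = 0) -> np.ndarray:
    """Return a copy of `image` with every pixel under `mask` set to `fill`.

    Hypothetical helper: a constant fill is only one possible perturbation.
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("mask must match the image's height and width")
    perturbed = image.copy()  # leave the original image untouched
    perturbed[mask.astype(bool)] = fill  # boolean mask selects whole pixels
    return perturbed


# Toy 4x4 RGB image and a mask covering the top-left 2x2 block.
image = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True

perturbed = occlude_region(image, mask)
```

A model that truly grounds its answer in the masked region should answer less accurately on `perturbed` than on `image`; a model that answers equally well is likely leaning on linguistic priors.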
### Formulas involved

To quantify how strongly the model depends on visual information, the authors define the **Visual Reliance Score (VRS)**:

\[ \text{Visual Reliance Score}=\frac{\text{accuracy}_{\text{original}}-\text{accuracy}_{\text{perturb}}}{\text{accuracy}_{\text{original}}} \]

where:

- \(\text{accuracy}_{\text{original}}\) is the model's accuracy on the original images.
- \(\text{accuracy}_{\text{perturb}}\) is the model's accuracy on the perturbed images (for example, with key regions occluded).

The higher the score, the more the model depends on visual information; a lower score indicates that the model relies more on linguistic priors.
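As a worked example of the score, the sketch below computes it from two sets of predictions. The helper names `accuracy` and `visual_reliance_score` and the toy answers are assumptions made for illustration; exact-match accuracy is used here only for simplicity and is not necessarily the metric used in the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and equal in length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def visual_reliance_score(acc_original: float, acc_perturbed: float) -> float:
    """Relative accuracy drop when the key image regions are perturbed."""
    if acc_original <= 0:
        raise ValueError("accuracy on the original images must be positive")
    return (acc_original - acc_perturbed) / acc_original


# Toy data: 4 of 5 answers are correct on the original images,
# but only 1 of 5 once the key regions are occluded.
references = ["cat", "red", "two", "left", "dog"]
preds_original = ["cat", "red", "two", "left", "bird"]
preds_perturbed = ["cat", "blue", "one", "right", "bird"]

acc_orig = accuracy(preds_original, references)    # 4/5
acc_pert = accuracy(preds_perturbed, references)   # 1/5
score = visual_reliance_score(acc_orig, acc_pert)  # (0.8 - 0.2) / 0.8, about 0.75
```

With these toy numbers, accuracy falls from 0.8 to 0.2, so the score is about 0.75: most of the model's correct answers depended on the occluded regions. A score near 0 would mean the perturbation barely changed the answers.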

FiVL: A Framework for Improved Vision-Language Alignment

VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback

FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models

Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Towards Better Vision-Inspired Vision-Language Models

Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment

ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data

LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

Visually-Augmented Language Modeling