FiVL: A Framework for Improved Vision-Language Alignment

Estelle Aflalo, Gabriela Ben Melech Stan, Tiep Le, Man Luo, Shachar Rosenman, Sayak Paul, Shao-Yen Tseng, Vasudev Lal
2024-12-19
Abstract: Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise due to the lack of effective visual grounding in current LVLMs. This issue extends to vision-language benchmarks, where it is difficult to make the image indispensable for accurate answer generation, particularly in vision question-answering tasks. In this work, we introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding and to evaluate their effectiveness in achieving it. These datasets can be utilized for both training and assessing an LVLM's ability to use image content as substantive evidence rather than relying solely on linguistic priors, providing insights into the model's reliance on visual information. To demonstrate the utility of our dataset, we introduce an innovative training task that outperforms baselines alongside a validation method and application for explainability. The code is available at https://github.com/IntelLabs/fivl.
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses the failure of Large Vision Language Models (LVLMs) to use visual information effectively in multimodal reasoning. Specifically, the authors point out that current LVLMs often rely on language priors when generating answers and do not fully use the image content, which leads to "hallucination" (generated text that does not match the image). The problem is particularly evident in visual question-answering tasks, which require combining visual and textual information to reach an accurate answer.

To address this, the authors propose FiVL (Framework for Improved Vision-Language Alignment), a method for constructing datasets that strengthen the visual grounding of LVLMs and evaluate how well they achieve it. FiVL pursues this goal in four ways:

1. **Dataset augmentation**: Existing datasets are extended with key expressions and their corresponding segmentation masks, so that visual and textual information are closely aligned.
2. **New training task**: A task that jointly trains text and visual tokens is introduced to generate more accurate, visually grounded masks.
3. **Evaluation framework**: A perturbation-based method measures how much the model depends on visual input, checking that the model actually relies on the image content when generating answers.
4. **Explainability applications**: The augmented dataset is used to improve the interpretability of the model and to help understand how it uses visual information.

Through these improvements, the FiVL framework aims to improve the performance of LVLMs on multimodal tasks, especially those that require precise visual understanding.
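The perturbation idea in item 3 can be illustrated with a minimal sketch. It assumes a toy image stored as a grid of grayscale values and a binary segmentation mask of the same shape; the function name `occlude_key_region` and the zero fill value are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List

Image = List[List[int]]  # toy grayscale pixel grid (assumed representation)
Mask = List[List[int]]   # 1 marks a pixel inside a key expression's region


def occlude_key_region(image: Image, mask: Mask, fill: int = 0) -> Image:
    """Return a copy of `image` with every masked pixel replaced by `fill`."""
    same_rows = len(image) == len(mask)
    if not same_rows or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same shape")
    return [
        [fill if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


image = [[10, 20, 30],
         [40, 50, 60]]
mask = [[0, 1, 1],
        [0, 0, 1]]

# Pixels under the mask are blanked out; the original image is left untouched.
perturbed = occlude_key_region(image, mask)
```

A model that truly relies on the image should answer less accurately on `perturbed` than on `image` when the question concerns the occluded region.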
### Formulas involved

To quantify how strongly the model depends on visual information, the authors define the **Visual Reliance Score (VRS)**:

\[
\text{Visual Reliance Score} = \frac{\text{accuracy}_{\text{original}} - \text{accuracy}_{\text{perturb}}}{\text{accuracy}_{\text{original}}}
\]

where:

- \(\text{accuracy}_{\text{original}}\) is the accuracy of the model on the original images.
- \(\text{accuracy}_{\text{perturb}}\) is the accuracy of the model on the perturbed images (for example, with key areas occluded).

A higher score indicates stronger dependence on visual information; a lower score indicates that the model relies more on language priors.
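The score can be computed directly from two accuracy measurements. The sketch below is illustrative only: the helper names `accuracy` and `visual_reliance_score` and the toy answers are assumptions, not taken from the FiVL code base.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty answer lists")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def visual_reliance_score(acc_original: float, acc_perturb: float) -> float:
    """(acc_original - acc_perturb) / acc_original, as in the formula above."""
    if acc_original <= 0:
        raise ValueError("accuracy on original images must be positive")
    return (acc_original - acc_perturb) / acc_original


# Toy data (hypothetical): 5 questions, answered with and without occlusion.
references = ["red", "two", "dog", "left", "yes"]
preds_original = ["red", "two", "dog", "left", "no"]    # 4 of 5 correct
preds_perturbed = ["red", "one", "cat", "left", "no"]   # 2 of 5 correct

acc_orig = accuracy(preds_original, references)
acc_pert = accuracy(preds_perturbed, references)
vrs = visual_reliance_score(acc_orig, acc_pert)
```

With these toy numbers the accuracy falls from 0.8 to 0.4, so the score is about 0.5: half of the original accuracy disappears once the key regions are hidden.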