Abstract: In the realms of computer vision and natural language processing, Large Vision-Language Models (LVLMs) have become indispensable tools, proficient in generating textual descriptions based on visual inputs. Despite their advancements, our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image. Our empirical experiments underscore the persistence of this bias, as LVLMs often provide confident answers even in the absence of relevant images or given incongruent visual input. To rectify these biases and redirect the model's focus toward visual information, we introduce two simple, training-free strategies. Firstly, for tasks such as classification or multi-choice question-answering (QA), we propose a "calibration" step through affine transformation to adjust the output distribution. This "Post-Hoc debias" approach ensures uniform scores for each answer when the image is absent, serving as an effective regularization technique to alleviate the influence of LLM priors. For more intricate open-ended generation tasks, we extend this method to "Debias sampling", drawing inspiration from contrastive decoding methods. Furthermore, our investigation sheds light on the instability of LVLMs across various decoding configurations. Through systematic exploration of different settings, we significantly enhance performance, surpassing reported results and raising concerns about the fairness of existing evaluations. Comprehensive experiments substantiate the effectiveness of our proposed strategies in mitigating biases. These strategies not only prove beneficial in minimizing hallucinations but also contribute to the generation of more helpful and precise illustrations.

What problem does this paper attempt to address?

The paper addresses a significant bias in the content generated by Large Vision-Language Models (LVLMs). Specifically, the authors find that when generating text, these models rely more on the prior of the underlying Large Language Model (LLM) than on the input image. Even when no relevant image is provided, or when the visual input is incongruent, LVLMs still answer confidently, which indicates that the output is driven by learned language patterns rather than by visual evidence. As a result, the generated content can be inconsistent with the actual input image, which undermines the model's reliability and applicability, especially in scenarios that require accurate image description.

To correct this bias, the authors propose two training-free strategies:

1. **Calibration**: For tasks such as classification or multi-choice QA, adjust the output distribution through an affine transformation so that, when no image is provided, every answer receives the same probability. This acts as a regularizer that reduces the influence of the LLM prior.
2. **Debias Sampling**: For more complex open-ended generation tasks, the calibration idea is extended along the lines of contrastive decoding. The method computes the difference between the log-probabilities of candidate tokens given the correct image and given meaningless visual input, so that the generated text depends less on what the model would produce from text alone or from a meaningless image.

In addition, the paper examines how decoding configurations affect LVLM performance. It points out that existing evaluations usually rely on default decoding settings, which limits how fully a model's capabilities are explored. By systematically searching over decoding configurations, the authors find that LVLM performance can be improved significantly, exceeding previously reported results, which underlines the importance of choosing the decoding configuration carefully.
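
The two strategies described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the diagonal form of the affine transformation (weights equal to the inverse of the image-free probabilities, zero bias) is one common choice assumed here, and the contrast weight `alpha` is an assumed hyperparameter borrowed from contrastive decoding. The inputs stand in for model outputs that a real LVLM would produce with and without a meaningful image.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def post_hoc_debias(p_image: List[float], p_null: List[float]) -> List[float]:
    """Calibrate answer probabilities against the image-free prior.

    Assumed affine form: W = diag(1 / p_null), b = 0, followed by
    renormalization. With this choice, feeding p_null itself yields
    a uniform distribution over the answers.
    """
    scaled = [p / q for p, q in zip(p_image, p_null)]
    total = sum(scaled)
    return [s / total for s in scaled]


def debias_scores(
    logits_image: List[float],
    logits_null: List[float],
    alpha: float = 1.0,  # assumed contrast weight, not taken from the paper
) -> List[float]:
    """Contrast next-token log-probabilities with and without a real image.

    Tokens that are likely even without visual evidence are penalized;
    the result is renormalized so it can be sampled from directly.
    """
    lp_image = log_softmax(logits_image)
    lp_null = log_softmax(logits_null)
    contrast = [(1 + alpha) * a - alpha * b for a, b in zip(lp_image, lp_null)]
    return log_softmax(contrast)


# Hypothetical three-answer example: the language prior favors answer 0.
p_null = [0.7, 0.2, 0.1]   # probabilities with no (or meaningless) image
p_image = [0.6, 0.3, 0.1]  # probabilities with the real image
calibrated = post_hoc_debias(p_image, p_null)
print(calibrated)
```

In this toy example the raw prediction is the first answer, but after dividing out the image-free prior the second answer scores highest, because the image raised its probability relative to what the language prior alone would give. `debias_scores` applies the same idea per decoding step, returning log-probabilities from which the next token can be chosen greedily or sampled.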

Debiasing Multimodal Large Language Models
