Debiasing Multimodal Large Language Models

Yi-Fan Zhang,Weichen Yu,Qingsong Wen,Xue Wang,Zhang Zhang,Liang Wang,Rong Jin,Tieniu Tan
2024-03-27
Abstract: In the realms of computer vision and natural language processing, Large Vision-Language Models (LVLMs) have become indispensable tools, proficient in generating textual descriptions based on visual inputs. Despite their advancements, our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Models (LLMs) rather than the input image. Our empirical experiments underscore the persistence of this bias, as LVLMs often provide confident answers even in the absence of relevant images or given incongruent visual input. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies. Firstly, for tasks such as classification or multi-choice question-answering (QA), we propose a "calibration" step through affine transformation to adjust the output distribution. This "Post-Hoc debias" approach ensures uniform scores for each answer when the image is absent, serving as an effective regularization technique to alleviate the influence of LLM priors. For more intricate open-ended generation tasks, we extend this method to "Debias sampling", drawing inspiration from contrastive decoding methods. Furthermore, our investigation sheds light on the instability of LVLMs across various decoding configurations. Through systematic exploration of different settings, we significantly enhance performance, surpassing reported results and raising concerns about the fairness of existing evaluations. Comprehensive experiments substantiate the effectiveness of our proposed strategies in mitigating biases. These strategies not only prove beneficial in minimizing hallucinations but also contribute to the generation of more helpful and precise illustrations.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a significant bias in the content generated by Large Vision-Language Models (LVLMs). Specifically, the authors find that when generating text, these models rely more on the prior of the underlying Large Language Model (LLM) than on the input image. Even when no relevant image is provided, or when the visual input is incongruent, LVLMs still produce confident answers, which indicates that the output is driven by learned language patterns rather than by visual evidence. As a result, the generated content can be inconsistent with the actual input image, which undermines the reliability and applicability of the model, especially in scenarios that require accurate image description.

To correct this bias, the authors propose two training-free strategies:

1. **Calibration** (Post-Hoc debias): For classification and multiple-choice QA, adjust the output distribution through an affine transformation so that each answer receives a uniform score when the image is absent. This acts as a regularizer that reduces the influence of the LLM prior.
2. **Debias Sampling**: For more complex open-ended generation tasks, extend the calibration idea, drawing on contrastive decoding. The method computes the difference between the log-probabilities of generated tokens given the correct image and given meaningless visual input, which reduces the dependence of the output on text-only or meaningless image input.

In addition, the paper explores how decoding configurations affect LVLM performance. It points out that existing evaluations usually rely on default decoding settings, which limits how fully model capabilities are explored. By systematically searching over decoding configurations, the authors find that LVLM performance can be significantly improved, exceeding previously reported results. This underscores the importance of choosing a good decoding configuration and raises concerns about the fairness of existing evaluations.
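
The calibration and debias sampling strategies summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `posthoc_debias` and `debias_logits`, the choice of affine map (a diagonal rescaling by the inverse image-free probabilities with zero offset), and the contrast weight `alpha` are all assumptions made for illustration. The toy probabilities and logits are made-up numbers, not results from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def posthoc_debias(p_with_image, p_without_image, eps=1e-12):
    """Calibrate answer probabilities so the image-free input maps to uniform.

    Assumed affine map: q = W p + b with W = diag(1 / p_without_image), b = 0,
    followed by renormalization so the scores sum to one.
    """
    scaled = [p / max(q, eps) for p, q in zip(p_with_image, p_without_image)]
    total = sum(scaled)
    return [s / total for s in scaled]


def debias_logits(logits_image, logits_null, alpha=1.0):
    """Contrast next-token log-probabilities with and without a meaningful image.

    Returns (1 + alpha) * log p(y | image, text) - alpha * log p(y | null, text).
    `alpha` is a hypothetical hyperparameter controlling the contrast strength.
    """
    logp_image = log_softmax(logits_image)
    logp_null = log_softmax(logits_null)
    return [(1 + alpha) * a - alpha * b for a, b in zip(logp_image, logp_null)]


# Toy multiple-choice example: the language prior alone favors answer 0.
p_without_image = [0.6, 0.3, 0.1]
p_with_image = [0.5, 0.4, 0.1]
calibrated = posthoc_debias(p_with_image, p_without_image)
print("calibrated answer index:", calibrated.index(max(calibrated)))

# Toy generation step: token 1 gains the most when the real image is present.
scores = debias_logits([2.0, 1.5, 0.0], [2.0, 0.5, 0.0], alpha=1.0)
print("debiased token index:", scores.index(max(scores)))
```

In both toy cases the uncorrected model would pick index 0, which the language prior favors, while the corrected scores pick index 1, the option whose probability rises most once the image is taken into account. In practice the image-free (or meaningless-image) scores would come from a second forward pass of the same model.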