Mitigating Multilingual Hallucination in Large Vision-Language Models

Xiaoye Qu, Mingyang Song, Wei Wei, Jianfeng Dong, Yu Cheng
2024-08-01
Abstract: While Large Vision-Language Models (LVLMs) have exhibited remarkable capabilities across a wide range of tasks, they suffer from hallucination problems, where models generate plausible yet incorrect answers given the input image-query pair. This hallucination phenomenon is even more severe when querying the image in non-English languages, while existing methods for mitigating hallucinations in LVLMs only consider the English scenarios. In this paper, we make the first attempt to mitigate this important multilingual hallucination in LVLMs. With thorough experiment analysis, we found that multilingual hallucination in LVLMs is a systemic problem that could arise from deficiencies in multilingual capabilities or inadequate multimodal abilities. To this end, we propose a two-stage Multilingual Hallucination Removal (MHR) framework for LVLMs, aiming to improve resistance to hallucination for both high-resource and low-resource languages. Instead of relying on the intricate manual annotations of multilingual resources, we fully leverage the inherent capabilities of the LVLM and propose a novel cross-lingual alignment method, which generates multiple responses for each image-query input and then identifies the hallucination-aware pairs for each language. These data pairs are finally used for direct preference optimization to prompt the LVLMs to favor non-hallucinating responses. Experimental results show that our MHR achieves a substantial reduction in hallucination generation for LVLMs. Notably, on our extended multilingual POPE benchmark, our framework delivers an average increase of 19.0% in accuracy across 13 different languages. Our code and model weights are available at https://github.com/ssmisya/MHR
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper addresses hallucination in Large Vision-Language Models (LVLMs) in multilingual settings. Although existing LVLMs perform well on a variety of tasks, they tend to generate plausible but incorrect answers for a given image-query pair, a phenomenon known as "hallucination." The issue is more severe for queries in non-English languages, yet existing mitigation methods consider only English.

To tackle this challenge, the authors propose a two-stage Multilingual Hallucination Removal (MHR) framework designed to improve the hallucination resistance of LVLMs in both high-resource and low-resource languages. Their experimental analysis finds that multilingual hallucination may arise from insufficient multilingual capability or insufficient multimodal capability.

The MHR framework has two stages:

1. **Stage 1: Enhance Multilingual Instruction Following Capability**
   - Improve the model's understanding of instructions in different languages through multilingual supervised fine-tuning. This step is crucial for the model's query comprehension across languages.
2. **Stage 2: Enhance Hallucination Resistance**
   - Propose a novel cross-lingual alignment method that leverages the intrinsic capabilities of LVLMs to generate multiple responses and identify hallucination-aware pairs based on cross-lingual alignment metrics.
   - Use these data pairs for direct preference optimization, encouraging LVLMs to generate non-hallucinated answers.

Through training in these two stages, the MHR framework significantly reduces hallucination generation in multilingual environments.
Experimental results show that MHR improves accuracy by an average of 19.0% on the extended multilingual POPE benchmark, covering 13 different languages.
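
The Stage 2 idea (pick a preferred and a dispreferred response per query, then train with direct preference optimization) can be illustrated with a minimal sketch. This is not the authors' implementation. The summary does not specify the cross-lingual alignment metric, so the sketch assumes each sampled response already carries a hypothetical `score` (higher meaning closer agreement with an English reference response). The names `Response`, `select_pair`, `min_gap`, and `dpo_loss` are illustrative, and the highest-versus-lowest selection rule is an assumption. The loss itself is the standard DPO objective for a single pair, computed from sequence log-probabilities under the policy and a frozen reference model.

```python
import math
from dataclasses import dataclass


@dataclass
class Response:
    text: str
    score: float  # hypothetical alignment score; higher = closer to the English reference


def select_pair(responses, min_gap=0.1):
    """Return (chosen, rejected), or None if the scores are too close to be informative.

    Assumed rule: the best-scoring response is 'chosen', the worst is 'rejected'.
    """
    if len(responses) < 2:
        return None
    ranked = sorted(responses, key=lambda r: r.score, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if chosen.score - rejected.score < min_gap:
        return None
    return chosen, rejected


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); use a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch the loss falls below `log(2)` once the policy assigns a larger log-probability gain (relative to the reference model) to the chosen response than to the rejected one, which is the behavior preference optimization is meant to reward. In practice the log-probabilities would come from the LVLM conditioned on the image and the query in the target language.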