Data-augmented phrase-level alignment for mitigating object hallucination

Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan Ö. Arık, Tomas Pfister
2024-10-09
Abstract: Despite their significant advancements, Multimodal Large Language Models (MLLMs) often generate factually inaccurate information, referred to as hallucination. In this work, we address object hallucinations in MLLMs, where information is generated about an object not present in the input image. We introduce Data-augmented Phrase-level Alignment (DPA), a novel loss which can be applied to instruction-tuned off-the-shelf MLLMs to mitigate hallucinations, while preserving their general vision-language capabilities. To fine-tune MLLMs with DPA, we first generate a set of 'hallucinated' and 'correct' response pairs through generative data augmentation by selectively altering the ground-truth information of the correct responses at a phrase level. The DPA loss is then used to train MLLMs to reduce the likelihood of hallucinated phrases compared to the correct ones. Our thorough evaluation on various benchmarks confirms the effectiveness of DPA in mitigating hallucination while retaining the out-of-the-box performance of the MLLMs on general tasks. For instance, MLLMs fine-tuned with DPA, which we refer to as Hallucination Attenuated Language and Vision Assistant (HALVA), improve F1 by up to 13.4% on hallucination visual question-answering and reduce the hallucination rate by up to 4.2% on image description tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses object hallucination in Multimodal Large Language Models (MLLMs): when processing an image, these models may generate information about objects that are not present in it. The paper proposes Data-augmented Phrase-level Alignment (DPA), a method that aims to reduce object hallucination in MLLM-generated text while preserving the models' general vision-language capabilities.

### Main Contributions

1. **DPA Method**: The paper introduces the DPA loss function, which mitigates object hallucination in MLLMs without increasing inference time and without requiring a large amount of retraining.
2. **Generative Data Augmentation**: Pairs of "hallucinated" and "correct" responses are generated for training by selectively altering ground-truth concepts in the correct responses.
3. **Performance Evaluation**: The paper evaluates DPA on multiple benchmarks, showing that it maintains or improves performance on general vision-language tasks while reducing hallucination.

### Method Overview

- **Generative Data Augmentation**: Ground-truth concepts in a correct response are selectively altered at the phrase level to produce a hallucinated counterpart. For example, "A young man is skateboarding in a skate park wearing a white shirt and blue jeans" is modified to "A young woman is roller-skating in a roller rink wearing a black dress and red sneakers".
- **DPA Loss Function**: The DPA loss function consists of two parts:
  - **Alignment Loss (L_a)**: Computes the relative log-probability of the hallucinated phrase with respect to the correct phrase, so that training lowers the likelihood of generating the hallucinated phrase.
  - **KL Divergence Regularizer (L_d)**: Uses a frozen reference model to limit how far the trained model deviates from its initial state.

### Experimental Setup

- **Training Data**: Vision-language instructions and their corresponding correct and hallucinated responses are prepared from the Visual Genome dataset.
- **Implementation Details**: LLaVA-v1.5 and VILA-v1.5 serve as base models; the vision encoder and projection layer are frozen, and only the language model is trained.
- **Evaluation Setup**: Evaluations cover object hallucination benchmarks (such as CHAIR, MME-Hall, AMBER, and MMHal-Bench) and general vision-language benchmarks (such as VQA-v2, MM-Vet, TextVQA, and MME).

### Results

- **Image Captioning Task**: HALVA increases the coverage of real objects in the image while reducing the hallucination rate.
- **Question-Answering Task**: HALVA reduces hallucination and improves overall performance on visual question-answering tasks.
- **Hallucination Evaluation**: On multiple benchmarks, HALVA significantly reduces the hallucination rate while maintaining the model's descriptive ability and level of detail.

In conclusion, by introducing DPA, the paper mitigates object hallucination in MLLMs while maintaining model performance on general vision-language tasks.
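
The two-part loss outlined in the Method Overview can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation. It assumes the alignment term takes the form -log(p_c / (p_c + p_h)) over phrase-level probabilities, that L_d is a mean token-wise KL divergence from the frozen reference model, and that the two terms are combined with a weight `alpha`. The function names, the exact form of each term, and the value of `alpha` are all hypothetical choices made for illustration.

```python
import math


def phrase_logprob(token_logprobs, span):
    """Sum token log-probabilities over a phrase span [start, end)."""
    start, end = span
    return sum(token_logprobs[start:end])


def alignment_loss(logp_correct, logp_halluc):
    """Assumed form of L_a: -log(p_c / (p_c + p_h)).

    Rewritten as log(1 + exp(logp_h - logp_c)) and evaluated with a
    numerically stable softplus.
    """
    diff = logp_halluc - logp_correct
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def kl_divergence(p_ref, q_model):
    """KL(p_ref || q_model) for one token's distribution over the vocabulary."""
    return sum(p * math.log(p / q) for p, q in zip(p_ref, q_model) if p > 0)


def dpa_loss(phrase_pairs, ref_dists, model_dists, alpha=0.4):
    """Combine the two terms; alpha is a hypothetical weighting factor.

    phrase_pairs: list of (logp_correct, logp_halluc) phrase log-probabilities.
    ref_dists / model_dists: per-token distributions from the frozen
    reference model and the model being trained.
    """
    l_a = sum(alignment_loss(c, h) for c, h in phrase_pairs) / len(phrase_pairs)
    l_d = sum(
        kl_divergence(p, q) for p, q in zip(ref_dists, model_dists)
    ) / len(ref_dists)
    return l_a + alpha * l_d


# Toy usage: one phrase pair, two token positions over a 2-word vocabulary.
pairs = [(-1.0, -5.0)]  # the correct phrase is already more likely
ref = [[0.7, 0.3], [0.5, 0.5]]
model = [[0.6, 0.4], [0.5, 0.5]]
loss = dpa_loss(pairs, ref, model)
```

Under these assumptions, the alignment term shrinks as the correct phrase becomes more likely than the hallucinated one, and the KL term is zero whenever the trained model matches the reference, so the regularizer only penalizes drift away from the initial state.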