Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

Renjie Pi, Tianyang Han, Wei Xiong, Jipeng Zhang, Runtao Liu, Rui Pan, Tong Zhang
2024-04-03
Abstract: Multimodal Large Language Models (MLLMs) excel in generating responses based on visual inputs. However, they often suffer from a bias towards generating responses similar to their pretraining corpus, overshadowing the importance of visual information. We treat this bias as a "preference" for pretraining statistics, which hinders the model's grounding in visual input. To mitigate this issue, we propose Bootstrapped Preference Optimization (BPO), which conducts preference learning with datasets containing negative responses bootstrapped from the model itself. Specifically, we propose the following two strategies: 1) using distorted image inputs to the MLLM for eliciting responses that contain signified pretraining bias; 2) leveraging text-based LLM to explicitly inject erroneous but common elements into the original response. Those undesirable responses are paired with original annotated responses from the datasets to construct the preference dataset, which is subsequently utilized to perform preference learning. Our approach effectively suppresses pretrained LLM bias, enabling enhanced grounding in visual inputs. Extensive experimentation demonstrates significant performance improvements across multiple benchmarks, advancing the state-of-the-art in multimodal conversational systems.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper aims to address the issue of bias in multimodal large language models (MLLMs) when generating responses based on visual inputs. Specifically, these models tend to generate responses similar to their pre-training corpus, neglecting the importance of visual information. This bias leads to poor performance when handling image inputs, resulting in errors or hallucinations, such as generating non-existent objects, misidentifying attributes, or providing inaccurate object counts. These issues make MLLMs unreliable in high-risk real-world applications, such as autonomous driving systems or medical assistants.

### Solution

To solve this problem, the authors propose the Bootstrapped Preference Optimization (BPO) method. BPO constructs a preference dataset through the following two strategies:

1. **Image Perturbation Prompts**: By distorting image features, the model generates responses that contain significant pre-training biases. These responses, while related to the image input, are closer to the pre-training distribution, thereby exposing the model's biases.
2. **LLM Bias Injection**: Utilizing the LLM component within the MLLM to directly modify the original responses, generating negative samples that contain common but incorrect elements. These negative samples are paired with the original annotations to construct the preference dataset.

Through these strategies, BPO effectively suppresses the biases of the pre-trained LLM, enhancing the model's reliance on visual inputs. Experimental results show that BPO significantly improves performance across multiple benchmarks and greatly reduces object hallucination phenomena.

### Main Contributions

1. **Novel Perspective**: Treating multimodal alignment as a preference learning task, where pre-training biases and visual grounding are viewed as old and new preferences, respectively.
2. **Automatic Construction of Preference Dataset**: Proposing a method for large-scale automatic generation of negative samples, effectively exposing the pre-training biases of MLLMs.
3. **Empirical Validation**: Demonstrating through experiments that the BPO method significantly improves the grounding effect of MLLMs on image inputs and achieves performance improvements across multiple benchmarks.
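
The pairing step described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: Gaussian pixel noise stands in for whatever distortion the paper actually applies, `generate` is a hypothetical callable wrapping an MLLM, and `PreferencePair`, `distort_image`, and `build_pair` are invented names. The summary also does not say which preference-learning objective is used, so `dpo_style_loss` shows one common choice (a DPO-style pairwise loss) purely as an example.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]  # toy grayscale image, pixel values in [0, 1]


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # original annotated response from the dataset
    rejected: str  # response the model produced for a distorted image


def distort_image(image: Image, noise_std: float, rng: random.Random) -> Image:
    """Add Gaussian noise to every pixel and clamp to [0, 1].

    Assumption: pixel noise is only a stand-in for the paper's distortion.
    """
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, noise_std))) for px in row]
        for row in image
    ]


def build_pair(
    image: Image,
    prompt: str,
    annotated_response: str,
    generate: Callable[[Image, str], str],  # hypothetical MLLM wrapper
    noise_std: float = 0.5,
    seed: int = 0,
) -> PreferencePair:
    """Pair the annotated response with one generated from a distorted image."""
    rng = random.Random(seed)
    rejected = generate(distort_image(image, noise_std, rng), prompt)
    return PreferencePair(prompt=prompt, chosen=annotated_response, rejected=rejected)


def dpo_style_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Pairwise loss -log(sigmoid(margin)), one possible objective (assumed)."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy usage: a fake "model" that ignores the image and answers from its prior.
def toy_generate(image: Image, prompt: str) -> str:
    return "A dog is sitting on the grass."


toy_image: Image = [[0.2, 0.8], [0.5, 0.1]]
pair = build_pair(
    toy_image,
    prompt="What is in the picture?",
    annotated_response="A cat is lying on a sofa.",
    generate=toy_generate,
)
```

In a real pipeline, `generate` would call the MLLM being trained, and the log-probabilities passed to the loss would come from that model and a frozen reference copy. The second strategy (LLM bias injection) would produce `rejected` by editing the annotated response instead of distorting the image.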