Abstract: Multimodal Large Language Models (MLLMs) excel in generating responses based on visual inputs. However, they often suffer from a bias towards generating responses similar to their pretraining corpus, overshadowing the importance of visual information. We treat this bias as a "preference" for pretraining statistics, which hinders the model's grounding in visual input. To mitigate this issue, we propose Bootstrapped Preference Optimization (BPO), which conducts preference learning with datasets containing negative responses bootstrapped from the model itself. Specifically, we propose the following two strategies: 1) using distorted image inputs to the MLLM for eliciting responses that contain signified pretraining bias; 2) leveraging text-based LLM to explicitly inject erroneous but common elements into the original response. Those undesirable responses are paired with original annotated responses from the datasets to construct the preference dataset, which is subsequently utilized to perform preference learning. Our approach effectively suppresses pretrained LLM bias, enabling enhanced grounding in visual inputs. Extensive experimentation demonstrates significant performance improvements across multiple benchmarks, advancing the state-of-the-art in multimodal conversational systems.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the issue of bias in multimodal large language models (MLLMs) when generating responses based on visual inputs. Specifically, these models tend to generate responses similar to their pre-training corpus, neglecting the importance of visual information. This bias leads to poor performance when handling image inputs, resulting in errors or hallucinations, such as generating non-existent objects, misidentifying attributes, or providing inaccurate object counts. These issues make MLLMs unreliable in high-risk real-world applications, such as autonomous driving systems or medical assistants.

### Solution

To solve this problem, the authors propose the Bootstrapped Preference Optimization (BPO) method. BPO constructs a preference dataset through the following two strategies:

1. **Image Perturbation Prompts**: By distorting image features, the model generates responses that contain significant pre-training biases. These responses, while related to the image input, are closer to the pre-training distribution, thereby exposing the model's biases.
2. **LLM Bias Injection**: Utilizing the LLM component within the MLLM to directly modify the original responses, generating negative samples that contain common but incorrect elements. These negative samples are paired with the original annotations to construct the preference dataset.

Through these strategies, BPO effectively suppresses the biases of the pre-trained LLM, enhancing the model's reliance on visual inputs. Experimental results show that BPO significantly improves performance across multiple benchmarks and greatly reduces object hallucination phenomena.

### Main Contributions

1. **Novel Perspective**: Treating multimodal alignment as a preference learning task, where pre-training biases and visual grounding are viewed as old and new preferences, respectively.
2. **Automatic Construction of Preference Dataset**: Proposing a method for large-scale automatic generation of negative samples, effectively exposing the pre-training biases of MLLMs.
3. **Empirical Validation**: Demonstrating through experiments that the BPO method significantly improves the grounding effect of MLLMs on image inputs and achieves performance improvements across multiple benchmarks.
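
To make the image-perturbation strategy and the subsequent preference learning step more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that distortion means adding Gaussian noise to a float image in [0, 1], and that preference learning uses a DPO-style objective; the summary specifies neither. The names `distort_image`, `build_preference_pair`, `dpo_loss`, `generate_fn`, `noise_scale`, and `beta` are all hypothetical.

```python
import numpy as np


def distort_image(image, noise_scale=0.5, rng=None):
    """Add Gaussian noise to a float image in [0, 1].

    Assumption: the summary only says image features are distorted;
    Gaussian noise is one plausible choice, not a confirmed one.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = image + rng.normal(0.0, noise_scale, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def build_preference_pair(prompt, image, annotated_response, generate_fn, rng=None):
    """Pair the annotated response (chosen) with a bootstrapped negative (rejected).

    `generate_fn(prompt, image) -> str` is a hypothetical stand-in for the MLLM.
    The negative is produced from the distorted image, so it is expected to
    lean on pre-training statistics rather than on the visual content.
    """
    distorted = distort_image(image, rng=rng)
    rejected = generate_fn(prompt, distorted)
    return {
        "prompt": prompt,
        "image": image,  # training still conditions on the clean image
        "chosen": annotated_response,
        "rejected": rejected,
    }


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style loss over a batch of sequence log-probabilities.

    Assumption: a DPO-style objective is used for preference learning.
    Computes mean(-log(sigmoid(margin))) in a numerically stable way.
    """
    pc = np.asarray(policy_chosen_logp, dtype=float)
    pr = np.asarray(policy_rejected_logp, dtype=float)
    rc = np.asarray(ref_chosen_logp, dtype=float)
    rr = np.asarray(ref_rejected_logp, dtype=float)
    margin = beta * ((pc - rc) - (pr - rr))
    # -log(sigmoid(m)) == log(1 + exp(-m)) == logaddexp(0, -m)
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

In this sketch the loss decreases as the policy assigns relatively more probability to the annotated response than to the bootstrapped negative, compared with a frozen reference model. That mirrors the framing above, in which pre-training bias is the old preference and visual grounding is the new one.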

Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
