Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai

2024-11-16

Abstract: Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset, and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach demonstrates improved performance across multiple benchmarks, particularly in multimodal reasoning tasks. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B. We hope this study could inspire further advancements in MLLMs. Code, data, and model shall be publicly released.

Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the weak Chain-of-Thought (CoT) reasoning of existing open-source multimodal large language models (MLLMs). These models perform well when asked for a direct answer, but their performance degrades when they must give a CoT answer with a detailed reasoning process. The paper attributes this mainly to the distribution shift introduced by Supervised Fine-Tuning (SFT): training relies on teacher forcing, where each token is predicted from the ground-truth preceding tokens, whereas at inference the model must predict each token from its own previous outputs. The resulting mismatch between training and inference is most pronounced in long-form reasoning, where early errors can carry forward, and it limits the model's multimodal reasoning ability.

To address this, the paper proposes a Preference Optimization (PO) approach with two parts: a high-quality multimodal reasoning preference dataset (MMPR) and a Mixed Preference Optimization (MPO) algorithm. MPO combines three terms: a Preference Loss, which teaches the model the relative preference between responses; a Quality Loss, which teaches it the absolute quality of an individual response; and a Generation Loss, which teaches it the process of generating the preferred response. Experimental results show that the MPO-optimized model outperforms the baseline on multiple benchmarks, especially on multimodal reasoning tasks, and that on some tasks its performance is close to or even better than that of a model with 10 times more parameters.
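To make the three-term objective concrete, here is a minimal sketch of how such a mixed loss could be assembled for a single preference pair. The summary above does not give formulas, so everything specific here is an assumption for illustration: a DPO-style pairwise term for the Preference Loss, a binary-classifier-style term with a reward shift `delta` for the Quality Loss, and a length-normalized negative log-likelihood of the chosen response for the Generation Loss. The function name `mpo_loss`, the weights `w_pref`, `w_qual`, `w_gen`, and the defaults for `beta` and `delta` are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def mpo_loss(
    policy_chosen_logp: float,    # sum of token log-probs of the chosen response under the policy
    policy_rejected_logp: float,  # same for the rejected response
    ref_chosen_logp: float,       # chosen response under the frozen reference model
    ref_rejected_logp: float,     # rejected response under the frozen reference model
    chosen_len: int,              # number of tokens in the chosen response
    beta: float = 0.1,            # assumed reward scale (hypothetical default)
    delta: float = 0.0,           # assumed reward shift for the quality term
    w_pref: float = 1.0,          # hypothetical weights, not taken from the paper
    w_qual: float = 1.0,
    w_gen: float = 1.0,
) -> dict:
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # Preference term (assumed DPO-style): chosen should out-score rejected.
    pref = -log_sigmoid(chosen_reward - rejected_reward)

    # Quality term (assumed binary-classifier-style): chosen is labeled good,
    # rejected is labeled bad, each judged on its own against the shift delta.
    qual = -log_sigmoid(chosen_reward - delta) - log_sigmoid(-(rejected_reward - delta))

    # Generation term (assumed SFT-style): per-token negative log-likelihood
    # of the chosen response.
    gen = -policy_chosen_logp / chosen_len

    total = w_pref * pref + w_qual * qual + w_gen * gen
    return {"preference": pref, "quality": qual, "generation": gen, "total": total}


if __name__ == "__main__":
    losses = mpo_loss(
        policy_chosen_logp=-10.0,
        policy_rejected_logp=-12.0,
        ref_chosen_logp=-11.0,
        ref_rejected_logp=-11.0,
        chosen_len=5,
    )
    for name, value in losses.items():
        print(f"{name}: {value:.4f}")
```

In this sketch the preference term only sees the gap between the two implicit rewards, while the quality term scores each response separately, which is one way to read the distinction between "relative preference" and "absolute quality" in the description above. In an actual training loop the log-probabilities would be tensors computed by the policy and reference models rather than plain floats.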

Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Modality-Fair Preference Optimization for Trustworthy MLLM Alignment

mDPO: Conditional Preference Optimization for Multimodal Large Language Models

Multimodal Chain-of-Thought Reasoning in Language Models

The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

Order Matters: Exploring Order Sensitivity in Multimodal Large Language Models

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

CONSTRUCTURE: Benchmarking CONcept STRUCTUre REasoning for Multimodal Large Language Models

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models