Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai
2024-11-16
Abstract: Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset, and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach demonstrates improved performance across multiple benchmarks, particularly in multimodal reasoning tasks. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B. We hope this study could inspire further advancements in MLLMs. Code, data, and model shall be publicly released.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the weak Chain-of-Thought (CoT) reasoning of existing open-source multimodal large language models (MLLMs). These models perform well when asked for a direct answer, but their performance degrades when they must give a CoT answer that includes a detailed reasoning process. The paper attributes this mainly to a distribution shift introduced by Supervised Fine-Tuning (SFT): training relies on teacher forcing, where each token is predicted from the ground-truth prefix, whereas at inference time the model must predict each token from its own previous outputs. This mismatch between training and inference is particularly evident in long-form reasoning, and it weakens the model's multimodal reasoning ability. To address it, the paper proposes a Preference Optimization (PO) approach with two components: a high-quality multimodal reasoning preference dataset (MMPR) and a Mixed Preference Optimization (MPO) algorithm. MPO combines a preference loss, a quality loss, and a generation loss, so that the model learns the relative preference between different responses, the absolute quality of an individual response, and the process of generating preferred responses. Experimental results show that the MPO-trained model outperforms the baseline on multiple benchmarks, especially on multimodal reasoning tasks, and on some tasks its performance is close to or even better than that of a model with 10 times as many parameters. For example, InternVL2-8B-MPO reaches 67.0 on MathVista, 8.7 points above InternVL2-8B and comparable to InternVL2-76B.
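To make the three-part objective concrete, here is a minimal sketch of how a preference term, a quality term, and a generation term might be combined for a single (chosen, rejected) response pair. The summary above does not spell out the exact formulas, so everything specific here is an assumption made for illustration: a DPO-style pairwise term, a binary-classification-style quality term, a length-normalized negative log-likelihood, and the hypothetical names `beta`, `shift`, `w_p`, `w_q`, and `w_g` with placeholder default values. Inputs are summed log-probabilities of each response under the trained policy and a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Relative preference (assumed DPO-style): push the chosen response's
    log-ratio above the rejected response's log-ratio."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -log_sigmoid(margin)


def quality_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
                 beta=0.1, shift=0.0):
    """Absolute quality (assumed binary-classifier style): score each response
    on its own, labeling chosen as good and rejected as bad."""
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    return -log_sigmoid(chosen_reward - shift) - log_sigmoid(-(rejected_reward - shift))


def generation_loss(policy_chosen, num_chosen_tokens):
    """Generation term (assumed SFT-style): per-token negative log-likelihood
    of the chosen response."""
    return -policy_chosen / num_chosen_tokens


def mixed_preference_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
                          num_chosen_tokens, w_p=1.0, w_q=1.0, w_g=1.0, beta=0.1):
    """Weighted sum of the three terms; the weights are placeholders."""
    l_p = preference_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta)
    l_q = quality_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta)
    l_g = generation_loss(policy_chosen, num_chosen_tokens)
    return w_p * l_p + w_q * l_q + w_g * l_g


if __name__ == "__main__":
    # Hypothetical log-probabilities for one preference pair.
    print(mixed_preference_loss(-8.0, -14.0, -10.0, -12.0, num_chosen_tokens=5))
```

In this sketch, raising the policy's log-probability of the chosen response lowers all three terms at once, while lowering its log-probability of the rejected response affects only the preference and quality terms. A real training loop would compute these quantities per batch with an autodiff framework rather than on scalar floats.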