Efficient Multimodal Learning from Data-centric Perspective

Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, Bo Zhao
2024-07-22
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks. However, their deployment is hindered by substantial computational costs in both training and inference, limiting accessibility to the broader research and user communities. A straightforward solution is to leverage smaller pre-trained vision and language models, which inevitably causes significant performance drops. In this paper, we demonstrate the possibility of training a smaller but better MLLM with high-quality training data. Specifically, we introduce Bunny, a family of lightweight MLLMs with flexible vision and language backbones for efficient multimodal learning from selected training data. Experiments show that our Bunny-4B/8B outperforms the state-of-the-art large MLLMs on multiple benchmarks. We expect that this work can provide the community with a clean and flexible open-source tool for further research and development. The code, models, and data can be found at https://github.com/BAAI-DCAI/Bunny.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the high computational cost of training and running inference with Multimodal Large Language Models (MLLMs), which limits their use across the broader research and user communities. Rather than relying on model downsizing alone, the authors optimize the training data so that smaller models can perform better. Specifically, the paper introduces Bunny, a family of lightweight multimodal models with flexible vision and language backbones. To compensate for the performance drop caused by downsizing, the researchers constructed a higher-quality training dataset. Experimental results show that the proposed Bunny-4B/8B models outperform similarly sized MLLMs on multiple benchmarks and, in some cases, surpass large MLLMs.

The main contributions of the paper can be summarized as follows:

1. **Introduction of the Bunny model**: A lightweight multimodal model with a plug-and-play vision encoder and language model backbone, connected by a projector for cross-modal fusion. Bunny supports a variety of lightweight vision encoders (such as SigLIP) and language models (such as Phi and Llama).
2. **Data Optimization**: The researchers designed a three-step method (clustering, graph construction, and ranking) to select a high-quality subset from large-scale datasets (e.g., LAION-2B) as training data and improve training efficiency. The result is a high-quality subset of 2 million samples.
3. **Model Performance**: A series of experiments demonstrated that, even at a smaller model scale, Bunny models trained on the optimized data achieve strong results on multiple tasks, including visual question answering and multimodal understanding.
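The three-step selection (clustering, graph construction, ranking) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each sample already has an embedding vector, uses a single nearest-seed assignment in place of a full clustering algorithm, links near-duplicates by a cosine-similarity threshold, and keeps a band of samples ranked by similarity to the cluster centroid. The function names, the `dup_threshold` value, and the `keep` band are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def assign_clusters(embeddings, seeds):
    """Step 1 (clustering): assign each sample to its most similar seed.

    Assumption: one nearest-seed pass stands in for a real clustering run.
    """
    clusters = defaultdict(list)
    for idx, emb in enumerate(embeddings):
        best = max(range(len(seeds)), key=lambda k: cosine(emb, seeds[k]))
        clusters[best].append(idx)
    return clusters


def connected_components(members, embeddings, dup_threshold):
    """Step 2 (graph construction): link near-duplicates, return components."""
    parent = {i: i for i in members}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # An edge exists between two samples whose similarity reaches the threshold.
    for pos, i in enumerate(members):
        for j in members[pos + 1:]:
            if cosine(embeddings[i], embeddings[j]) >= dup_threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in members:
        groups[find(i)].append(i)
    return list(groups.values())


def select_subset(embeddings, seeds, dup_threshold=0.95, keep=(0.25, 0.75)):
    """Run all three steps and return the sorted indices of kept samples."""
    selected = []
    for members in assign_clusters(embeddings, seeds).values():
        center = centroid([embeddings[i] for i in members])
        # Keep one representative per near-duplicate group (lowest index here).
        reps = [min(group) for group in
                connected_components(members, embeddings, dup_threshold)]
        # Step 3 (ranking): order by similarity to the centroid, keep a band.
        ranked = sorted(reps, key=lambda i: cosine(embeddings[i], center),
                        reverse=True)
        lo = int(len(ranked) * keep[0])
        hi = max(lo + 1, int(len(ranked) * keep[1]))
        selected.extend(ranked[lo:hi])
    return sorted(selected)
```

Calling `select_subset(embeddings, seeds)` returns the indices of the retained samples. The pairwise comparison inside each cluster is quadratic, so at the scale of a dataset like LAION-2B a real pipeline would need approximate nearest-neighbor search or a similar shortcut; the sketch only shows how the three steps fit together.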
In summary, the goal of this paper is to reduce the computational cost of multimodal large language models through data optimization methods while maintaining or enhancing their performance, thereby promoting the widespread adoption and application of such models.