Abstract: Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks. However, their deployment is hindered by substantial computational costs in both training and inference, limiting accessibility to the broader research and user communities. A straightforward solution is to leverage smaller pre-trained vision and language models, which inevitably cause significant performance drops. In this paper, we demonstrate the possibility of training a smaller but better MLLM with high-quality training data. Specifically, we introduce Bunny, a family of lightweight MLLMs with flexible vision and language backbones for efficient multimodal learning from selected training data. Experiments show that our Bunny-4B/8B outperforms the state-of-the-art large MLLMs on multiple benchmarks. We expect that this work can provide the community with a clean and flexible open-source tool for further research and development. The code, models, and data can be found at https://github.com/BAAI-DCAI/Bunny.

What problem does this paper attempt to address?

The paper addresses the high computational cost of training and running inference with Multimodal Large Language Models (MLLMs), which limits their use across the broader research and user communities. Its approach is to optimize the training data so that smaller models can reach better performance. Specifically, it introduces Bunny, a family of lightweight multimodal models with flexible vision and language backbones. To compensate for the performance drop caused by downsizing the model, the researchers constructed a higher-quality training dataset. Experiments show that the Bunny-4B/8B models outperform similarly sized MLLMs on multiple benchmarks and, in some cases, surpass large MLLMs.

The main contributions of the paper can be summarized as follows:

1. **Introduction of the Bunny model**: A lightweight multimodal model with a plug-and-play vision encoder and language model backbone, plus a projector for cross-modal fusion. Bunny supports a variety of lightweight vision encoders (such as SigLIP) and language models (such as Phi and Llama).
2. **Data Optimization**: The researchers designed a three-step method to select a high-quality subset from large-scale datasets (e.g., LAION-2B) as training data and so improve training efficiency. The method consists of clustering, graph construction, and ranking, and yields a high-quality subset of 2 million samples.
3. **Model Performance**: A series of experiments shows that, even at a smaller model scale, Bunny models trained on the optimized data achieve excellent results on multiple tasks, including visual question answering and multimodal understanding.

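
The plug-and-play design in the first contribution can be illustrated with a small sketch. The summary only says that a projector maps vision features into the language model's input space; the two-layer MLP with GELU used here is an assumption (a common choice for such projectors), and the names `Projector` and `build_llm_input` are hypothetical, not taken from the Bunny code base.

```python
import math
import random


def gelu(x):
    # Exact GELU activation.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    # x: vector of length in_dim; weight: out_dim rows of length in_dim.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    # Small random weights and zero bias, for illustration only.
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class Projector:
    """Hypothetical two-layer MLP mapping vision features to the LLM width."""

    def __init__(self, vision_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.fc1 = init_linear(vision_dim, llm_dim, rng)
        self.fc2 = init_linear(llm_dim, llm_dim, rng)

    def __call__(self, patch_features):
        projected = []
        for patch in patch_features:
            hidden = [gelu(h) for h in linear(patch, *self.fc1)]
            projected.append(linear(hidden, *self.fc2))
        return projected


def build_llm_input(visual_tokens, text_embeddings):
    # Assumption: projected image tokens are simply prepended to the text tokens.
    return visual_tokens + text_embeddings


# Toy usage: 3 image patches of width 4, projected to an LLM width of 6.
projector = Projector(vision_dim=4, llm_dim=6)
patches = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
visual_tokens = projector(patches)
sequence = build_llm_input(visual_tokens, [[0.0] * 6, [0.0] * 6])
print(len(sequence), len(sequence[0]))
```

Because only the projector depends on both widths, swapping the vision encoder or the language model in this sketch means changing `vision_dim` or `llm_dim` and nothing else, which is the sense in which the backbones are interchangeable.
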
In summary, the goal of this paper is to reduce the computational cost of multimodal large language models through data optimization methods while maintaining or enhancing their performance, thereby promoting the widespread adoption and application of such models.
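
The summary names the three data-selection steps (clustering, graph construction, ranking) without giving their details. The following is a minimal sketch under stated assumptions: cluster assignments are precomputed (for example by k-means on image embeddings), the graph links samples in the same cluster whose image embeddings exceed a cosine-similarity threshold, one representative is kept per connected component, and the survivors are ranked by image-text similarity. The function name `select_subset`, the threshold, the keep fraction, and the decision to keep the top of the ranking are all hypothetical and not taken from the paper.

```python
import math
from collections import defaultdict


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def connected_components(n, edges):
    # Union-find over nodes 0..n-1.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return list(groups.values())


def select_subset(image_embs, text_embs, cluster_ids, sim_threshold=0.9, keep_fraction=0.5):
    # Step 1 (clustering) is assumed done: cluster_ids[i] is sample i's cluster.
    by_cluster = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(idx)

    survivors = []
    for members in by_cluster.values():
        # Step 2: link near-duplicate images within the cluster.
        edges = [
            (a, b)
            for a in range(len(members))
            for b in range(a + 1, len(members))
            if cosine(image_embs[members[a]], image_embs[members[b]]) >= sim_threshold
        ]
        # Keep one sample per connected component (here: the first one, an assumption).
        for component in connected_components(len(members), edges):
            survivors.append(members[component[0]])

    # Step 3: rank survivors by image-text similarity and keep a slice of the ranking.
    ranked = sorted(survivors, key=lambda i: cosine(image_embs[i], text_embs[i]), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return sorted(ranked[:keep])


# Toy data: samples 0/1 and 2/3 are near-duplicate image pairs.
image_embs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.1, 1.0]]
text_embs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(select_subset(image_embs, text_embs, cluster_ids=[0, 0, 1, 1]))
```

The pairwise comparison is quadratic in cluster size, so restricting it to samples within one cluster is what keeps the graph step tractable in this sketch; at the scale of LAION-2B a real pipeline would need approximate nearest-neighbor search rather than the exhaustive loop shown here.
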

Efficient Multimodal Learning from Data-centric Perspective
