MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang

2024-10-01

Abstract: We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.

Computer Vision and Pattern Recognition, Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper aims to improve Multimodal Large Language Models (MLLMs) in three areas: text-rich image understanding, visual referring and grounding, and multi-image reasoning. Specifically:

1. **Text-rich Image Understanding**: Improving the model's ability to read and interpret text-rich images by incorporating high-resolution image data and high-quality Optical Character Recognition (OCR) data.
2. **Visual Referring and Grounding**: Enabling the model to interpret visual prompts (such as points and bounding boxes) in its input and to generate responses grounded in the image, giving it a finer-grained understanding of image content.
3. **Multi-image Reasoning and In-context Learning**: Equipping the model with strong in-context learning and multi-image reasoning capabilities through large-scale interleaved pre-training.

Additionally, the paper studies models at different scales (ranging from 1 billion to 30 billion parameters) and introduces two specialized variants: MM1.5-Video (for video understanding) and MM1.5-UI (for mobile user interface understanding). Through extensive empirical studies and ablations, the paper provides a detailed analysis of the training process, offering guidance for future MLLM research.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

MM-LLMs: Recent Advances in MultiModal Large Language Models

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Matryoshka Multimodal Models

CaMML: Context-Aware Multimodal Learner for Large Models

Multimodal Instruction Tuning with Hybrid State Space Models

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

InfMLLM: A Unified Framework for Visual-Language Tasks

Model Composition for Multimodal Large Language Models

Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

NoteLLM-2: Multimodal Large Representation Models for Recommendation

A Survey on Multimodal Large Language Models

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks