Abstract: Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models heavily relies on design choices such as network structures, training data, and training strategies, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models. We implement over 20 variants with controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute the first, to our best knowledge, comprehensive evaluation set including both image and video tasks through crowd-sourcing. Based on our findings, we present Lynx, which performs the most accurate multi-modal understanding while keeping the best multi-modal generation ability compared to existing open-sourced GPT4-style models.

What problem does this paper attempt to address?

The main problems this paper attempts to address are:

1. **How to train GPT-4 style language models with multimodal input capabilities**: Recently, large language models (LLMs) like GPT-4 have shown excellent performance in handling multimodal data, being able to execute open-ended instructions based on images. However, the performance of these models is highly dependent on the design choices of network architecture, training data, and training strategies. These design choices are insufficiently discussed in the existing literature, making it difficult to quantify progress in this field.
2. **Identifying key factors affecting the performance of multimodal LLMs**: To improve the performance of multimodal language models, it is necessary to identify which factors are crucial. The paper systematically and comprehensively explores the impact of network architecture, training data, diverse instructions, and other factors. It proposes a new model named Lynx, which outperforms existing open-source GPT-4 style models in multimodal understanding and generation capabilities.
3. **Establishing appropriate evaluation benchmarks**: There is currently a lack of quantitative benchmarks suitable for evaluating and comparing multimodal LLMs, making it difficult to attribute and quantify the progress of open-source multimodal LLMs. To address this, the paper contributes a comprehensive evaluation set that includes image and video tasks, constructed through crowdsourcing, to assess the multimodal understanding and text generation performance of the models.

By addressing these issues, the paper aims to provide guidance for the research and development of multimodal LLMs, promoting further advancement in this field.

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

A Survey on Multimodal Large Language Models

On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications

TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

On the Performance of Multimodal Language Models

Cross-Modal Consistency in Multimodal Large Language Models

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

A Survey on Evaluation of Multimodal Large Language Models

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond

From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities