What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong
2023-07-30
Abstract: Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models heavily relies on design choices such as network structures, training data, and training strategies, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models. We implement over 20 variants with controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute the first, to our best knowledge, comprehensive evaluation set including both image and video tasks through crowd-sourcing. Based on our findings, we present Lynx, which performs the most accurate multi-modal understanding while keeping the best multi-modal generation ability compared to existing open-sourced GPT4-style models.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper attempts to address three main problems:

1. **How to train GPT-4 style language models with multimodal input capabilities**: Recent large language models (LLMs) such as GPT-4 can follow open-ended instructions given images. However, their performance depends heavily on design choices in network architecture, training data, and training strategies. These choices are insufficiently discussed in the existing literature, which makes progress in the field difficult to quantify.
2. **Identifying key factors affecting the performance of multimodal LLMs**: Improving multimodal language models requires knowing which factors actually matter. The paper implements over 20 variants under controlled settings to study the impact of LLM backbones and model designs, training data and sampling strategies, and diversified instruction prompts. Based on these findings, it presents Lynx, which achieves the most accurate multimodal understanding while keeping the best multimodal generation ability among existing open-source GPT-4 style models.
3. **Establishing appropriate evaluation benchmarks**: Quantitative benchmarks suitable for evaluating and comparing multimodal LLMs are lacking, which makes it hard to attribute and quantify the progress of open-source models. To address this, the paper contributes a comprehensive, crowdsourced evaluation set covering both image and video tasks, used to assess multimodal understanding and text generation performance.

By addressing these problems, the paper aims to provide guidance for the research and development of multimodal LLMs.