Abstract: In recent years, multimodal large language models (MLLMs) such as GPT-4V have demonstrated remarkable advancements, excelling in a variety of vision-language tasks. Despite their prowess, the closed-source nature and computational demands of such models limit their accessibility and applicability. This study introduces TinyGPT-V, a novel open-source MLLM designed for efficient training and inference across various vision-language tasks, including image captioning (IC) and visual question answering (VQA). Leveraging a compact yet powerful architecture, TinyGPT-V integrates the Phi-2 language model with pre-trained vision encoders, using a unique mapping module to fuse visual and linguistic information. With a training regimen optimized for small backbones and a diverse amalgam of datasets, TinyGPT-V requires significantly fewer computational resources (24GB for training and as little as 8GB for inference) without compromising performance. Our experiments demonstrate that TinyGPT-V, whose language model has 2.8 billion parameters, achieves results comparable to those of its larger counterparts in VQA and image inference tasks, while being uniquely suited for deployment on resource-constrained devices through innovative quantization techniques. This work not only paves the way for more accessible and efficient MLLMs but also underscores the potential of smaller, optimized models in bridging the gap between high performance and computational efficiency in real-world applications. Additionally, this paper introduces a new approach to multimodal large language models using smaller backbones. Our code and training weights are available in the supplementary material.
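As a rough illustration of why a 2.8-billion-parameter language model can fit within the 8GB inference figure quoted above, the sketch below estimates the memory taken by the language-model weights alone at a few numeric precisions. The function name and the chosen bit widths are assumptions made for this example, not details from the paper, and the estimate ignores the vision encoder, the mapping module, activations, and any runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in decimal gigabytes.

    Hypothetical helper for illustration; it does not account for
    activations, caches, or framework overhead.
    """
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Parameter count of the language model, as stated in the abstract.
LANGUAGE_MODEL_PARAMS = 2.8e9

# Bit widths chosen for illustration; the abstract does not say which are used.
for bits in (32, 16, 8):
    gb = weight_memory_gb(LANGUAGE_MODEL_PARAMS, bits)
    print(f"{bits:>2}-bit weights: about {gb:.1f} GB")
```

Under these assumptions, halving the number of bits per weight halves the weight footprint, which is the basic reason quantization helps on resource-constrained devices.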

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning

Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference

Understanding LLMs: A Comprehensive Overview from Training to Inference

Improving Large Models with Small models: Lower Costs and Better Performance

ModelGPT: Unleashing LLM's Capabilities for Tailored Model Generation

ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

SMART: Automatically Scaling Down Language Models with Accuracy Guarantees for Reduced Processing Fees

TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction

Parameter-efficient Tuning for Large Language Model Without Calculating Its Gradients

Mini-GPTs: Efficient Large Language Models through Contextual Pruning

FoldGPT: Simple and Effective Large Language Model Compression Scheme

FLM-101B: An Open LLM and How to Train It with $100K Budget

LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models

Large Language Models (LLMs): Deployment, Tokenomics and Sustainability

Ladder: A Model-Agnostic Framework Boosting LLM-based Machine Translation to the Next Level

CPM-2: Large-scale Cost-effective Pre-trained Language Models

The unreasonable effectiveness of large language models in zero-shot semantic annotation of legal texts

MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic

Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones