From Image to Video, what do we need in multimodal LLMs?

Suyuan Huang, Haoxin Zhang, Yan Gao, Yao Hu, Zengchang Qin
2024-04-18
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information, covering from Image LLMs to the more complex Video LLMs. Numerous studies have illustrated their exceptional cross-modal comprehension. Recently, integrating video foundation models with large language models to build a comprehensive video understanding system has been proposed to overcome the limitations of specific pre-defined vision tasks. However, the current advancements in Video LLMs tend to overlook the foundational contributions of Image LLMs, often opting for more complicated structures and a wide variety of multimodal data for pre-training. This approach significantly increases the costs associated with these
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to move from Image LLMs to Video LLMs at low cost. Current Video LLM methods often overlook the foundational contributions of Image LLMs and instead rely on more complex architectures and large amounts of multimodal pre-training data, which raises training cost. The study proposes RED-VILLM, a resource-efficient development method that reuses the prior knowledge of an Image LLM and extends the model's understanding of temporal information through a plug-and-play temporal adaptation structure. This makes it possible to develop a high-performance Video LLM with limited instruction data and training resources. The approach highlights the potential of building multimodal models more economically and scalably on the foundation of Image LLMs. Experimental results show that RED-VILLM improves the model's temporal understanding and surpasses baseline models in video understanding and generation tasks.
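The summary above does not spell out how the temporal structure works, so the following is only a minimal sketch of one common way to feed video into an image-based model: pool per-frame patch features over time and over space, then concatenate the two summaries into a single token sequence. The function names (`pool_over_time`, `pool_over_space`, `video_tokens`) and the toy feature shapes are hypothetical and chosen for illustration; this is not confirmed to be RED-VILLM's actual module.

```python
from statistics import fmean

def pool_over_time(frames):
    """Average each patch position across all frames: (T, N, D) -> (N, D)."""
    num_patches = len(frames[0])
    dim = len(frames[0][0])
    return [
        [fmean(frame[p][d] for frame in frames) for d in range(dim)]
        for p in range(num_patches)
    ]

def pool_over_space(frames):
    """Average all patches within each frame: (T, N, D) -> (T, D)."""
    dim = len(frames[0][0])
    return [
        [fmean(patch[d] for patch in frame) for d in range(dim)]
        for frame in frames
    ]

def video_tokens(frames):
    """Concatenate both summaries into N + T tokens of width D.

    Assumption: the downstream Image LLM projector accepts any sequence
    of D-dimensional visual tokens, so no new architecture is needed here.
    """
    return pool_over_time(frames) + pool_over_space(frames)

# Toy input: 2 frames, 3 patches per frame, 2-dimensional features.
frames = [
    [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]],
    [[3.0, 2.0], [4.0, 2.0], [5.0, 2.0]],
]
tokens = video_tokens(frames)
```

In this sketch the first three tokens describe what each patch position looks like on average over the clip, and the last two describe each frame as a whole, so the order of the frame-level tokens is what carries the temporal signal.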