Collaborative Training of Tiny-Large Vision Language Models

Shichen Lu, Longteng Guo, Wenxuan Wang, Zijia Zhao, Tongtian Yue, Jing Liu, Si Liu
DOI: https://doi.org/10.1145/3664647.3681026
2024-01-01
Abstract: Recently, large vision language models (LVLMs) have advanced AI by integrating visual and linguistic data for tasks such as visual conversation, image captioning, and visual question answering. Current LVLM research either scales up model size to improve performance or reduces parameter counts to fit limited computational resources. We believe that large and tiny models each have unique strengths and that collaborative training yields better results than independent training. We propose Collaborative Training of Tiny-Large Vision Language Models (CTVLMs), a framework that connects large and tiny models via a projection layer and leverages a synergistic training strategy. The framework improves training efficiency by strengthening the interconnection between the large and tiny models. Exploiting the parameter efficiency of tiny models, we first align image-text features, then apply knowledge distillation to help the large model better align cross-modal information. During fine-tuning, the large model's extensive knowledge enhances the tiny model's performance. This collaborative approach allows the models to adapt to various computational resources and outperforms existing methods on vision-language tasks.
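
The abstract names two ingredients, a projection layer that connects the tiny and large models and a knowledge distillation step, without giving their exact form. The following is a minimal sketch of how such a connection could look, not the authors' implementation. It assumes a bias-free linear projection and a mean-squared-error feature loss, and it treats the tiny model as the teacher during alignment. All names (`project`, `feature_distill_loss`), the toy dimensions, and the numbers are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(weights: Matrix, features: Vector) -> Vector:
    """Linear projection (assumed form): map a d_tiny vector to d_large, no bias."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def feature_distill_loss(teacher: Vector, student: Vector) -> float:
    """Mean squared error between teacher and student features (assumed loss)."""
    if len(teacher) != len(student):
        raise ValueError("feature dimensions must match")
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)


# Hypothetical toy sizes: tiny model width 2, large model width 3.
tiny_features = [1.0, 2.0]
large_features = [1.0, 2.0, 2.5]
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape (3, 2)

# Lift the tiny model's features into the large model's feature space,
# then measure how far the large model's features are from them.
projected = project(projection, tiny_features)
loss = feature_distill_loss(projected, large_features)
```

In a real training loop the projection weights would be learned and the loss would be minimized by gradient descent; in the fine-tuning stage described above, the teacher and student roles would be swapped so that the large model guides the tiny one.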