Demonstrative Instruction Following in Multimodal LLMs Via Integrating Low-Rank Adaptation with Ensemble Learning

Jingyu Wei, Yi Su, Kele Xu, Lingbin Zeng, Bo Liu, Huaimin Wang
DOI: https://doi.org/10.1145/3664647.3688995
2024-01-01
Abstract: Multimodal Large Language Models (MLLMs) extend a model's ability to perceive and interact across multiple modalities, and have significantly improved performance on a wide range of tasks. Vision is an important modality integrated into LLMs, and vision-language research continues to drive cutting-edge advances in the MLLM community. However, the standard pre-training pipeline on image-text pairs leaves models with a limited understanding of the relationships among multiple images and texts, as well as of visual details. In addition, fine-tuning with a frozen visual backbone prevents visual representations from improving on new data. Together, these two issues lead to suboptimal performance on demonstrative instruction following over multiple images. This work introduces a novel framework called MLoEM, which first converts long multimodal data into an interleaved image-instruction format and then adopts a fully autoregressive model architecture, allowing more robust and coherent learning from naturally occurring multimodal documents than pair-based pipelines. We further incorporate the Low-Rank Adaptation (LoRA) fine-tuning method, which enhances visual representations while preserving previously learned knowledge. Finally, we use ensemble methods to improve task performance. To alleviate the storage overhead of parallel ensembles of large models, we design an ensemble approach that shares the MLLM and switches only the LoRA matrices. In experiments, the proposed MLoEM shows superior performance on the test set.
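The shared-backbone ensemble described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single frozen weight matrix standing in for the MLLM, the usual LoRA update of the form W x + scale * B (A x), and simple output averaging as the combination rule, which the abstract does not specify. All names (`lora_forward`, `ensemble_forward`, `scale`) are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]
Adapter = Tuple[Matrix, Matrix]  # (A, B) with A: r x d_in, B: d_out x r


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(w: Matrix, adapter: Adapter, x: Vector, scale: float = 1.0) -> Vector:
    """Compute W x + scale * B (A x); the shared weight W is never modified."""
    a, b = adapter
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, delta)]


def ensemble_forward(w: Matrix, adapters: List[Adapter], x: Vector, scale: float = 1.0) -> Vector:
    """Run the shared base once per adapter and average the outputs.

    Averaging is an assumption made for illustration only.
    """
    outputs = [lora_forward(w, adapter, x, scale) for adapter in adapters]
    return [sum(column) / len(outputs) for column in zip(*outputs)]


# Toy setup: one shared 2x2 base weight and two rank-1 adapters.
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # adds x[0] to output[0]
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # adds x[1] to output[1]
]
x = [2.0, 4.0]
result = ensemble_forward(W, adapters, x)
print(result)  # [3.0, 6.0]
```

In this sketch each ensemble member stores only its small A and B matrices, so adding a member costs r * (d_in + d_out) parameters rather than a full copy of W, which is the storage saving the abstract refers to.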