Infrared-LLaVA: Enhancing Understanding of Infrared Images in Multi-Modal Large Language Models

Shixin Jiang, Zerui Chen, Jiafeng Liang, Yanyan Zhao, Ming Liu, Bing Qin
DOI: https://doi.org/10.18653/v1/2024.findings-emnlp.501
2024-01-01
Abstract: Expanding the understanding capabilities of multi-modal large language models (MLLMs) to the infrared modality is a challenge due to the single-modality nature and limited amount of training data. Existing methods typically construct a uniform embedding space for cross-modal alignment and leverage abundant visible image data to indirectly understand infrared images. However, they ignore the supervisory signals of infrared-modality-specific attributes, which may lead to a biased understanding of infrared images. To address this issue, we propose a debating multi-agent generation system that transfers knowledge from visible images to generate infrared image-text pairs and infrared instruction data. Moreover, we construct an infrared question-answering benchmark based on common infrared tasks. Experimental results from incremental fine-tuning on existing models and from our Infrared-LLaVA-7B, trained from scratch on infrared data, demonstrate the effectiveness of the generated data and the feasibility of the generation approach.