Differential-Perceptive and Retrieval-Augmented MLLM for Change Captioning

Xian Zhang, Haokun Wen, Jianlong Wu, Pengda Qin, Hui Xue', Liqiang Nie
DOI: https://doi.org/10.1145/3664647.3681453
2024-01-01
Abstract: Change captioning involves describing the subtle changes between a pair of similar images. Although existing efforts have achieved compelling success, they overlook the potential of multimodal large language models (MLLMs) in tackling this challenging task. In this work, we aim to empower MLLMs with the capability to perceive subtle differences between paired images and to enhance their performance in generating change captions. Specifically, we present a diFferentIal-perceptive aNd rEtRieval-augmented MLLM (FINER-MLLM) tailored for this task. FINER-MLLM uses the MLLM's LoRA-fine-tuned image encoder to extract image patch features, enabling the capture of detailed image information. Subsequently, within the MLLM's feature extraction module, typically a Q-Former, FINER-MLLM incorporates dual constraints: an intra-image feature independence constraint and an inter-image feature alignment constraint. These constraints ensure that the features comprehensively capture subtle visual information within each image and that corresponding features across images align effectively. Finally, we introduce retrieval augmentation, which first retrieves a relevant corpus to help the MLLM's decoder, i.e., the LLM, generate accurate change captions. Extensive experiments on three benchmark datasets, i.e., CLEVR-Change, Spot-the-Diff, and Image-Editing-Request, demonstrate the superiority of our proposed method.
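The abstract names the dual constraints and the retrieval step but does not give their formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the independence constraint can be approximated by penalizing cosine similarity between distinct features of one image, the alignment constraint by penalizing cosine distance between same-index features of the two images, and retrieval by cosine-similarity ranking over a corpus of precomputed text embeddings. All function names (`independence_penalty`, `alignment_penalty`, `retrieve_top_k`) are hypothetical.

```python
import math


def _cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def independence_penalty(feats):
    """Assumed intra-image term: mean squared cosine similarity over all
    distinct pairs of feature vectors from a single image.
    It is 0 when the features are mutually orthogonal."""
    n = len(feats)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += _cosine(feats[i], feats[j]) ** 2
    return total / (n * (n - 1) / 2)


def alignment_penalty(feats_a, feats_b):
    """Assumed inter-image term: mean cosine distance between the i-th
    feature of image A and the i-th feature of image B."""
    if len(feats_a) != len(feats_b):
        raise ValueError("both images must provide the same number of features")
    distances = [1.0 - _cosine(a, b) for a, b in zip(feats_a, feats_b)]
    return sum(distances) / len(distances)


def retrieve_top_k(query_emb, corpus, k=2):
    """Assumed retrieval step: rank (text, embedding) pairs by cosine
    similarity to the query embedding and return the k best texts."""
    ranked = sorted(corpus, key=lambda item: _cosine(query_emb, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

In a training loop, the two penalties would presumably be weighted and added to the captioning loss, and the retrieved texts would be placed in the LLM prompt; the abstract does not specify either detail, so both are left out of the sketch.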