ML2MG-VLCR: A Multimodal LLM Guided Zero-shot Method for Visio-linguistic Compositional Reasoning with Autoregressive Generative Language Model

Ziyu Gong, Chengcheng Mai, Yihua Huang
DOI: https://doi.org/10.1145/3652583.3658016
2024-01-01
Abstract: Visio-linguistic compositional reasoning is a challenging task that requires matching two images with two captions, where the images differ but the captions consist of the same words in a different order. Solving it requires the matching model to understand both the compositional structure of the image and the word order of the description text. Existing vision-language models, however, are insensitive to image structure and text order on such tasks and behave more like bag-of-words models. To address this challenge, we propose a zero-shot visio-linguistic compositional reasoning method that combines a multimodal LLM with an autoregressive generative language model. Given an image and candidate texts that differ in word order, we first use LLaVA to generate a description of the image, so that the compositional structure of the image is reflected in the order of the text. We then perform order-sensitive image-text matching by computing the generation probability of each candidate text conditioned on the textualized image information produced by LLaVA; here the autoregressive generative language model explicitly models and evaluates word order. Experimental results on VG-Relation, VG-Attribution, and Flickr30K-Order demonstrate the superiority of our method in understanding the compositional structure and order of images and texts.
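The matching step described in the abstract, scoring each candidate caption by its generation probability given a textual description of the image, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a smoothed bigram model estimated from the description stands in for the autoregressive generative language model, the description is hand-written rather than produced by LLaVA, and the names `sequence_log_prob` and `pick_caption` are hypothetical. It is not the authors' implementation; it only shows why left-to-right scoring separates captions that a bag-of-words comparison cannot.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical start-of-sequence marker


def sequence_log_prob(candidate, description, alpha=0.1):
    """Log-probability of `candidate`, generated token by token, under an
    add-alpha smoothed bigram model estimated from `description`.

    The bigram model is a toy stand-in for a real autoregressive LM."""
    desc = description.lower().split()
    cand = candidate.lower().split()

    # Count (previous token, next token) pairs seen in the description.
    history = [START] + desc[:-1]
    pair_counts = Counter(zip(history, desc))
    history_counts = Counter(history)
    vocab_size = len(set(desc) | set(cand))

    log_prob = 0.0
    prev = START
    for token in cand:
        # Chain rule: add log p(token | prev) for each position in order.
        numerator = pair_counts[(prev, token)] + alpha
        denominator = history_counts[prev] + alpha * vocab_size
        log_prob += math.log(numerator / denominator)
        prev = token
    return log_prob


def pick_caption(description, candidates):
    """Return the candidate with the highest conditional log-probability."""
    return max(candidates, key=lambda c: sequence_log_prob(c, description))


# Hand-written stand-in for the description a multimodal LLM might generate.
description = "the horse is eating the grass in a field"
candidates = [
    "the grass is eating the horse",
    "the horse is eating the grass",
]
best = pick_caption(description, candidates)
print(best)
```

Both candidates contain exactly the same words, so any order-insensitive score ties them; the sequential score differs because the transition from "grass" to "is" never occurs in the description. In a realistic setting the bigram model would be replaced by a pretrained autoregressive language model prompted with the generated description.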