Empowering Vision-Language Models for Reasoning Ability Through Large Language Models

Yueting Yang, Xintong Zhang, Jinan Xu, Wenjuan Han
DOI: https://doi.org/10.1109/icassp48485.2024.10446407
2024-01-01
Abstract: Vision-language models (VLMs) have shown excellent performance in vision-language tasks. However, they sometimes lack sufficient reasoning ability. In contrast, large language models (LLMs) have emerged with powerful reasoning capabilities. Therefore, we propose a framework called TReE, which transfers the reasoning ability of the LLM to the VLM in learning-free settings. TReE is a three-stage framework: observation, thinking, and re-thinking. In the observation stage, the VLM extracts overall visual information about the image. The thinking stage then combines this visual information with the task description to form the prompt for the LLM, which produces its thinking process (namely, the rationale). Lastly, the re-thinking stage draws useful information from the rationale and predicts the final result using the VLM. We are the first to explore enhancing the VLM’s reasoning ability without any training, finetuning, or access to the LLM’s parameters. We refer to this as a plug-in mode, and it makes the framework model-agnostic. Experiments show that TReE performed well on general visual question-answering (VQA) tasks and outperformed KOSMOS-1 on the challenging Raven IQ test dataset by 6%. Furthermore, with additional lightweight finetuning of a small number of parameters, TReE achieved a high accuracy of 81.7% on GQA and 67.3% on VQAv2.
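The three stages described in the abstract can be sketched as a short pipeline. This is only an illustrative sketch, not the authors' implementation: the function name `tree_answer`, the `VLM` and `LLM` callable types, and all prompt wording are assumptions introduced here. It also assumes that the re-thinking stage passes the rationale to the VLM as part of its prompt, which the abstract does not specify. Both models are treated as black boxes, in line with the plug-in mode.

```python
from typing import Callable

# Hypothetical black-box interfaces (assumptions, not from the paper).
VLM = Callable[[str, str], str]  # (image_path, prompt) -> generated text
LLM = Callable[[str], str]       # prompt -> generated text


def tree_answer(image: str, question: str, vlm: VLM, llm: LLM) -> str:
    """Run observation, thinking, and re-thinking with no training step."""
    # Stage 1, observation: the VLM gives overall visual information.
    observation = vlm(image, "Describe the image in detail.")

    # Stage 2, thinking: the LLM sees the visual information plus the
    # task description and writes out a rationale.
    thinking_prompt = (
        f"Image description: {observation}\n"
        f"Question: {question}\n"
        "Explain your reasoning step by step."
    )
    rationale = llm(thinking_prompt)

    # Stage 3, re-thinking: the VLM predicts the final result, here with
    # the rationale supplied in its prompt (assumed mechanism).
    rethinking_prompt = (
        f"Rationale: {rationale}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return vlm(image, rethinking_prompt)
```

Because the models are passed in as plain callables, any VLM or LLM wrapper with these signatures could be substituted without touching model parameters, which is the model-agnostic property the abstract claims.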