Rethinking VLMs and LLMs for Image Classification

Avi Cooper, Keizo Kato, Chia-Hsien Shih, Hiroaki Yamane, Kasper Vinken, Kentaro Takemoto, Taro Sunagawa, Hao-Wei Yeh, Jin Yamanaka, Ian Mason, Xavier Boix
2024-10-04
Abstract: Visual Language Models (VLMs) are now increasingly being merged with Large Language Models (LLMs) to enable new capabilities, particularly in terms of improved interactivity and open-ended responsiveness. While these are remarkable capabilities, the contribution of LLMs to enhancing the longstanding key problem of classifying an image among a set of choices remains unclear. Through extensive experiments involving seven models, ten visual understanding datasets, and multiple prompt variations per dataset, we find that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do. Yet at the same time, leveraging LLMs can improve performance on tasks requiring reasoning and outside knowledge. In response to these challenges, we propose a pragmatic solution: a lightweight fix involving a relatively small LLM that efficiently routes visual tasks to the most suitable model for the task. The LLM router undergoes training using a dataset constructed from more than 2.5 million examples of pairs of visual task and model accuracy. Our results reveal that this lightweight fix surpasses or matches the accuracy of state-of-the-art alternatives, including GPT-4V and HuggingGPT, while improving cost-effectiveness.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks whether combining visual language models (VLMs) with large language models (LLMs), written VLM+LLMs, improves performance on the traditional task of image classification. The authors evaluate VLM+LLMs on a variety of visual understanding datasets through extensive experiments and compare them with VLMs that do not use an LLM, in order to isolate the effect of the LLM on image classification.

The paper considers three possible outcomes:

1. **VLM+LLMs outperform VLMs on all image classification tasks**: This is the most intuitive expectation, because VLM+LLMs currently hold the state of the art on most visual tasks.
2. **VLMs outperform VLM+LLMs on certain datasets**: If this is observed, further research is needed to determine under which circumstances VLMs perform better.
3. **VLMs always outperform VLM+LLMs on image classification tasks**: Although this seems unexpected at first glance, the authors suggest it could arise if excessive language knowledge interferes with the learning of visual representations.

The experiments show that VLM+LLMs, despite their larger parameter counts, perform worse than VLMs on object and scene recognition tasks, but better on tasks requiring reasoning and external knowledge. Based on these observations, the authors propose a lightweight solution: a relatively small LLM that acts as a router and selects the most suitable model for each task. The router is trained to send each visual task to the model best suited to handle it, matching or exceeding the performance of existing state-of-the-art methods while remaining cost-effective.
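
The routing idea can be illustrated with a minimal Python sketch. It is not the authors' method: the paper's router is a trained LLM, whereas this stand-in simply builds a lookup table from (task, model, accuracy) records and picks the model with the highest mean accuracy. The task names, the model names `vlm_only` and `vlm_plus_llm`, and all accuracy values are hypothetical placeholders, not results from the paper.

```python
from collections import defaultdict

# Hypothetical (task type, model name, accuracy) records.
# All names and numbers are made up for illustration only.
RECORDS = [
    ("object recognition", "vlm_only", 0.80),
    ("object recognition", "vlm_only", 0.90),
    ("object recognition", "vlm_plus_llm", 0.70),
    ("knowledge reasoning", "vlm_only", 0.40),
    ("knowledge reasoning", "vlm_plus_llm", 0.60),
]


def build_routing_table(records):
    """Map each task type to the model with the highest mean accuracy."""
    # (task, model) -> [sum of accuracies, number of records]
    totals = defaultdict(lambda: [0.0, 0])
    for task, model, accuracy in records:
        entry = totals[(task, model)]
        entry[0] += accuracy
        entry[1] += 1

    # task -> (best model so far, its mean accuracy)
    best = {}
    for (task, model), (total, count) in totals.items():
        mean = total / count
        if task not in best or mean > best[task][1]:
            best[task] = (model, mean)

    return {task: model for task, (model, _) in best.items()}


def route(task, table, default="vlm_plus_llm"):
    """Return the model chosen for a task, or a default for unseen tasks."""
    return table.get(task, default)


if __name__ == "__main__":
    table = build_routing_table(RECORDS)
    for task in ("object recognition", "knowledge reasoning", "unseen task"):
        print(task, "->", route(task, table))
```

A table like this cannot generalize to task descriptions it has never seen, which is presumably one reason to train a language model as the router instead; the fallback `default` argument here is only a placeholder for that behavior.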