Abstract: Visual Language Models (VLMs) are now increasingly being merged with Large Language Models (LLMs) to enable new capabilities, particularly in terms of improved interactivity and open-ended responsiveness. While these are remarkable capabilities, the contribution of LLMs to enhancing the longstanding key problem of classifying an image among a set of choices remains unclear. Through extensive experiments involving seven models, ten visual understanding datasets, and multiple prompt variations per dataset, we find that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do. Yet at the same time, leveraging LLMs can improve performance on tasks requiring reasoning and outside knowledge. In response to these challenges, we propose a pragmatic solution: a lightweight fix involving a relatively small LLM that efficiently routes visual tasks to the most suitable model for the task. The LLM router undergoes training using a dataset constructed from more than 2.5 million examples of pairs of visual task and model accuracy. Our results reveal that this lightweight fix surpasses or matches the accuracy of state-of-the-art alternatives, including GPT-4V and HuggingGPT, while improving cost-effectiveness.

What problem does this paper attempt to address?

The main question this paper addresses is whether combining visual language models (VLMs) with large language models (LLMs), i.e., VLM+LLMs, improves performance on traditional image classification. Specifically, the authors run extensive experiments on a variety of visual understanding datasets and compare VLM+LLMs with VLMs that do not use LLMs, in order to explore the impact of LLMs on image classification. The paper considers three possible outcomes:

1. **VLM+LLMs outperform VLMs on all image classification tasks**: This is the most intuitive expectation, because VLM+LLMs currently hold the state-of-the-art position on most visual tasks.
2. **VLMs outperform VLM+LLMs on certain datasets**: If this is observed, further research is needed to determine under which circumstances VLMs perform better.
3. **VLMs always outperform VLM+LLMs on image classification tasks**: Although this outcome seems unexpected at first glance, the authors suggest that excessive language knowledge may interfere with the learning of visual representations, which could produce it.

Through experiments, the authors found that VLM+LLMs, despite their larger parameter counts, perform worse than VLMs on object and scene recognition tasks, but perform better on tasks requiring reasoning and external knowledge. Based on these observations, the authors propose a lightweight solution: a relatively small LLM that acts as a router and selects the most suitable model for each task. The router is trained to route visual tasks efficiently to the model best suited to handle them, thereby matching or exceeding the accuracy of existing state-of-the-art methods while remaining cost-effective.
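The routing idea can be illustrated with a minimal sketch. It is not the authors' method: the paper trains a small LLM as the router, whereas this stand-in uses a plain lookup table built from (task, model, accuracy) records. The model names, task labels, accuracy values, and the `build_routing_table` and `route` helpers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical records: (task description, model name, accuracy in [0, 1]).
# Every name and number here is invented for illustration, not taken from the paper.
RECORDS = [
    ("object recognition", "vlm_only", 0.80),
    ("object recognition", "vlm_plus_llm", 0.70),
    ("knowledge-based reasoning", "vlm_only", 0.40),
    ("knowledge-based reasoning", "vlm_plus_llm", 0.60),
]


def build_routing_table(records):
    """Map each task to the model with the highest mean recorded accuracy."""
    scores = defaultdict(lambda: defaultdict(list))
    for task, model, accuracy in records:
        scores[task][model].append(accuracy)

    table = {}
    for task, per_model in scores.items():
        # Iterating over sorted model names makes ties resolve alphabetically.
        table[task] = max(
            sorted(per_model),
            key=lambda m: sum(per_model[m]) / len(per_model[m]),
        )
    return table


def route(task, table, default="vlm_plus_llm"):
    """Return the model chosen for a task, or a fallback for unseen tasks."""
    return table.get(task, default)


if __name__ == "__main__":
    table = build_routing_table(RECORDS)
    for task in ("object recognition", "knowledge-based reasoning"):
        print(task, "->", route(task, table))
```

A lookup table only covers tasks it has already seen. A learned router, such as the small LLM described above, is presumably meant to generalize to new task descriptions, which this sketch does not attempt.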

Rethinking VLMs and LLMs for Image Classification
