African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification

Gregor Geigle, Radu Timofte, Goran Glavaš
2024-06-21
Abstract: Recent Large Vision-Language Models (LVLMs) demonstrate impressive abilities on numerous image understanding and reasoning tasks. The task of fine-grained object classification (e.g., distinction between animal species), however, has been probed insufficiently, despite its downstream importance. We fill this evaluation gap by creating FOCI (Fine-grained Object ClassIfication), a difficult multiple-choice benchmark for fine-grained object classification, from existing object classification datasets: (1) multiple-choice avoids ambiguous answers associated with casting classification as an open-ended QA task; (2) we retain classification difficulty by mining negative labels with a CLIP model. FOCI complements five popular classification datasets with four domain-specific subsets from ImageNet-21k. We benchmark 12 public LVLMs on FOCI and show that it tests for a complementary skill to established image understanding and reasoning benchmarks. Crucially, CLIP models exhibit dramatically better performance than LVLMs. Since the image encoders of LVLMs come from these CLIP models, this points to inadequate alignment for fine-grained object distinction between the encoder and the LLM and warrants (pre)training data with more fine-grained annotation. We release our code at https://github.com/gregor-ge/FOCI-Benchmark.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The main focus of this paper is to evaluate the performance of Large Vision-Language Models (LVLMs) on fine-grained object classification tasks and to explore how to improve these models to better handle such tasks. Specifically, the paper points out that although LVLMs perform excellently on various image understanding and reasoning tasks, their ability to distinguish fine-grained object categories (such as different animal species) has not been fully assessed. To fill this gap, the authors created a benchmark dataset called FOCI (Fine-grained Object ClassIfication) to evaluate the models' performance in fine-grained object recognition. The key contributions of the paper include:

1. **Proposing the FOCI benchmark**: This is a multiple-choice benchmark dataset designed to avoid issues present in open-ended question formats, such as answer uncertainty and the inability to provide all possible correct answers. By using the CLIP model to mine difficult choices for each test image, it ensures that the task remains challenging even with a limited number of candidate answers.
2. **Evaluating existing LVLMs**: A comprehensive evaluation of 12 publicly available LVLMs was conducted, revealing that many models (such as the popular LLaVA 1.5) perform poorly on fine-grained object classification tasks. Additionally, the paper observed that although some models perform similarly on traditional benchmarks, their results on FOCI show significant differences, indicating that fine-grained object classification is a skill distinct from general image understanding.
3. **Analyzing factors affecting model performance**: The study examined the impact of factors such as the amount of training data and model architecture on the performance of LVLMs in fine-grained object classification tasks. Notably, the paper found that a larger pre-training dataset is crucial for improving model performance on such tasks.
4. **Comparison between LVLMs and their corresponding CLIP models**: By comparing the performance of LVLMs with the CLIP image encoders they use, the paper highlights the importance of the choice of image encoder for LVLM performance and points out the current misalignment between image encoders and language models in LVLMs, especially when handling fine-grained object classification.

In summary, this paper aims to emphasize fine-grained object classification as an important visual recognition skill and introduces a new benchmark dataset, FOCI, to evaluate and improve the performance of LVLMs on such tasks.
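
The hard-negative mining behind the multiple-choice format can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the linked repository). It assumes each image and each class label has already been embedded by a CLIP-style model, and that negatives are ranked by image-to-label cosine similarity. The function names `mine_hard_negatives` and `build_question`, the toy three-dimensional vectors, and the choice of three negatives per question are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_hard_negatives(image_emb, label_embs, true_label, k=3):
    """Return the k wrong labels most similar to the image embedding."""
    scored = [
        (cosine(image_emb, emb), label)
        for label, emb in label_embs.items()
        if label != true_label  # the correct label is never a negative
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


def build_question(image_emb, label_embs, true_label, k=3, seed=0):
    """Combine the true label with k hard negatives and shuffle the options."""
    options = mine_hard_negatives(image_emb, label_embs, true_label, k)
    options.append(true_label)
    random.Random(seed).shuffle(options)  # seeded so the question is reproducible
    return {"options": options, "answer": options.index(true_label)}


# Toy embeddings standing in for CLIP text features (assumed, not real data).
label_embs = {
    "barn swallow": [0.90, 0.10, 0.00],
    "tree swallow": [0.80, 0.20, 0.10],
    "cliff swallow": [0.85, 0.15, 0.05],
    "purple martin": [0.70, 0.30, 0.10],
    "sports car": [0.00, 0.10, 0.90],
}
image_emb = [0.88, 0.12, 0.02]  # stand-in for the CLIP feature of a swallow photo

negatives = mine_hard_negatives(image_emb, label_embs, "barn swallow")
question = build_question(image_emb, label_embs, "barn swallow")
print(negatives)
print(question)
```

In this toy setup the three bird labels outrank "sports car" as distractors, which is the point of the mining step: the wrong options are the ones a CLIP-style model finds most plausible, so a small option set still makes for a hard question.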