Abstract: Recent Large Vision-Language Models (LVLMs) demonstrate impressive abilities on numerous image understanding and reasoning tasks. The task of fine-grained object classification (e.g., distinction between animal species), however, has been probed insufficiently, despite its downstream importance. We fill this evaluation gap by creating FOCI (**F**ine-grained **O**bject **C**lass**I**fication), a difficult multiple-choice benchmark for fine-grained object classification, from existing object classification datasets: (1) multiple-choice avoids ambiguous answers associated with casting classification as an open-ended QA task; (2) we retain classification difficulty by mining negative labels with a CLIP model. FOCI complements five popular classification datasets with four domain-specific subsets from ImageNet-21k. We benchmark 12 public LVLMs on FOCI and show that it tests for a complementary skill to established image understanding and reasoning benchmarks. Crucially, CLIP models exhibit dramatically better performance than LVLMs. Since the image encoders of LVLMs come from these CLIP models, this points to inadequate alignment for fine-grained object distinction between the encoder and the LLM and warrants (pre)training data with more fine-grained annotation. We release our code at https://github.com/gregor-ge/FOCI-Benchmark.

What problem does this paper attempt to address?

The main focus of this paper is to evaluate the performance of Large Vision-Language Models (LVLMs) on fine-grained object classification tasks and to explore how to improve these models to better handle such tasks. Specifically, the paper points out that although LVLMs perform excellently on various image understanding and reasoning tasks, their ability to distinguish fine-grained object categories (such as different animal species) has not been fully assessed. To fill this gap, the authors created a benchmark dataset called FOCI (Fine-grained Object ClassIfication) to evaluate the models' performance in fine-grained object recognition. The key contributions of the paper include:

1. **Proposing the FOCI benchmark**: This is a multiple-choice benchmark dataset designed to avoid issues present in open-ended question formats, such as answer uncertainty and the inability to provide all possible correct answers. By using the CLIP model to mine difficult choices for each test image, it ensures that the task remains challenging even with a limited number of candidate answers.
2. **Evaluating existing LVLMs**: A comprehensive evaluation of 12 publicly available LVLMs was conducted, revealing that many models (such as the popular LLaVA 1.5) perform poorly on fine-grained object classification tasks. Additionally, the paper observed that although some models perform similarly on traditional benchmarks, their results on FOCI show significant differences, indicating that fine-grained object classification is a skill distinct from general image understanding.
3. **Analyzing factors affecting model performance**: The study examined the impact of factors such as the amount of training data and model architecture on the performance of LVLMs in fine-grained object classification tasks. Notably, the paper found that a larger pre-training dataset is crucial for improving model performance on such tasks.
4. **Comparison between LVLMs and their corresponding CLIP models**: By comparing the performance of LVLMs with the CLIP image encoders they use, the paper highlights the importance of the choice of image encoder for LVLM performance and points out the current misalignment between image encoders and language models in LVLMs, especially when handling fine-grained object classification.

In summary, this paper aims to emphasize fine-grained object classification as an important visual recognition skill and introduces a new benchmark dataset, FOCI, to evaluate and improve the performance of LVLMs on such tasks.
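
The negative-mining idea behind the benchmark (using a CLIP model to pick difficult wrong choices for each test image) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that image and label-text embeddings have already been computed by some CLIP model, and the function names `mine_hard_negatives` and `build_question`, the toy embeddings, and the number of distractors are all hypothetical choices made for illustration.

```python
import numpy as np


def mine_hard_negatives(image_emb, label_embs, true_idx, num_negatives=3):
    """Return indices of the wrong labels most similar to the image.

    image_emb:  (d,) embedding of one image (assumed to come from a CLIP image encoder).
    label_embs: (n, d) embeddings of all class names (assumed CLIP text encoder).
    true_idx:   index of the ground-truth label in label_embs.
    """
    # Cosine similarity between the image and every candidate label.
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = label_embs @ image_emb
    # Exclude the correct label so it is never chosen as a distractor.
    sims[true_idx] = -np.inf
    # The most similar wrong labels are the hardest distractors.
    return np.argsort(-sims)[:num_negatives].tolist()


def build_question(image_emb, label_embs, labels, true_idx, num_negatives=3, seed=0):
    """Assemble one multiple-choice item: the true label plus mined distractors."""
    negatives = mine_hard_negatives(image_emb, label_embs, true_idx, num_negatives)
    options = [labels[true_idx]] + [labels[i] for i in negatives]
    # Shuffle so the correct answer does not always sit in the first position.
    order = np.random.default_rng(seed).permutation(len(options))
    return {"options": [options[i] for i in order], "answer": labels[true_idx]}


# Toy data (made up): three visually similar classes and two unrelated ones.
labels = ["barn swallow", "tree swallow", "cliff swallow", "sports car", "oak tree"]
label_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([1.0, 0.05, 0.0])  # an image of a barn swallow

question = build_question(image_emb, label_embs, labels, true_idx=0, num_negatives=2)
print(question)
```

In this toy setup the two other swallow species are selected as distractors rather than the unrelated classes, which is what keeps a multiple-choice item difficult even with only a few options.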

African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification