Abstract: Recent advances in instruction-tuned Large Vision-Language Models (LVLMs) have imbued the models with the ability to generate high-level, image-grounded explanations with ease. While such capability is largely attributed to the rich world knowledge contained within the Large Language Models (LLMs), our work reveals their shortcomings in fine-grained visual categorization (FGVC) across six different benchmark settings. Recent state-of-the-art LVLMs such as LLaVA-1.5, InstructBLIP and GPT-4V not only deteriorate severely in classification performance, e.g., an average drop of 65.58 in EM on Stanford Dogs for LLaVA-1.5, but also struggle to generate an accurate explanation with detailed attributes of the concept that appears in an input image, despite their capability to generate holistic image-level descriptions. In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap: they behave inconsistently when given textual and visual inputs that correspond to the same concept, which prevents the image modality from leveraging the rich parametric knowledge within the LLMs. In an effort to further the community's endeavor in this direction, we propose a multiple granularity attribute-centric evaluation benchmark, Finer, which aims to establish a ground to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.

What problem does this paper attempt to address?

The paper addresses the poor performance of large vision-language models (LVLMs) on fine-grained visual categorization (FGVC). Although these models readily generate high-level image explanations, their performance drops significantly when they must distinguish fine-grained categories, such as different dog breeds or specific airplane models. Specifically, the paper points out:

1. **Insufficient fine-grained visual classification ability**: Even state-of-the-art LVLMs such as LLaVA-1.5, InstructBLIP and GPT-4V show a clear performance drop on fine-grained visual classification. For example, the exact match (EM) score of LLaVA-1.5 on the Stanford Dogs dataset drops by 65.58 on average.
2. **Modality gap**: The models handle textual and visual inputs differently. Given a textual input and a visual input that refer to the same concept, a model behaves inconsistently. In particular, when generating descriptive visual attributes, it cannot fully use the image input to infer the fine-grained concept.
3. **Lack of fine-grained image understanding**: LVLMs perform poorly when generating detailed descriptions of fine-grained concepts, which indicates that they are limited in understanding the fine-grained details in images.

To address these problems, the paper proposes FINER, a new benchmark and training mixture. FINER aims to evaluate LVLMs on fine-grained visual concept recognition and to provide a way to alleviate the modality gap, thereby improving the models' fine-grained image understanding. Specific contributions include:

- **Revealing the deficiencies of LVLMs in fine-grained image understanding**: This is the first systematic exploration of LVLM performance on fine-grained visual classification tasks.
- **Analyzing the existence of the modality gap**: Extensive experiments demonstrate the differences in how LVLMs handle textual and visual inputs.
- **Constructing a new fine-grained concept recognition benchmark**: FINER contains concept labels and visual attributes at multiple granularity levels for evaluating the fine-grained image understanding of LVLMs.

Through these contributions, the paper offers directions for future research on improving LVLM performance in fine-grained visual tasks.
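To make the exact match (EM) figures cited in this summary more concrete, the following is a minimal sketch of how an EM score, and the drop between a coarse and a fine granularity level, could be computed. The normalization rule, the function names `normalize` and `exact_match`, and the toy predictions are all assumptions made for illustration; the paper's actual evaluation protocol may differ.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace.

    This normalization is an assumption; the paper may use a different rule.
    """
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(predictions, references) -> float:
    """Return the percentage of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


# Hypothetical toy outputs for the same four images at two granularity levels.
coarse_preds = ["Dog", "dog", "Dog.", "cat"]
coarse_refs = ["dog", "dog", "dog", "dog"]
fine_preds = ["Golden Retriever", "Labrador", "beagle", "poodle"]
fine_refs = ["golden retriever", "golden retriever", "beagle", "pug"]

coarse_em = exact_match(coarse_preds, coarse_refs)
fine_em = exact_match(fine_preds, fine_refs)
em_drop = coarse_em - fine_em  # gap between coarse-level and fine-level EM

print(f"coarse EM = {coarse_em:.2f}, fine EM = {fine_em:.2f}, drop = {em_drop:.2f}")
```

In this toy setup the model names the coarse category for most images but misses half of the fine-grained labels, which is the kind of gap the reported EM drop summarizes.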

Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models
