Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models

Jeonghwan Kim, Heng Ji
2024-10-04
Abstract: Recent advances in instruction-tuned Large Vision-Language Models (LVLMs) have imbued the models with the ability to generate high-level, image-grounded explanations with ease. While such capability is largely attributed to the rich world knowledge contained within the Large Language Models (LLMs), our work reveals their shortcomings in fine-grained visual categorization (FGVC) across six different benchmark settings. Most recent state-of-the-art LVLMs like LLaVA-1.5, InstructBLIP and GPT-4V not only severely deteriorate in terms of classification performance, e.g., an average drop of 65.58 in EM for Stanford Dogs for LLaVA-1.5, but also struggle to generate an accurate explanation with detailed attributes based on the concept that appears within an input image, despite their capability to generate holistic image-level descriptions. In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, showing a discrepancy when given textual and visual inputs that correspond to the same concept, which prevents the image modality from leveraging the rich parametric knowledge within the LLMs. In an effort to further the community's endeavor in this direction, we propose a multiple granularity attribute-centric evaluation benchmark, Finer, which aims to establish a ground to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses the poor performance of Large Vision-Language Models (LVLMs) on fine-grained visual categorization (FGVC) tasks. Although these models generate high-level image interpretations well, their performance drops significantly when they must distinguish between fine-grained categories, such as different breeds of dogs or specific models of airplanes. Specifically, the paper points out:

1. **Insufficient fine-grained visual classification ability**: Even state-of-the-art LVLMs such as LLaVA-1.5, InstructBLIP and GPT-4V show a marked performance drop on fine-grained visual classification. For example, the exact match (EM) score of LLaVA-1.5 on the Stanford Dogs dataset drops by an average of 65.58.
2. **Modality gap**: The models behave inconsistently when given textual and visual inputs that correspond to the same concept. In particular, when generating descriptive visual attributes, they cannot fully use the image input to infer the fine-grained concept, so the image modality fails to draw on the knowledge stored in the underlying LLM.
3. **Lack of fine-grained image understanding**: LVLMs struggle to generate accurate, detailed descriptions of fine-grained concepts, which indicates that their grasp of fine-grained details in images is limited.

To address these problems, the paper proposes a new benchmark and training mixture called Finer. Finer is designed to evaluate LVLMs on fine-grained visual concept recognition and to help alleviate the modality gap, thereby improving the models' fine-grained image understanding. Specific contributions include:

- **Revealing the deficiencies of LVLMs in fine-grained image understanding**: The paper presents itself as the first systematic exploration of LVLM performance on fine-grained visual classification tasks.
- **Analyzing the existence of the modality gap**: Extensive experiments demonstrate the discrepancy in how LVLMs handle textual and visual inputs for the same concept.
- **Constructing a new fine-grained concept recognition benchmark**: Finer contains concept labels and visual attributes at multiple granularity levels for evaluating the fine-grained image understanding of LVLMs.

Through these contributions, the paper offers directions for future research on improving LVLM performance in fine-grained visual tasks.
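
The exact match (EM) drop cited above can be illustrated with a minimal sketch. It assumes EM is the percentage of predictions that equal the gold label after simple normalization; the paper's exact normalization rules are not given here, so `normalize`, `exact_match`, and the toy predictions below are hypothetical and only show how a drop between a coarse and a fine granularity level could be computed.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace.

    Assumption: this normalization is illustrative, not the paper's own.
    """
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions equal to their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


# Hypothetical toy outputs for the same four dog images at two granularities.
coarse_preds = ["Dog", "dog", "Dog.", "cat"]
coarse_refs = ["dog", "dog", "dog", "dog"]
fine_preds = ["Golden Retriever", "labrador", "Beagle", "poodle"]
fine_refs = ["golden retriever", "golden retriever", "beagle", "whippet"]

coarse_em = exact_match(coarse_preds, coarse_refs)
fine_em = exact_match(fine_preds, fine_refs)
em_drop = coarse_em - fine_em  # drop in EM when moving from coarse to fine labels

print(f"coarse EM = {coarse_em:.2f}, fine EM = {fine_em:.2f}, drop = {em_drop:.2f}")
```

Strict string equality keeps the metric easy to interpret, but it penalizes synonyms and partial answers, so a real evaluation would likely need a label-alias list or a softer matching rule.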