Abstract: Replicating the innate human ability to detect all objects based on free-form texts at any granularity remains a formidable challenge for Large Vision Language Models (LVLMs). Current LVLMs are predominantly constrained to locate a single, pre-existing object. This limitation leads to a compromise in model design, necessitating the introduction of visual expert models or customized head structures. Beyond these constraints, our research uncovers LVLMs' capability for basic object perception, allowing them to accurately identify and locate objects of interest. Building on this insight, we introduce a novel Language-prompted Localization Dataset to fully unleash the capabilities of LVLMs in fine-grained object perception and precise location awareness. More importantly, we present Griffon, a purely LVLM-based baseline, which does not introduce any special tokens, expert models, or additional detection modules. It simply maintains a consistent structure with popular LVLMs by unifying data formats across various localization-related scenarios and is trained end-to-end through a well-designed pipeline. Comprehensive experiments demonstrate that Griffon not only achieves state-of-the-art performance on the fine-grained RefCOCO series and Flickr30K Entities but also approaches the capabilities of the expert model Faster RCNN on the detection benchmark MSCOCO. Data, codes, and models are released at https://github.com/jefferyZhan/Griffon.

What problem does this paper attempt to address?

This paper addresses the limited ability of current Large Vision Language Models (LVLMs) to perceive objects at a fine-grained level and to localize them, especially when multiple objects or complex scenes are involved. Existing LVLMs can usually locate only a single, pre-existing object and perform poorly on multi-target descriptions (such as pronouns, categories, or phrases). They also cannot reject objects that do not exist in the image. These problems limit the generality and flexibility of LVLMs in practical applications.

To overcome these challenges, the authors propose Griffon, a baseline built purely on an LVLM, with no special tokens, expert models, or additional detection modules. Instead, it localizes objects at any granularity through a unified data format and a carefully designed training process. Specifically, the contributions of Griffon include:

1. **Constructed a new language-prompted localization dataset**: The dataset contains nearly 6 million basic pre-training samples and 900,000 instruction-following samples, covering all four possible localization-related scenarios and more than 76,000 object categories. It aims to comprehensively improve the ability of LVLMs to locate multiple objects simultaneously in complex scenes.
2. **Proposed the Griffon model**: A unified LVLM baseline that can locate all objects described by free-form input text. It has a streamlined architecture and a unified input-output representation, without any special tokens, prior knowledge, or additional detection heads.
3. **Designed a two-stage training process**: The first stage is basic pre-training, which improves fine-grained multi-object localization. The second stage is instruction tuning across all scenarios, which significantly improves the model's understanding of user intent.
4. **Introduced a training-free confidence scoring mechanism**: The mechanism ranks the detected objects so that more confident detections are given priority.

With these contributions, Griffon achieves state-of-the-art performance on the fine-grained RefCOCO series and Flickr30K Entities, and its performance on the detection benchmark MSCOCO approaches that of expert models. These results support the effectiveness of Griffon in enhancing the fine-grained object perception and spatial localization capabilities of LVLMs.
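The confidence scoring idea mentioned above can be illustrated with a minimal sketch. The summary does not say how Griffon computes its scores, so everything here is an assumption made for illustration: the `Detection` container, the box layout, and the scoring rule (the geometric mean of the probabilities of the tokens that spelled out each detection). It only shows how detections could be ranked without training an extra scoring head.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """Hypothetical container for one object spelled out by the model."""
    label: str
    box: List[float]             # assumed layout: [x1, y1, x2, y2]
    token_logprobs: List[float]  # log-probabilities of the tokens in this detection


def confidence(det: Detection) -> float:
    """Assumed scoring rule: geometric mean of the token probabilities."""
    if not det.token_logprobs:
        return 0.0
    mean_logprob = sum(det.token_logprobs) / len(det.token_logprobs)
    return math.exp(mean_logprob)


def rank_detections(dets: List[Detection]) -> List[Detection]:
    """Sort detections from most to least confident."""
    return sorted(dets, key=confidence, reverse=True)


# Made-up values, for illustration only.
detections = [
    Detection("cat", [200, 40, 300, 180], [-0.90, -1.20, -0.70]),
    Detection("dog", [10, 20, 110, 220], [-0.05, -0.10, -0.02]),
    Detection("ball", [50, 60, 80, 90], [-0.30, -0.20, -0.40]),
]
ranked = rank_detections(detections)
for det in ranked:
    print(f"{det.label}: {confidence(det):.3f}")
```

Because the score is derived only from probabilities the model already produces while generating, no additional parameters need to be trained. Whether Griffon uses this exact rule is not stated in the summary.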

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models

African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification

Pixel Aligned Language Models

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Democratizing Fine-grained Visual Recognition with Large Language Models

Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning

LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

CogVLM: Visual Expert for Large Language Models

Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Towards Vision-Language Geo-Foundation Model: A Survey

OV-VG: A Benchmark for Open-Vocabulary Visual Grounding

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Tell Me Where You Are: Multimodal LLMs Meet Place Recognition

TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens

GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model