PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs

Michael Dorkenwald, Nimrod Barazani, Cees G. M. Snoek, Yuki M. Asano

2024-02-14

Abstract: Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom, supervised training pipelines with bounding box annotations that integrate with VLMs, these result in specialized and hard-to-scale models. In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end, we introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM, unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performances on a variety of images, including Pascal VOC, COCO, LVIS, and diverse images like paintings or cartoons.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses how to unlock object localization in caption-based Vision-Language Models (VLMs) without supervised detection data and without changing the model's weights. Existing VLMs struggle with object localization because they are trained mostly on image-caption data that lacks explicit spatial grounding. The paper proposes a learnable spatial prompt called the Positional Insert (PIN): a small, input-agnostic set of parameters that is inserted into a frozen VLM to give it spatial awareness. The PIN module is trained with a simple next-token prediction task on synthetic data, in which rendered synthetic objects are placed on background images so that precise ground-truth locations are available. No new output heads are introduced. Experiments show strong zero-shot localization performance on image datasets such as Pascal VOC, COCO, and LVIS, as well as on paintings and cartoons. The paper also compares PIN with other methods and reports that PIN adds localization ability to the VLM without sacrificing its general capabilities.
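The synthetic-data idea can be illustrated with a minimal sketch: because the object is pasted at a position we choose ourselves, the ground-truth box is exact by construction and can be written out as text for next-token prediction. Everything specific here is an assumption for illustration, not the paper's implementation: the `make_sample` helper, the `SyntheticSample` container, the prompt template, and the bracketed pixel-coordinate format are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class SyntheticSample:
    prompt: str   # text given to the frozen VLM together with the composited image
    target: str   # box coordinates the model should emit token by token
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels


def make_sample(name, obj_w, obj_h, img_w, img_h, rng):
    """Place an obj_w x obj_h object at a random position on an img_w x img_h background."""
    if obj_w > img_w or obj_h > img_h:
        raise ValueError("object must fit inside the background")
    # The paste position is chosen here, so the box needs no human annotation.
    x_min = rng.randint(0, img_w - obj_w)
    y_min = rng.randint(0, img_h - obj_h)
    box = (x_min, y_min, x_min + obj_w, y_min + obj_h)
    # Hypothetical text format: pixel coordinates as a bracketed list.
    target = "[{}, {}, {}, {}]".format(*box)
    # Hypothetical prompt template.
    prompt = f"Where is the {name}?"
    return SyntheticSample(prompt, target, box)


sample = make_sample("mug", 64, 48, 224, 224, random.Random(0))
print(sample.prompt, sample.target)
```

In a full pipeline, the target string would be tokenized and used as the label sequence while only the PIN parameters receive gradients; that part depends on the specific VLM and is omitted here.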
