Fine-Grained Visual Text Prompting

Lingfeng Yang, Xiang Li, Yueze Wang, Xinlong Wang, Jian Yang
DOI: https://doi.org/10.1109/tpami.2024.3504568
IF: 23.6
2024-01-01
IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract: Vision-Language Models (VLMs), such as CLIP, excel at zero-shot image-level visual understanding but struggle with object-based tasks that require precise localization and recognition. Visual prompts, such as colorful boxes or circles, have been proposed to enhance local perception. However, these prompts often include irrelevant and noisy pixels, leading to suboptimal performance. The design of better visual prompts and their collaboration with text prompting remain underexplored. This paper introduces Fine-Grained Visual Text Prompting (FGVTP), a new zero-shot framework for object-based tasks that uses precise semantic masks and reinforced image-text alignment. FGVTP comprises Fine-Grained Visual Prompting (FGVP) and Consistency-Enhanced Text Prompting (CETP). Specifically, we carefully study visual prompting designs by exploring more visual markings that vary in shape and form. FGVP uses semantic masks from a segmenter such as the Segment Anything Model (SAM) and employs background blurring (Blur Reverse Mask) to highlight targets while maintaining spatial coherence. Further, CETP enhances image-text alignment by prompting captions based on FGVP-processed images. As a result, FGVTP achieves superior zero-shot referring expression comprehension on the RefCOCO/+/g benchmarks, outperforming previous SOTA methods by 5.8% on average. Part detection experiments conducted on the PACO dataset further validate the advantage of FGVTP over existing works. Code is available at https://github.com/ylingfeng/FGVP.
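The Blur Reverse Mask idea named in the abstract can be illustrated with a minimal sketch: pixels inside the target mask are kept sharp, and everything outside is blurred. The sketch below is not the authors' implementation. It assumes a grayscale image stored as a nested list, a binary mask that is already available (in the paper it would come from a segmenter such as SAM), and a simple box blur standing in for whatever blur kernel the released code uses. The names `box_blur` and `blur_reverse_mask` are hypothetical.

```python
def box_blur(image, radius=1):
    """Mean filter over a (2*radius+1)^2 window, clipped at the borders.

    A box blur is an assumption made for brevity; it is a stand-in for
    the blur used in the paper, which is not specified in the abstract.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total = sum(image[j][i] for j in ys for i in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out


def blur_reverse_mask(image, mask, radius=1):
    """Keep pixels where mask is truthy; blur all pixels outside the mask."""
    blurred = box_blur(image, radius)
    h, w = len(image), len(image[0])
    return [
        [float(image[y][x]) if mask[y][x] else blurred[y][x] for x in range(w)]
        for y in range(h)
    ]


# Toy 4x4 grayscale image with a bright 2x2 "object" in the center.
image = [
    [0, 0, 0, 0],
    [0, 100, 100, 0],
    [0, 100, 100, 0],
    [0, 0, 0, 0],
]
# Hypothetical binary mask marking the object (1 = target, 0 = background).
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
prompted = blur_reverse_mask(image, mask, radius=1)
```

In this toy case the masked center keeps its original values while the background is smoothed, which is the sense in which the target is highlighted without cropping away its spatial context. A practical pipeline would apply the same compositing to RGB images with an image library before passing the result to the VLM.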