MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

Yuanbing Zhu, Bingke Zhu, Yingying Chen, Yunfang Niu, Ming Tang, Jinqiao Wang
2024-11-27
Abstract: Pretrained vision-language models (VLMs), e.g., CLIP, are increasingly used to bridge the gap between open- and closed-vocabulary recognition in open-vocabulary image segmentation. As VLMs are generally pretrained with low-resolution images (e.g., $224\times224$), most previous methods operate only on downscaled images. We question this design, as low-resolution features often fail to preserve fine details. A typical solution is to employ additional image backbones for high-resolution inputs, but this also introduces significant computational overhead. Therefore, we propose MROVSeg, a multi-resolution training framework for open-vocabulary image segmentation with a single pretrained CLIP backbone, which uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which restores the spatial geometry and grasps local-global correspondences across patches by interacting with multi-resolution features. To achieve accurate segmentation, we introduce a Multi-grained Masked Attention scheme to aggregate multi-grained semantics from multi-resolution CLIP features to object queries. Through comprehensive experiments, we demonstrate the superiority of MROVSeg on well-established open-vocabulary image segmentation benchmarks, establishing new standards for open-vocabulary image segmentation.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the "resolution curse" that pretrained vision-language models (VLMs) such as CLIP face in open-vocabulary image segmentation. Because VLMs are usually pretrained on low-resolution images, they perform poorly on tasks that depend on high-resolution detail. Most existing methods downsample the input image to fit the VLM's input size, which discards fine details that matter for segmentation. Other methods add a separate image backbone to process high-resolution input, but this introduces significant computational overhead. To address both problems, the paper proposes MROVSeg, a multi-resolution training framework that uses a single pretrained CLIP backbone to extract global and local features together. MROVSeg slices the high-resolution input with sliding windows into uniform patches that match the input size of the pretrained image encoder, and introduces a Multi-Res Adapter to restore spatial geometry and capture local-global correspondences across patches. For accurate segmentation, it also proposes a Multi-grained Masked Attention scheme, which enriches object queries by aggregating multi-grained semantics from multi-resolution CLIP features. In summary, the main contribution is a method that captures high-resolution detail in open-vocabulary image segmentation without adding a second image backbone, and the authors report improved results on established open-vocabulary segmentation benchmarks.
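The sliding-window slicing described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `plan_windows`, the default 224-pixel window, the non-overlapping stride, and the choice to pad the bottom and right edges are all assumptions made for illustration. It only computes where each encoder-sized window would sit on a high-resolution input.

```python
def plan_windows(height, width, window=224, stride=224):
    """Plan sliding windows over a height x width image.

    Hypothetical helper: window size, stride, and edge padding are
    illustrative assumptions, not values taken from the paper.
    Returns the padded image size and a list of boxes
    (top, left, bottom, right), each exactly window x window.
    """
    def padded(n):
        # Smallest size >= n that the windows tile exactly.
        if n <= window:
            return window
        steps = -(-(n - window) // stride)  # ceiling division
        return window + steps * stride

    padded_h, padded_w = padded(height), padded(width)
    boxes = [
        (top, left, top + window, left + window)
        for top in range(0, padded_h - window + 1, stride)
        for left in range(0, padded_w - window + 1, stride)
    ]
    return (padded_h, padded_w), boxes


if __name__ == "__main__":
    size, boxes = plan_windows(448, 448)
    print(size, len(boxes))
```

Under these assumptions, a 448×448 input needs no padding and yields four 224×224 windows. Each window could then be passed through the same encoder, alongside a downscaled copy of the full image for global context; how the per-window features are merged back into one spatial map is the role the paper assigns to the Multi-Res Adapter and is not modeled here.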