Abstract: Pretrained vision-language models (VLMs), e.g., CLIP, are increasingly used to bridge the gap between open- and closed-vocabulary recognition in open-vocabulary image segmentation. As VLMs are generally pretrained on low-resolution images (e.g., $224\times224$), most previous methods operate only on downscaled images. We question this design, as low-resolution features often fail to preserve fine details. A typical solution is to employ additional image backbones for high-resolution inputs, but this also introduces significant computational overhead. Therefore, we propose MROVSeg, a multi-resolution training framework for open-vocabulary image segmentation with a single pretrained CLIP backbone. It uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which restores the spatial geometry and captures local-global correspondences across patches by interacting with multi-resolution features. To achieve accurate segmentation, we introduce a Multi-grained Masked Attention scheme that aggregates multi-grained semantics from multi-resolution CLIP features into object queries. Through comprehensive experiments, we demonstrate the superiority of MROVSeg on well-established open-vocabulary image segmentation benchmarks, establishing new standards for open-vocabulary image segmentation.

What problem does this paper attempt to address?

This paper addresses how to use pre-trained vision-language models (VLMs) such as CLIP effectively in open-vocabulary image segmentation, and in particular how to overcome the "resolution curse" these models face on high-resolution images. Because VLMs are usually pre-trained on low-resolution images, they perform poorly on tasks that depend on high-resolution detail. Most existing methods meet the VLM's input requirements by downsampling the input image, which discards details that matter for segmentation. Other methods feed the high-resolution input through an additional image backbone, which adds significant computational overhead.

To address these problems, the paper proposes MROVSeg, a multi-resolution training framework that uses a single pre-trained CLIP backbone to extract global and local features simultaneously. MROVSeg slices the high-resolution input with a sliding window into uniform patches that match the input size of the pre-trained image encoder, and introduces a Multi-Res Adapter to restore the spatial geometry and capture local-global correspondences across patches. For accurate segmentation, the paper also proposes a Multi-grained Masked Attention scheme that enhances object queries by aggregating multi-grained semantics from multi-resolution CLIP features.

In summary, the main contribution is a method that improves the capture of high-resolution detail in open-vocabulary image segmentation without the overhead of an additional image backbone, thereby improving overall segmentation performance.
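The sliding-window slicing described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `slice_into_windows` and `stitch_windows`, the zero-padding at the borders, and the default stride equal to the window size are all assumptions made for illustration. The sketch only shows how a high-resolution image can be cut into encoder-sized patches and how a grid of non-overlapping patches can be put back into its original spatial layout.

```python
import numpy as np


def slice_into_windows(image, window=224, stride=224):
    """Cut an H x W x C image into window x window patches (hypothetical helper).

    The image is zero-padded on the bottom and right so that the windows
    cover it completely. Returns the stacked patches in row-major order
    and the (rows, cols) grid they form.
    """
    h, w, _ = image.shape
    # Ceil-divide so that the last window still covers the image border.
    rows = max(1, -(-(h - window) // stride) + 1)
    cols = max(1, -(-(w - window) // stride) + 1)
    pad_h = (rows - 1) * stride + window - h
    pad_w = (cols - 1) * stride + window - w
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))
    patches = [
        padded[r * stride:r * stride + window, q * stride:q * stride + window]
        for r in range(rows)
        for q in range(cols)
    ]
    return np.stack(patches), (rows, cols)


def stitch_windows(patches, grid):
    """Reassemble non-overlapping patches (stride == window) into one array."""
    rows, cols = grid
    _, ph, pw, ch = patches.shape
    return (
        patches.reshape(rows, cols, ph, pw, ch)
        .transpose(0, 2, 1, 3, 4)  # (rows, ph, cols, pw, ch)
        .reshape(rows * ph, cols * pw, ch)
    )


# Usage: a 448 x 672 image yields a 2 x 3 grid of 224 x 224 patches.
image = np.random.rand(448, 672, 3).astype(np.float32)
patches, grid = slice_into_windows(image)
restored = stitch_windows(patches, grid)
```

Each patch has the input size the encoder was pre-trained on, so it could be passed through the same backbone as a downscaled copy of the full image. How the per-patch features are then fused is the role the paper assigns to the Multi-Res Adapter, which this sketch does not attempt to reproduce.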

MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

MsVRL: Self-Supervised Multiscale Visual Representation Learning Via Cross-Level Consistency for Medical Image Segmentation

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model

Expanding the Horizons: Exploring Further Steps in Open-Vocabulary Segmentation

M-segclip: Enhancing SegCLIP with RM-MLP for Open-Vocabulary Semantic Segmentation

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation

Towards Universal Vision-language Omni-supervised Segmentation

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

HyperSeg: Towards Universal Visual Segmentation with Large Language Model

FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Text4Seg: Reimagining Image Segmentation as Text Generation

Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model