Abstract: In semantic segmentation, generalizing a visual system to both seen and novel categories at inference time is practically valuable yet challenging. To enable such functionality, existing methods mainly rely either on providing several support demonstrations from the visual aspect or on characterizing informative clues from the textual aspect (e.g., the class names). Nevertheless, both lines of work neglect the intrinsic complementarity of low-level visual and high-level language information, and explorations that treat the visual and textual modalities as a whole to improve predictions remain limited. To close this gap, we propose to encompass textual and visual clues as multi-modal prototypes that provide more comprehensive support for open-world semantic segmentation, and we build a novel prototype-based segmentation framework to realize this promise. Specifically, rather than straightforwardly combining bi-modal clues, we decompose the high-level language information into multi-aspect prototypes and aggregate the low-level visual information into more semantic prototypes, on the basis of which a fine-grained complementary fusion makes the multi-modal prototypes more powerful and accurate for prediction. Based on an elastic mask prediction module that permits any number and form of prototype inputs, we are able to solve the zero-shot, few-shot, and generalized counterpart tasks in one architecture. Extensive experiments on both the PASCAL-$5^i$ and COCO-$20^i$ datasets show the consistent superiority of the proposed method over previous state-of-the-art approaches, and a range of ablation studies dissects each component of our framework both quantitatively and qualitatively, verifying its effectiveness.
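The idea of a mask predictor that accepts any number of prototypes can be illustrated with a minimal sketch. This is not the framework proposed in the abstract: masked average pooling, cosine matching, the threshold, and every name below (`masked_average_pool`, `predict_mask`, the toy feature vectors) are assumptions chosen for illustration, and the textual prototype is a hand-written stand-in for a real text embedding.

```python
import math

def masked_average_pool(features, mask):
    """Average the feature vectors at positions where mask is 1 (one visual prototype)."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no positions")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def predict_mask(features, prototypes, threshold=0.5):
    """Mark a position as foreground if its best match to ANY prototype exceeds threshold.

    Because the score is a max over the prototype list, the same function handles
    a text-only list (zero-shot style), a visual-only list (few-shot style), or both.
    """
    if not prototypes:
        raise ValueError("at least one prototype is required")
    return [int(max(cosine(f, p) for p in prototypes) > threshold) for f in features]

# Toy data (hypothetical): flattened per-pixel feature vectors of dimension 2.
support_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
support_mask = [1, 1, 0]
query_features = [[1.0, 0.1], [0.0, 1.0], [0.8, 0.2]]

visual_proto = masked_average_pool(support_features, support_mask)
text_proto = [1.0, 0.0]  # stand-in for a class-name embedding (assumption)

zero_shot = predict_mask(query_features, [text_proto])
few_shot = predict_mask(query_features, [visual_proto])
combined = predict_mask(query_features, [text_proto, visual_proto])
```

Taking the maximum over the prototype list is only one simple way to stay agnostic to how many prototypes are supplied; the abstract's fine-grained fusion and learned mask prediction module are not reproduced here.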

Multi-Prototype Space Learning for Commonsense-Based Scene Graph Generation

PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation

Beware of Overcorrection: Scene-induced Commonsense Graph for Scene Graph Generation

Reasoning in Different Directions: Triplet Learning for Scene Graph Generation

Prototype-based Embedding Network for Scene Graph Generation

Attention Redirection Transformer with Semantic Oriented Learning for Unbiased Scene Graph Generation

Addressing Predicate Overlap in Scene Graph Generation with Semantic Granularity Controller

RA-SGG: Retrieval-Augmented Scene Graph Generation Framework via Multi-Prototype Learning

Counterfactual Critic Multi-Agent Training for Scene Graph Generation

Synergetic Prototype Learning Network for Unbiased Scene Graph Generation

Fast Contextual Scene Graph Generation with Unbiased Context Augmentation

Learning Visual Commonsense for Robust Scene Graph Generation

Scene Dynamics: Counterfactual Critic Multi-Agent Training for Scene Graph Generation

Multi-Modal Prototypes for Open-World Semantic Segmentation

Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation

Leveraging Predicate and Triplet Learning for Scene Graph Generation

Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge

Scene Graph Generation Via Multi-Relation Classification and Cross-Modal Attention Coordinator

Decomposed Prototype Learning for Few-Shot Scene Graph Generation

Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation

Towards Confidence-Aware Commonsense Knowledge Integration for Scene Graph Generation