Abstract: Visual relationship understanding has been studied separately in human-object interaction (HOI) detection, scene graph generation (SGG), and referring relationships (RR) tasks. Given the complexity and interconnectedness of these tasks, it is crucial to have a flexible framework that can address them in a cohesive manner. In this work, we propose FleVRS, a single model that seamlessly integrates the above three aspects in standard and promptable visual relationship segmentation, and further possesses the capability for open-vocabulary segmentation to adapt to novel scenarios. FleVRS leverages the synergy between text and image modalities to ground various types of relationships from images, and uses textual features from vision-language models for visual conceptual understanding. Empirical validation across various datasets demonstrates that our framework outperforms existing models in standard, promptable, and open-vocabulary tasks, e.g., +1.9 $mAP$ on HICO-DET, +11.4 $Acc$ on VRD, +4.7 $mAP$ on unseen HICO-DET. Our FleVRS represents a significant step towards a more intuitive, comprehensive, and scalable understanding of visual relationships.

What problem does this paper attempt to address?

This paper addresses the integration and flexible handling of several key tasks in visual relationship understanding. The authors propose a framework named FleVRS that combines three tasks: Human-Object Interaction (HOI) detection, panoptic Scene Graph Generation (SGG), and Referring Relationships (RR). Existing research usually handles these tasks separately, whereas FleVRS addresses them in a single unified model and also supports open-vocabulary segmentation to adapt to new scenarios.

### Specific Objectives:

1. **Multi-type Relationship Segmentation**:
   - Covers both human-centered relationships (such as a person riding a horse) and general relationships (such as a bench on the sidewalk). Relationships are defined as triplets: <subject, predicate, object>.
2. **Prompt-based Relationship Segmentation**:
   - Given different text prompts, the model outputs the specified entities and relationships, providing a more natural and intuitive user interface. For example, the prompt "<person, ride, horse>" selects that specific relationship in an image.
3. **Open-vocabulary Relationship Recognition**:
   - In real-world applications, the model should generalize to concepts not seen during training, without requiring new annotations. This includes detecting new objects, new relationships, and new combinations of the two.

### Shortcomings of Existing Methods:

- Existing Visual Relationship Segmentation (VRS) models have made progress in some aspects but do not yet provide a comprehensive solution. Most focus on a specific task, such as HOI detection or panoptic SGG, and lack dynamic prompt handling and open-vocabulary segmentation.
- Some models require additional pre-training datasets or cannot effectively handle multi-label interactions.

### Contributions of FleVRS:

1. **Flexible One-stage Framework**:
   - FleVRS is a flexible one-stage framework that performs standard, prompt-based, and open-vocabulary visual relationship segmentation in a single model.
2. **Unified Model Architecture**:
   - FleVRS uses SAM (Segment Anything Model) to unify different types of annotations into segmentation masks, and a query-based Transformer architecture to output triplets, which lets one model handle the different task types.
3. **Strong Generalization Ability**:
   - It performs well in both standard closed-set and open-vocabulary scenarios.

### Method Overview:

- **Standard VRS**:
  - Takes an image as input and outputs triplets (segmentation masks and categories) for all visual relationships of interest.
- **Prompt-based VRS**:
  - Accepts text prompts as input and outputs the triplets that match the prompts.
- **Open-vocabulary VRS**:
  - Uses the text encoder of the CLIP model to align visual features with textual knowledge, supporting the recognition and classification of new concepts.

### Experimental Results:

- **HOI Segmentation**:
  - Experiments on the HICO-DET and V-COCO datasets show that FleVRS outperforms existing methods on multiple metrics, most notably in open-vocabulary scenarios.
- **Panoptic SGG**:
  - Experiments on the PSG dataset also support the effectiveness of FleVRS, with strong results on recall and mean recall.

In conclusion, FleVRS offers a single, flexible framework for visual relationship understanding that generalizes across standard, prompt-based, and open-vocabulary settings.
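The prompt-based and open-vocabulary modes summarized above can be illustrated with a toy Python sketch. This is not the FleVRS implementation: the `Triplet` class, the `matches_prompt` and `classify_open_vocab` helpers, the treatment of an omitted prompt slot as a wildcard, and the hand-made 3-dimensional vectors standing in for CLIP text features are all assumptions made for illustration. The sketch only shows the general idea of filtering predicted triplets by a prompt and of choosing a label by cosine similarity between a visual embedding and text embeddings.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Triplet:
    """A <subject, predicate, object> relationship (masks omitted for brevity)."""
    subject: str
    predicate: str
    obj: str


def matches_prompt(triplet: Triplet,
                   subject: Optional[str] = None,
                   predicate: Optional[str] = None,
                   obj: Optional[str] = None) -> bool:
    """True if the triplet agrees with every prompt slot that is given.

    A slot left as None acts as a wildcard (an assumption of this sketch).
    """
    return ((subject is None or triplet.subject == subject)
            and (predicate is None or triplet.predicate == predicate)
            and (obj is None or triplet.obj == obj))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_open_vocab(visual: Sequence[float],
                        text_embeddings: Dict[str, Sequence[float]]) -> str:
    """Return the label whose text embedding is most similar to the visual one."""
    return max(text_embeddings,
               key=lambda label: cosine(visual, text_embeddings[label]))


# Prompt-based selection over hypothetical model outputs.
predictions = [
    Triplet("person", "ride", "horse"),
    Triplet("bench", "on", "sidewalk"),
]
selected = [t for t in predictions
            if matches_prompt(t, subject="person", predicate="ride")]

# Open-vocabulary scoring: toy vectors stand in for CLIP text features.
text_embeddings = {
    "ride": [1.0, 0.0, 0.0],
    "feed": [0.0, 1.0, 0.0],
    "wash": [0.0, 0.0, 1.0],
}
best = classify_open_vocab([0.9, 0.1, 0.2], text_embeddings)
print(selected, best)
```

Because the label set is just a dictionary of text embeddings, a new predicate can be added at inference time by embedding its name, which is the property that makes this style of classification open-vocabulary.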

Towards Flexible Visual Relationship Segmentation
