Abstract: Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we propose a novel architecture, inference method, and relation prediction model. Our proposed solution, the REACT model, achieves the highest inference speed among existing SGG models, improving object detection accuracy without sacrificing relation prediction performance. Compared to state-of-the-art approaches, REACT is 2.7 times faster (with a latency of 23 ms) and improves object detection accuracy by 58.51%. Furthermore, our proposal significantly reduces model size, with an average of 5.5x fewer parameters. Code is available at https://github.com/Maelic/SGG-Benchmark

What problem does this paper attempt to address?

This paper addresses the trade-off between performance and inference speed in the Scene Graph Generation (SGG) task. SGG encodes the visual relationships between objects in an image and represents them as a graph structure. Although SGG has shown great potential for downstream tasks such as reasoning for embodied agents, real-time applications require a balance between performance and inference speed. Current methods usually focus on only one or two of the following aspects:

1. **Improving the accuracy of relation prediction**.
2. **Enhancing the accuracy of object detection**.
3. **Reducing latency**.

Few methods optimize all three goals simultaneously. For this reason, the authors propose a new architecture, a new inference method, and a new relation prediction model, REACT. The model significantly improves object detection accuracy and speeds up inference without sacrificing relation prediction performance.

### Main contributions

1. **New architecture (Decoupled Two-Stage, DTS)**: Decoupling the two-stage architecture allows a real-time object detector (such as YOLO) to be combined with a state-of-the-art two-stage relation prediction component, reducing latency by up to 10 times.
2. **New inference method (Dynamic Candidate Selection, DCS)**: Dynamically selecting the optimal number of candidate objects reduces latency further without affecting accuracy.
3. **New model (REACT)**: Combining the DTS architecture and the DCS method, REACT performs well on standard metrics, with a latency of only 23 milliseconds (44 FPS) and a 5.5-fold reduction in the number of parameters.
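To make the effect of candidate selection concrete, the following is a minimal Python sketch, not the paper's DCS procedure. It assumes one plausible reading: keep only detections above a confidence threshold, capped at a maximum count, instead of always keeping a fixed top-k. The names `select_fixed`, `select_dynamic`, `min_conf`, and `max_k` are hypothetical. The sketch also counts ordered (subject, object) pairs, which grow quadratically with the number of candidates.

```python
from typing import List, Tuple

# Hypothetical minimal detection record: (label, confidence).
Detection = Tuple[str, float]


def select_fixed(detections: List[Detection], k: int) -> List[Detection]:
    """Keep the k highest-confidence detections, whatever their scores are."""
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    return ranked[:k]


def select_dynamic(
    detections: List[Detection], min_conf: float, max_k: int
) -> List[Detection]:
    """Keep detections scoring at least min_conf, capped at max_k.

    Assumption: this thresholding rule is an illustration only; the
    source does not specify how DCS picks the number of candidates.
    """
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = [d for d in ranked if d[1] >= min_conf]
    return kept[:max_k]


def num_candidate_pairs(n: int) -> int:
    """Number of ordered (subject, object) pairs among n distinct objects."""
    return n * (n - 1)


detections = [
    ("person", 0.95),
    ("horse", 0.90),
    ("hat", 0.60),
    ("tree", 0.20),
    ("rock", 0.10),
]

fixed = select_fixed(detections, k=4)
dynamic = select_dynamic(detections, min_conf=0.5, max_k=4)

print(len(fixed), num_candidate_pairs(len(fixed)))      # 4 candidates, 12 pairs
print(len(dynamic), num_candidate_pairs(len(dynamic)))  # 3 candidates, 6 pairs
```

In this toy input, dropping one low-confidence candidate halves the number of pairs the relation predictor would have to score, which is why the number of candidates has a large effect on latency.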
### Specific problems solved

- **Object detection accuracy**: Traditional two-stage SGG methods decode object labels a second time in the relation prediction stage, which is redundant and lowers object detection accuracy. Freezing the regression and classification heads of the object detector and performing non-maximum suppression (NMS) before relation prediction preserves the detector's accuracy.
- **Candidate object selection**: Traditional methods usually select a fixed number of high-confidence candidate objects, which increases computational complexity. With the DCS method, the optimal number of candidate objects is selected dynamically during inference, reducing the computational burden.
- **Latency**: Models such as PE-NET include many redundant steps in feature extraction and context learning, which increases latency. Simplifying the feature processing flow and removing unnecessary feature up-sampling and fusion operations reduces the overall latency of the model.

In summary, this paper proposes a new SGG method that aims for the best balance between performance and speed in real-time applications.
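The NMS step mentioned above can be sketched as a greedy procedure in plain Python. This is a generic, class-agnostic illustration rather than the code used in REACT; the 0.5 IoU threshold and the `(x1, y1, x2, y2)` box format are assumptions, and real detectors often apply NMS per class.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first.

    A box is kept only if its IoU with every already-kept box is at
    most iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```

Running suppression once, before relation prediction, means the relation stage receives a deduplicated set of boxes with the detector's labels already fixed, which matches the description of not decoding object labels again.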

REACT: Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation

Fast Contextual Scene Graph Generation with Unbiased Context Augmentation

RepSGG: Novel Representations of Entities and Relationships for Scene Graph Generation

PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation

Adaptive Visual Scene Understanding: Incremental Scene Graph Generation

Fine-Grained is Too Coarse: A Novel Data-Centric Approach for Efficient Scene Graph Generation

Modeling Dynamic Environments with Scene Graph Memory

SGTR+: End-to-end Scene Graph Generation with Transformer

Scene Dynamics: Counterfactual Critic Multi-Agent Training for Scene Graph Generation

LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation

Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation

Counterfactual Critic Multi-Agent Training for Scene Graph Generation

Towards Scene Graph Anticipation

Grounding Consistency: Distilling Spatial Common Sense for Precise Visual Relationship Detection

Tackling the Challenges in Scene Graph Generation With Local-to-Global Interactions

Relation Regularized Scene Graph Generation

Fine-Grained Scene Graph Generation with Data Transfer

Location-Free Scene Graph Generation

NeuSyRE: Neuro-symbolic visual understanding and reasoning framework based on scene graph enrichment

Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction

Scene Graph Modification as Incremental Structure Expanding