REACT: Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation

Maëlic Neau, Paulo E. Santos, Anne-Gwenn Bosser, Cédric Buche
2024-11-30
Abstract: Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we propose a novel architecture, inference method, and relation prediction model. Our proposed solution, the REACT model, achieves the highest inference speed among existing SGG models, improving object detection accuracy without sacrificing relation prediction performance. Compared to state-of-the-art approaches, REACT is 2.7 times faster (with a latency of 23 ms) and improves object detection accuracy by 58.51%. Furthermore, our proposal significantly reduces model size, with an average of 5.5x fewer parameters. Code is available at https://github.com/Maelic/SGG-Benchmark.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the trade-off between prediction performance and inference speed in Scene Graph Generation (SGG). SGG encodes the visual relationships between objects in an image as a graph structure. Although SGG shows great potential for downstream tasks such as reasoning for embodied agents, real-time applications require balancing performance against inference speed. Current methods usually focus on only one or two of the following aspects:

1. **Improving relation prediction accuracy**.
2. **Enhancing object detection accuracy**.
3. **Reducing latency**.

Few methods optimize all three goals simultaneously. The authors therefore propose a new architecture, a new inference method, and a new relation prediction model, REACT. The model significantly improves object detection accuracy and speeds up inference without sacrificing relation prediction performance.

### Main contributions

1. **New architecture (Decoupled Two-Stage, DTS)**: Decoupling the two-stage architecture allows a real-time object detector (such as YOLO) to be combined with a state-of-the-art two-stage relation prediction component, reducing latency by up to 10 times.
2. **New inference method (Dynamic Candidate Selection, DCS)**: Dynamically selecting the optimal number of candidate objects reduces latency further without affecting accuracy.
3. **New model (REACT)**: Combining the DTS architecture with the DCS method, REACT performs well on standard metrics, with a latency of only 23 milliseconds (44 FPS) and a 5.5-fold reduction in the number of parameters.
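
To make the decoupling idea concrete, here is a minimal Python sketch of a two-stage pipeline in which the detector's labels are final and the relation stage only assigns predicates to pairs of detected objects. It is an illustration under stated assumptions, not the authors' implementation: `toy_detector`, `toy_relation_head`, `generate_scene_graph`, and the `max_candidates` parameter are hypothetical names, and the geometric "left of / right of" rule merely stands in for a learned relation predictor.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Detection:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    label: str    # final class label, decided by the detector only
    score: float  # detector confidence in [0, 1]


Triplet = Tuple[str, str, str]  # (subject label, predicate, object label)


def toy_detector(image: object) -> List[Detection]:
    """Hypothetical stand-in for a frozen real-time detector such as YOLO."""
    return [
        Detection((10, 40, 60, 120), "person", 0.94),
        Detection((80, 60, 160, 130), "bench", 0.88),
        Detection((200, 20, 230, 50), "bird", 0.31),
    ]


def toy_relation_head(candidates: Sequence[Detection]) -> List[Triplet]:
    """Hypothetical stand-in for a learned relation predictor.

    It assigns a predicate to every ordered pair of candidates and
    never changes the labels produced by the detector.
    """
    triplets: List[Triplet] = []
    for subj, obj in permutations(candidates, 2):
        predicate = "left of" if subj.box[0] < obj.box[0] else "right of"
        triplets.append((subj.label, predicate, obj.label))
    return triplets


def generate_scene_graph(
    image: object,
    detector: Callable[[object], List[Detection]],
    relation_head: Callable[[Sequence[Detection]], List[Triplet]],
    max_candidates: int,
) -> List[Triplet]:
    """Stage 1 detects objects; stage 2 predicts relations between them."""
    detections = detector(image)
    # Keep the most confident detections as relation candidates.
    ranked = sorted(detections, key=lambda d: d.score, reverse=True)
    return relation_head(ranked[:max_candidates])


graph = generate_scene_graph(None, toy_detector, toy_relation_head, max_candidates=2)
print(graph)
# [('person', 'left of', 'bench'), ('bench', 'right of', 'person')]
```

Because the relation head receives the candidates as plain inputs, either stage can be swapped out without touching the other. The number of ordered pairs grows quadratically with `max_candidates`, which is why the candidate count has a direct effect on latency.
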

### Specific problems solved

- **Object detection accuracy**: Traditional two-stage SGG methods decode object labels a second time during the relation prediction stage, which adds redundant computation and degrades object detection accuracy. Freezing the regression and classification heads of the object detector and performing non-maximum suppression (NMS) before relation prediction preserves the detector's accuracy.
- **Candidate object selection**: Traditional methods usually select a fixed number of high-confidence candidate objects, which increases computational cost. The DCS method selects the optimal number of candidate objects dynamically during inference, reducing the computational burden.
- **Latency**: Models such as PE-NET include many redundant steps in feature extraction and context learning, which increases latency. Simplifying the feature processing flow and removing unnecessary feature up-sampling and fusion operations reduces the overall latency of the model.

In summary, this paper proposes a new SGG method that aims for the best balance between performance and speed in real-time applications.
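
The two inference-side ideas listed under "Specific problems solved" can be sketched in a few lines of Python. The first function is standard greedy NMS, applied to detector outputs before any relation prediction. The second is only one plausible reading of dynamic candidate selection, since this summary does not give the selection rule: it assumes a sweep over candidate budgets that keeps the smallest budget whose score stays within a tolerance of the best one. The names `select_candidate_budget`, `evaluate`, `tolerance`, and `toy_recall` are hypothetical, and the toy scoring function is synthetic, not a measurement from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


def select_candidate_budget(budgets: Sequence[int],
                            evaluate: Callable[[int], float],
                            tolerance: float = 0.005) -> int:
    """Assumed selection rule: smallest budget within `tolerance` of the best score."""
    results = {k: evaluate(k) for k in budgets}
    best = max(results.values())
    return min(k for k, score in results.items() if score >= best - tolerance)


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]


def toy_recall(k: int) -> float:
    """Synthetic score that saturates once 25 candidates are kept."""
    return min(k, 25) / 25


print(select_candidate_budget([10, 20, 40, 80], toy_recall))  # 40
```

In this toy setting, budgets above 40 add pairwise relation work without improving the score, so the sweep settles on 40. In practice `evaluate` would be replaced by a metric computed on held-out data.
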