Task-oriented Sequential Grounding in 3D Scenes

Zhuofan Zhang, Ziyu Zhu, Pengxiang Li, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, Qing Li
2024-08-08
Abstract: Grounding natural language in physical 3D environments is essential for the advancement of embodied artificial intelligence. Current datasets and models for 3D visual grounding predominantly focus on identifying and localizing objects from static, object-centric descriptions. These approaches do not adequately address the dynamic and sequential nature of task-oriented grounding necessary for practical applications. In this work, we propose a new task: Task-oriented Sequential Grounding in 3D scenes, wherein an agent must follow detailed step-by-step instructions to complete daily activities by locating a sequence of target objects in indoor scenes. To facilitate this task, we introduce SG3D, a large-scale dataset containing 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes. The dataset is constructed using a combination of RGB-D scans from various 3D scene datasets and an automated task generation pipeline, followed by human verification for quality assurance. We adapted three state-of-the-art 3D visual grounding models to the sequential grounding task and evaluated their performance on SG3D. Our results reveal that while these models perform well on traditional benchmarks, they face significant challenges with task-oriented sequential grounding, underscoring the need for further research in this area.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a new problem: Task-oriented Sequential Grounding in 3D scenes. The goal is for an agent to complete daily activities by following detailed step-by-step instructions, which requires locating a sequence of target objects in indoor scenes. The paper makes three contributions:

1. **Proposing a New Task**: The paper proposes Task-oriented Sequential Grounding in 3D scenes. The agent must understand each step of the plan and identify its target object in the context of the whole task, because a single step taken on its own may not be enough to distinguish the target from other similar objects.
2. **Constructing a Large-scale Dataset**: The paper introduces SG3D, a large-scale dataset containing 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes. The scenes come from RGB-D scans in several existing 3D scene datasets, and the tasks are produced by an automated generation pipeline followed by human verification for quality assurance.
3. **Evaluating Existing Models**: Three state-of-the-art 3D visual grounding models (3D-VisTA, PQ3D, and LEO) are adapted to the sequential grounding task and evaluated on SG3D. The results show that although these models perform well on traditional benchmarks, they face significant challenges in task-oriented sequential grounding.

In summary, the paper aims to bridge the gap between existing 3D visual grounding methods, which focus on static, object-centric descriptions, and the task-driven, sequential grounding required in practical applications. By building a large-scale dataset on real-world scenes and evaluating existing models on it, the paper establishes a benchmark and points to the need for further research on task-oriented sequential grounding in 3D scenes.
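
To make the sequential setting concrete, the following is a minimal Python sketch of how a task with ordered steps and per-step target objects might be represented and scored. Everything in it is an assumption made for illustration: the `Step` and `Task` classes, their field names, the two scoring functions, and the toy tasks and object IDs are hypothetical and are not taken from SG3D or from the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str        # natural-language instruction for this step (invented example)
    target_id: int   # hypothetical ID of the object this step refers to


@dataclass
class Task:
    description: str
    steps: List[Step]


def step_accuracy(tasks: List[Task], predictions: List[List[int]]) -> float:
    """Fraction of all steps whose predicted object ID matches the target."""
    correct = 0
    total = 0
    for task, preds in zip(tasks, predictions):
        for step, pred in zip(task.steps, preds):
            correct += int(pred == step.target_id)
            total += 1
    return correct / total if total else 0.0


def task_accuracy(tasks: List[Task], predictions: List[List[int]]) -> float:
    """Fraction of tasks in which every step is grounded correctly."""
    if not tasks:
        return 0.0
    solved = 0
    for task, preds in zip(tasks, predictions):
        if all(pred == step.target_id for step, pred in zip(task.steps, preds)):
            solved += 1
    return solved / len(tasks)


# Toy data, invented for illustration only.
tasks = [
    Task(
        description="Make a cup of coffee",
        steps=[
            Step("Walk to the kitchen counter.", target_id=3),
            Step("Pick up the mug next to the coffee machine.", target_id=7),
            Step("Place it under the coffee machine.", target_id=5),
        ],
    ),
    Task(
        description="Water the plant",
        steps=[
            Step("Take the watering can from the shelf.", target_id=2),
            Step("Pour water into the plant by the window.", target_id=4),
        ],
    ),
]

# One predicted object ID per step; the last step of the second task is wrong.
predictions = [[3, 7, 5], [2, 9]]

step_acc = step_accuracy(tasks, predictions)
task_acc = task_accuracy(tasks, predictions)
```

In this toy example, the step "Place it under the coffee machine." only makes sense given the preceding step, which is the kind of contextual dependency the summary describes. Scoring whole tasks as well as individual steps is one plausible design choice here, since a single wrong step can make an entire task fail even when most steps are grounded correctly.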