Grounding 3D Scene Affordance From Egocentric Interactions

Cuiyu Liu, Wei Zhai, Yuhang Yang, Hongchen Luo, Sen Liang, Yang Cao, Zheng-Jun Zha
2024-09-29
Abstract: Grounding 3D scene affordance aims to locate interactive regions in 3D environments, which is crucial for embodied agents to interact intelligently with their surroundings. Most existing approaches achieve this by mapping semantics to 3D instances based on static geometric structure and visual appearance. This passive strategy limits the agent's ability to actively perceive and engage with the environment, making it reliant on predefined semantic instructions. In contrast, humans develop complex interaction skills by observing and imitating how others interact with their surroundings. To empower the model with such abilities, we introduce a novel task: grounding 3D scene affordance from egocentric interactions, where the goal is to identify the corresponding affordance regions in a 3D scene based on an egocentric video of an interaction. This task faces the challenges of spatial complexity and alignment complexity across multiple sources. To address these challenges, we propose the Egocentric Interaction-driven 3D Scene Affordance Grounding (Ego-SAG) framework, which utilizes interaction intent to guide the model in focusing on interaction-relevant sub-regions and aligns affordance features from different sources through a bidirectional query decoder mechanism. Furthermore, we introduce the Egocentric Video-3D Scene Affordance Dataset (VSAD), covering a wide range of common interaction types and diverse 3D environments to support this task. Extensive experiments on VSAD validate both the feasibility of the proposed task and the effectiveness of our approach.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to accurately locate interactive regions in a three-dimensional environment: specifically, how to enable an agent to identify the affordance regions of a 3D scene from an egocentric video of an interaction. This capability is crucial for agents that must interact intelligently in complex environments.

### Specific Problem Description

1. **Limitations of Existing Methods**:
   - Existing methods mainly map semantics to 3D instances using static geometric structure and visual appearance. This passive strategy restricts the agent's ability to actively perceive and interact with the environment and makes it dependent on predefined semantic instructions.
   - Reinforcement learning methods can learn scene affordances through active exploration, but they need a large number of trials to converge, and a gap remains between simulated and real environments.
2. **Human Learning Methods**:
   - Humans develop complex interaction skills by observing and imitating how others interact with the environment. This observation-based learning is more flexible and efficient.
3. **Proposing a New Task**:
   - To give the model a human-like learning ability, the paper proposes a new task: grounding 3D scene affordance from egocentric interactions. The goal is to identify the affordance regions in a 3D scene that correspond to the interaction shown in an egocentric video.

### Main Challenges

1. **Spatial Complexity**:
   - 3D environments have complex spatial structure, and most regions are irrelevant to a given interaction, which makes locating scene affordances ambiguous.
   - The relationship between interaction intent and the layout of 3D scene sub-regions must be modeled in order to identify the regions that matter for a specific interaction.
2. **Alignment Complexity**:
   - Differences in user habits, object appearance, and background cause the same interaction to look different across videos, while the corresponding affordance regions also vary significantly in size, position, and structure across scenes.
   - These variations must be aligned in feature space so that regions sharing common affordance characteristics can be extracted.

### Solutions

To address these challenges, the paper proposes a new framework, Ego-SAG (Egocentric Interaction-driven 3D Scene Affordance Grounding), which includes two key modules:

1. **Interaction-guided Spatial Saliency Allocation Module (ISA)**:
   - Handles spatial complexity. It extracts local sub-region features through sampling and grouping strategies, then uses multi-head cross-attention to model the relationship between interaction intent and sub-region layout, giving priority to the regions most relevant to the specific interaction.
2. **Bidirectional Query Decoder Module (BQD)**:
   - Handles alignment complexity. Through a bidirectional query decoding mechanism, it progressively extracts and refines the high-dimensional alignment between the modalities, revealing the explicit 3D scene affordance.

In addition, the paper introduces a new dataset, VSAD (Video-3D Scene Affordance Dataset), which covers a wide range of common interaction types and diverse 3D environments to support research on this task.

### Summary

The main contributions of this paper are:

1. Proposing the task of grounding 3D scene affordance from egocentric (first-person) interactions and establishing the large-scale VSAD benchmark dataset.
2. Proposing the Ego-SAG framework, which uses interaction intent to guide the model toward interaction-relevant sub-regions of the scene and aligns affordance features from videos and 3D scenes through a bidirectional query mechanism.
3. Showing experimentally that Ego-SAG significantly outperforms representative methods from several related fields and can serve as a strong baseline for future research.
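
The idea behind the ISA module (sample and group the scene into sub-regions, then let the interaction intent attend over them) can be illustrated with a small NumPy sketch. This is not the paper's implementation. It assumes farthest point sampling and k-nearest-neighbour max-pooling for the sampling and grouping step, treats the intent vector as the attention query, and omits all learned projections. The function names, shapes, and random inputs are hypothetical.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Pick k indices spread out over the point cloud (assumed sampling step)."""
    chosen = [0]  # arbitrary starting point
    dist = np.full(points.shape[0], np.inf)
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        step = np.linalg.norm(points - points[chosen[-1]], axis=1)
        dist = np.minimum(dist, step)
        chosen.append(int(np.argmax(dist)))
    return np.array(chosen)

def group_subregions(points, feats, centers_idx, n_neighbors):
    """Max-pool per-point features over each center's nearest neighbours."""
    diff = points[centers_idx, None, :] - points[None, :, :]
    d = np.linalg.norm(diff, axis=-1)                # (k, n)
    nn = np.argsort(d, axis=1)[:, :n_neighbors]      # (k, n_neighbors)
    return feats[nn].max(axis=1)                     # (k, c)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_saliency(intent, regions, n_heads):
    """Attention of one intent vector over sub-regions, averaged across heads.

    Learned query/key projections are omitted to keep the sketch short.
    """
    head_dim = intent.shape[0] // n_heads
    q = intent.reshape(n_heads, head_dim)            # (h, d)
    k = regions.reshape(-1, n_heads, head_dim)       # (k, h, d)
    scores = np.einsum("hd,khd->hk", q, k) / np.sqrt(head_dim)
    return softmax(scores, axis=-1).mean(axis=0)     # (k,)

# Hypothetical inputs: 256 scene points with 16-dim features, one intent vector.
rng = np.random.default_rng(0)
points = rng.normal(size=(256, 3))
feats = rng.normal(size=(256, 16))
intent = rng.normal(size=16)

centers = farthest_point_sampling(points, 8)
regions = group_subregions(points, feats, centers, n_neighbors=16)
saliency = multi_head_saliency(intent, regions, n_heads=4)
```

Here `saliency` is a probability distribution over the eight sampled sub-regions; in a full model such weights would be used to emphasise the features of interaction-relevant regions before decoding.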