Abstract: The ability to understand and reason about the 3D real world is a crucial milestone towards artificial general intelligence. The current common practice is to finetune Large Language Models (LLMs) with 3D data and texts to enable 3D understanding. Despite their effectiveness, these approaches are inherently limited by the scale and diversity of the available 3D data. Alternatively, in this work, we introduce Agent3D-Zero, an innovative 3D-aware agent framework that addresses 3D scene understanding in a zero-shot manner. The essence of our approach centers on reconceptualizing the challenge of 3D scene perception as a process of understanding and synthesizing insights from multiple images, inspired by how humans attempt to understand 3D scenes. By consolidating this idea, we propose a novel way to make use of a Large Visual Language Model (VLM) via actively selecting and analyzing a series of viewpoints for 3D understanding. Specifically, given an input 3D scene, Agent3D-Zero first processes a bird's-eye view image with custom-designed visual prompts, then iteratively chooses the next viewpoints to observe and summarize the underlying knowledge. A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLMs' ability to identify the most informative viewpoints and thus facilitate observing 3D scenes. Extensive experiments demonstrate the effectiveness of the proposed framework in understanding diverse and previously unseen 3D environments.

What problem does this paper attempt to address?

The problem this paper addresses is how to enable large vision-language models (VLMs) to understand and perceive 3D scenes without additional training. Specifically, the paper proposes a framework named Agent3D-Zero that aims to achieve zero-shot understanding of 3D scenes from multi-view image input. This avoids the large amounts of 3D data that fine-tuning-based methods require, and with it the difficulty of collecting and annotating 3D data.

### Main Problems

1. **Challenges in 3D Scene Understanding**:
   - Current methods usually rely on large-scale 3D datasets for fine-tuning, but acquiring and annotating these datasets is time-consuming and costly.
   - Publicly available 3D datasets have limited diversity, mainly focusing on CAD models, indoor environments, and autonomous driving scenarios.
2. **The Need for Zero-Shot Learning**:
   - The goal is a method that uses pre-trained VLMs to understand and perceive 3D scenes without additional training.
   - The method needs to perform well across multiple tasks, such as 3D question answering, 3D-assisted dialogue, 3D scene description, and 3D semantic segmentation.

### Solutions

- **Agent3D-Zero Framework**:
  - Uses multi-view image input and the multimodal capabilities of VLMs to achieve zero-shot understanding of 3D scenes.
  - Introduces a new visual prompting technique, Set-of-Line Prompting (SoLP), which enhances the spatial understanding of VLMs by superimposing grid lines and direction markers on the bird's-eye view.
- **Multi-view Selection**:
  - Agent3D-Zero actively selects multiple views for observation, so that it understands 3D scenes more comprehensively.
  - Iteratively selecting the most informative views improves the accuracy of 3D scene understanding.
- **Task Adaptability**:
  - Through task-specific prompts, Agent3D-Zero can handle multiple 3D understanding tasks, such as 3D question answering, 3D-assisted dialogue, 3D scene description, and 3D semantic segmentation.

### Experimental Results

- **3D Question Answering**:
  - Experiments on the ScanQA dataset show that, in the zero-shot setting, Agent3D-Zero outperforms or comes close to many models that require fine-tuning.
  - It performs particularly well on the METEOR, ROUGE-L, and CIDEr metrics.
- **3D-Assisted Dialogue**:
  - Experiments on the 3D-LLM held-in dataset show that Agent3D-Zero also performs well in dialogue tasks and can effectively use spatial information in its responses.
- **3D Scene Description**:
  - Using either randomly or iteratively selected views, Agent3D-Zero can generate detailed 3D scene descriptions, demonstrating its ability to understand complex scenes.
- **3D Semantic Segmentation**:
  - In the zero-shot setting, the 3D semantic segmentation performance of Agent3D-Zero falls short of traditional supervised methods, but its results are still somewhat competitive, demonstrating the potential of VLMs in 3D perception tasks.

### Summary

The paper addresses the problem of using VLMs to understand and perceive 3D scenes without additional training by proposing the Agent3D-Zero framework. The method reduces the dependence on large amounts of 3D data and demonstrates the broad applicability and potential of VLMs across multiple 3D understanding tasks.
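
The two core ideas in the summary, overlaying grid lines on a bird's-eye-view image and iteratively choosing the next viewpoint, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`overlay_grid`, `select_views`, `farthest_first`), the nested-list image representation, and the `(x, y)` viewpoint encoding are all assumptions made for illustration. In particular, `farthest_first` is a simple geometric heuristic standing in for the VLM query that Agent3D-Zero would issue, and direction markers are omitted.

```python
from typing import Callable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]   # RGB triple
Image = List[List[Pixel]]      # rows of pixels (hypothetical image format)
View = Tuple[int, int]         # (x, y) camera position on the BEV grid (assumed encoding)


def overlay_grid(bev: Image, spacing: int,
                 color: Pixel = (255, 0, 0)) -> Tuple[Image, List[int], List[int]]:
    """Return a copy of `bev` with grid lines every `spacing` pixels.

    Also returns the x and y positions of the lines, so that grid
    intersections can be offered as candidate viewpoints.
    """
    height, width = len(bev), len(bev[0])
    out = [row[:] for row in bev]          # copy rows; the input is left untouched
    xs = list(range(0, width, spacing))
    ys = list(range(0, height, spacing))
    for y in ys:                           # horizontal lines
        out[y] = [color] * width
    for x in xs:                           # vertical lines
        for row in out:
            row[x] = color
    return out, xs, ys


def select_views(candidates: Sequence[View],
                 propose_next: Callable[[List[View], List[View]], View],
                 budget: int) -> List[View]:
    """Pick up to `budget` views, one per iteration.

    `propose_next(chosen, remaining)` is a placeholder for asking a VLM
    which viewpoint to observe next, given what has been seen so far.
    """
    chosen: List[View] = []
    remaining = list(candidates)
    while remaining and len(chosen) < budget:
        nxt = propose_next(chosen, remaining)
        if nxt not in remaining:           # stop on an invalid proposal
            break
        chosen.append(nxt)
        remaining.remove(nxt)
    return chosen


def farthest_first(chosen: List[View], remaining: List[View]) -> View:
    """Stand-in for the VLM: prefer the candidate farthest from all chosen views."""
    if not chosen:
        return remaining[0]

    def min_sq_dist(v: View) -> int:
        return min((v[0] - c[0]) ** 2 + (v[1] - c[1]) ** 2 for c in chosen)

    return max(remaining, key=min_sq_dist)


# Usage: a 6x4 black BEV image, grid lines every 2 pixels, two views selected.
bev_image: Image = [[(0, 0, 0)] * 6 for _ in range(4)]
gridded, grid_xs, grid_ys = overlay_grid(bev_image, spacing=2)
grid_points = [(x, y) for y in grid_ys for x in grid_xs]
picked = select_views(grid_points, farthest_first, budget=2)
```

In a real pipeline the gridded image would be sent to the VLM together with a text prompt, and `propose_next` would parse the model's reply into one of the candidate positions. Keeping the selection loop separate from the proposal function makes it straightforward to swap the heuristic for an actual model call.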

Agent3D-Zero: An Agent for Zero-shot 3D Understanding

Extracting Zero-shot Common Sense from Large Language Models for Robot 3D Scene Understanding

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models

VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

An Embodied Generalist Agent in 3D World

GenZI: Zero-Shot 3D Human-Scene Interaction Generation

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation

InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding

Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs

TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation

Leveraging Large Language Models for Robot 3D Scene Understanding

Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models

Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems

Zero-1-to-3: Zero-shot One Image to 3D Object

PLA: Language-Driven Open-Vocabulary 3D Scene Understanding