Abstract: In this paper, we introduce a new task: Zero-Shot 3D Reasoning Segmentation for searching and localizing parts of objects, a new paradigm for 3D segmentation that transcends the limitations of previous category-specific 3D semantic segmentation, 3D instance segmentation, and open-vocabulary 3D segmentation. We design a simple baseline method, Reasoning3D, that can understand and execute complex commands for (fine-grained) segmentation of specific parts of 3D meshes, with contextual awareness and reasoned answers for interactive segmentation. Specifically, Reasoning3D leverages an off-the-shelf pre-trained 2D segmentation network, powered by Large Language Models (LLMs), to interpret user input queries in a zero-shot manner. Previous research has shown that extensive pre-training endows foundation models with prior world knowledge, enabling them to comprehend complex commands, a capability we can harness to "segment anything" in 3D with limited 3D datasets (source efficient). Experiments show that our approach generalizes and can effectively localize and highlight parts of 3D objects (represented as 3D meshes) based on implicit textual queries, including articulated 3D objects and real-world scanned data. Our method can also generate natural language explanations corresponding to these 3D models and their decomposition. Moreover, our training-free approach allows rapid deployment and serves as a viable universal baseline for future research on part-level 3D (semantic) object understanding in various fields, including robotics, object manipulation, part assembly, autonomous driving, augmented reality and virtual reality (AR/VR), and medical applications. The code, the model weights, the deployment guide, and the evaluation protocol are available at: http://tianrun-chen.github.io/Reason3D/

To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Generalization algorithm of multimodal pre-training model based on graph-text self-supervised training

3D Scene Graph Guided Vision-Language Pre-training

GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition

Joint Learning of Attended Zero-Shot Features and Visual-Semantic Mapping

Agent3D-Zero: An Agent for Zero-shot 3D Understanding

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly

Improving Generalization in Visual Reasoning via Self-Ensemble

Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models

3D Vision and Language Pretraining with Large-Scale Synthetic Data

Good Questions Help Zero-Shot Image Reasoning

Zero-shot Commonsense Reasoning over Machine Imagination

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Extracting Zero-shot Common Sense from Large Language Models for Robot 3D Scene Understanding

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

GLID: Pre-training a Generalist Encoder-Decoder Vision Model

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Incorporating Scene Graphs into Pre-trained Vision-Language Models for Multimodal Open-vocabulary Action Recognition