Abstract: Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment. AR research has long focused on the potential of context awareness, demonstrating novel capabilities that leverage the semantics in the 3D environment for various object-level interactions. Meanwhile, the computer vision community has made leaps in neural vision-language understanding to enhance environment perception for autonomous tasks. In this work, we introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation, enabling user-guided machine learning involving physical objects. We first present a fast multimodal 3D reconstruction pipeline that brings linguistic understanding to AR by fusing CLIP vision-language features into the environment and object models. We then propose "in-situ" machine learning, which, in conjunction with the multimodal representation, enables new tools and interfaces for users to interact with physical spaces and objects in a spatially and linguistically meaningful manner. We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time. We also make our full implementation and demo data available at https://github.com/cy-xu/spatially_aware_AI to encourage further exploration and research in spatially aware AI.

OCTO+: A Suite for Automatic Open-Vocabulary Object Placement in Mixed Reality

OCTOPUS: Open-vocabulary Content Tracking and Object Placement Using Semantic Understanding in Mixed Reality

Augmented Object Intelligence with XR-Objects

OVO: Open-Vocabulary Occupancy

Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI

Spot-Compose: A Framework for Open-Vocabulary Object Retrieval and Drawer Manipulation in Point Clouds

Open-vocabulary object 6D pose estimation

Open-Vocabulary Category-Level Object Pose and Size Estimation

A Monocular SLAM-based Multi-User Positioning System with Image Occlusion in Augmented Reality

OCC-VO: Dense Mapping via 3D Occupancy-Based Visual Odometry for Autonomous Driving

OpenOcc: Open Vocabulary 3D Scene Reconstruction via Occupancy Representation

OV9D: Open-Vocabulary Category-Level 9D Object Pose and Size Estimation

OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data

Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps

POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images

30‐4: Semantic Simultaneous Localization and Mapping for Augmented Reality

Auto-Vocabulary Segmentation for LiDAR Points

Structured Spatial Reasoning with Open Vocabulary Object Detectors

Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection

Large Language Model-assisted Speech and Pointing Benefits Multiple 3D Object Selection in Virtual Reality

Stimulating Imagination: Towards General-purpose Object Rearrangement