Understanding 3D Object Interaction from a Single Image

Shengyi Qian, David F. Fouhey
2023-08-05
Abstract: Humans can easily understand a single image as depicting multiple potential objects permitting interaction. We use this skill to plan our interactions with the world and accelerate understanding new objects without engaging in interaction. In this paper, we would like to endow machines with the similar ability, so that intelligent agents can better explore the 3D scene or manipulate objects. Our approach is a transformer-based model that predicts the 3D location, physical properties and affordance of objects. To power this model, we collect a dataset with Internet videos, egocentric videos and indoor images to train and validate our approach. Our model yields strong performance on our data, and generalizes well to robotics data. Project site: https://jasonqsy.github.io/3DOI/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the problem of understanding 3D object interactions from a single image. Specifically:

1. **Task Definition**:
   - Given an RGB image and a set of query points, predict, for each query point, whether the object is movable, its position, its rigidity, its joint type (rotational or translational), the action it affords (pulling, pushing, or others), and its potential interaction points.
2. **Dataset Construction**:
   - The authors construct a new dataset named 3D Object Interaction (3DOI), which draws on internet videos, first-person (egocentric) videos, and indoor scene renderings so that the model can generalize to new environments.
3. **Model Design**:
   - The authors propose a Transformer-based model that adds multiple prediction heads on top of a detection backbone network (such as Segment-Anything) to perform the tasks above, and that can be trained end-to-end.

Through these methods, the researchers aim to give machines a human-like ability to understand potential 3D interactions from a single image, thereby improving object manipulation and the exploration of three-dimensional spaces.
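
The outputs listed under Task Definition can be pictured as one record per query point. The following is a minimal Python sketch of such a record, not the authors' code: every class and field name (`QueryPrediction`, `box`, `interaction_points`, and so on) is hypothetical. It also makes two assumptions: that position is represented as a 2D bounding box, and that a non-movable object carries no further attributes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Rigidity(Enum):
    RIGID = "rigid"
    NON_RIGID = "non_rigid"


class JointType(Enum):
    ROTATION = "rotation"
    TRANSLATION = "translation"


class Action(Enum):
    PULL = "pull"
    PUSH = "push"
    OTHER = "other"


@dataclass
class QueryPrediction:
    """Hypothetical container for the outputs predicted at one query point."""

    query_point: Tuple[float, float]          # (x, y) pixel location of the query
    movable: bool                             # can the object at this point be moved?
    # Assumption: position is stored as a box (x_min, y_min, x_max, y_max).
    box: Optional[Tuple[float, float, float, float]] = None
    rigidity: Optional[Rigidity] = None
    joint_type: Optional[JointType] = None
    action: Optional[Action] = None
    interaction_points: List[Tuple[float, float]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumption of this sketch: a non-movable object has no other attributes.
        if not self.movable:
            extras = (self.box, self.rigidity, self.joint_type, self.action)
            if any(v is not None for v in extras) or self.interaction_points:
                raise ValueError("non-movable objects must not carry attributes")


# Example: a cabinet door that rotates about a hinge and is opened by pulling.
door = QueryPrediction(
    query_point=(120.0, 240.0),
    movable=True,
    box=(90.0, 180.0, 200.0, 400.0),
    rigidity=Rigidity.RIGID,
    joint_type=JointType.ROTATION,
    action=Action.PULL,
    interaction_points=[(185.0, 290.0)],
)

# Example: a query point that lands on a wall, which is not movable.
wall = QueryPrediction(query_point=(10.0, 10.0), movable=False)
```

Keeping all outputs in one record per query point mirrors the task as described above, where a single image plus a set of query points yields one set of predictions for each point.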