RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos

Hongchi Xia, Yang Fu, Sifei Liu, Xiaolong Wang
2024-07-28
Abstract: We introduce a new RGB-D object dataset captured in the wild called WildRGB-D. Unlike most existing real-world object-centric datasets which only come with RGB capturing, the direct capture of the depth channel allows better 3D annotations and broader downstream applications. WildRGB-D comprises large-scale category-level RGB-D object videos, which are taken using an iPhone to go around the objects in 360 degrees. It contains around 8500 recorded objects and nearly 20000 RGB-D videos across 46 common object categories. These videos are taken with diverse cluttered backgrounds with three setups to cover as many real-world scenarios as possible: (i) a single object in one video; (ii) multiple objects in one video; and (iii) an object with a static hand in one video. The dataset is annotated with object masks, real-world scale camera poses, and reconstructed aggregated point clouds from RGB-D videos. We benchmark four tasks with WildRGB-D including novel view synthesis, camera pose estimation, object 6D pose estimation, and object surface reconstruction. Our experiments show that the large-scale capture of RGB-D objects provides a large potential to advance 3D object learning. Our project page is https://wildrgbd.github.io/.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper constructs WildRGB-D, a large-scale real-world RGB-D (red, green, blue, plus depth) object dataset. Unlike most existing real-world object datasets, which contain only RGB images, WildRGB-D captures the depth channel directly, which allows better 3D annotations and a wider range of applications. Specifically, the paper attempts to address the following issues:

1. **Lack of Large-Scale Real-World RGB-D Data**:
   - Existing 3D object datasets are mostly synthetic or only partially built from real scans, and they lack large-scale real-world multi-view RGB-D videos.
   - This limits model performance in real-world applications, because synthetic data struggles to reproduce real textures, shapes, backgrounds, and natural lighting.
2. **3D Object Learning in Multi-View and Complex Scenes**:
   - Existing datasets usually cover limited viewing angles and scenes, so they fail to reflect the diversity and complexity of the real world.
   - WildRGB-D records 360-degree videos in three setups (a single object, multiple objects, and an object held by a static hand), which increases the diversity and complexity of the data.
3. **Performance Improvement in Downstream Tasks**:
   - The dataset is used to benchmark four downstream tasks: novel view synthesis, camera pose estimation, object 6D pose estimation, and object surface reconstruction.
   - Experimental results show that large-scale RGB-D capture has great potential for 3D object learning, especially in tasks such as novel view synthesis and camera pose estimation.
4. **Application of Self-Supervised Learning**:
   - The paper also explores self-supervised learning for object 6D pose estimation, demonstrating that effective self-supervised training can be achieved with large-scale RGB-D images even without training labels.

In summary, by constructing the large-scale WildRGB-D dataset, this paper aims to address the shortcomings of existing datasets in real-world 3D object learning, promoting research and applications in related fields.
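
To make the role of the depth channel concrete, the sketch below shows how per-frame depth, an object mask, and a camera pose could be combined into an aggregated object point cloud, the kind of annotation the abstract lists. It is a minimal illustration and not the authors' pipeline: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), metric depth in meters, the 4x4 camera-to-world pose convention, and the function names `backproject_frame` and `aggregate` are all assumptions made for this example, and it uses only the Python standard library to stay dependency-free.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Grid = Sequence[Sequence[float]]   # H x W depth values, assumed to be meters
Mask = Sequence[Sequence[bool]]    # H x W object mask
Pose = Sequence[Sequence[float]]   # 4 x 4 camera-to-world matrix (assumed convention)


def backproject_frame(depth: Grid, mask: Mask, fx: float, fy: float,
                      cx: float, cy: float, cam_to_world: Pose) -> List[Point]:
    """Lift masked, valid depth pixels of one frame to 3D world coordinates."""
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if not keep or z <= 0:
                continue  # skip background pixels and invalid depth
            # Pinhole back-projection into the camera frame.
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Apply the rigid camera-to-world transform (top three rows).
            world = tuple(
                sum(cam_to_world[r][c] * cam[c] for c in range(4))
                for r in range(3)
            )
            points.append(world)
    return points


def aggregate(frames: Sequence[Tuple[Grid, Mask, Pose]], fx: float, fy: float,
              cx: float, cy: float) -> List[Point]:
    """Concatenate the back-projected object points of every frame."""
    cloud: List[Point] = []
    for depth, mask, pose in frames:
        cloud.extend(backproject_frame(depth, mask, fx, fy, cx, cy, pose))
    return cloud


if __name__ == "__main__":
    depth = [[2.0, 2.0], [2.0, 2.0]]
    mask = [[True, False], [False, True]]
    identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    print(aggregate([(depth, mask, identity)], 1.0, 1.0, 0.0, 0.0))
```

Because the depth is measured rather than estimated, the resulting points are in real-world scale as long as the poses are, which is what distinguishes this kind of annotation from RGB-only reconstructions. A practical version would also need outlier filtering and voxel downsampling, which are omitted here.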