Abstract: We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device. We seek to significantly advance the status quo with respect to both data and modeling. First, we establish that existing datasets have significant limitations to scale, accuracy, and diversity of objects. As a result, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish Cubify Transformer (CuTR), a fully Transformer 3D object detection baseline which rather than operating in 3D on point or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that paired with CA-1M, CuTR outperforms point-based methods - accurately recalling over 62% of objects in 3D, and is significantly more capable at handling noise and uncertainty present in commodity LiDAR-derived depth maps while also providing promising RGB only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D - supporting the notion that while inductive biases in 3D are useful at the smaller sizes of existing datasets, they fail to scale to the data-rich regime of CA-1M. Overall, this dataset and baseline model provide strong evidence that we are moving towards models which can effectively Cubify Anything.
What problem does this paper attempt to address?
This paper aims to improve the state of indoor 3D object detection from a single RGB(-D) frame. It identifies two problems:
1. **Limitations of datasets**: Existing datasets have significant limitations in scale, accuracy, and object diversity. They are often small, coarsely labeled, or lack an accurate mapping from world space to image space. As a result, they focus mainly on room-defining objects (such as chairs, beds, and tables) and ignore the small objects common in daily life.
2. **Limitations of model design**: Existing 3D object detection models usually rely on point cloud or voxel representations. These methods need complex computational mechanisms (such as sparse convolution) to process 3D data, and they introduce strong 3D inductive biases to compensate for the limited scale of the datasets. However, they operate at low resolution and miss many of the small objects in a scene, and they handle noise and uncertainty poorly, especially in depth maps obtained from commodity-level LiDAR sensors.
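To make the resolution concern concrete, here is a minimal sketch of what "lifting" a depth map into 3D and voxelizing it involves. It is an illustration only and is not taken from the paper: the pinhole intrinsics `fx`, `fy`, `cx`, `cy`, the helper names `backproject` and `voxelize`, and the voxel sizes are all assumptions.

```python
import math
from typing import List, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def backproject(depth: List[List[float]], fx: float, fy: float,
                cx: float, cy: float) -> List[Point]:
    """Lift a depth map (in meters) to camera-space points with a pinhole model.

    Assumption: non-positive depth values mark missing measurements.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue  # skip pixels without a valid depth reading
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def voxelize(points: List[Point], voxel_size: float) -> Set[Voxel]:
    """Quantize points to the set of occupied voxel indices."""
    return {
        (math.floor(x / voxel_size),
         math.floor(y / voxel_size),
         math.floor(z / voxel_size))
        for x, y, z in points
    }


if __name__ == "__main__":
    # A hypothetical 4x4 depth map of a flat surface 2 m from the camera.
    depth = [[2.0] * 4 for _ in range(4)]
    pts = backproject(depth, fx=4.0, fy=4.0, cx=2.0, cy=2.0)
    # Coarser voxels merge more points into the same cell.
    print(len(pts), len(voxelize(pts, 0.5)), len(voxelize(pts, 1.0)))
```

In this toy setup, doubling the voxel size merges neighboring points into shared cells. The same effect is why a coarse voxel grid can blur a small object into only a handful of occupied cells.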
To solve the above problems, the paper makes two main contributions:
1. **Constructing a large-scale, decoupled dataset**: The Cubify Anything 1M (CA-1M) dataset is introduced. It exhaustively labels more than 400,000 3D objects in more than 1,000 high-precision laser-scanned scenes, and these labels are registered almost perfectly to more than 3,500 handheld, egocentric captures. CA-1M therefore provides both accurate ground truth in 3D space and pixel-accurate labels in each image frame.
2. **Designing a Transformer-based 3D object detection model**: The Cubify Transformer (CuTR) is proposed as a fully Transformer-based 3D object detection baseline. CuTR predicts 3D boxes directly from 2D features derived from RGB(-D) input, without lifting the input into 3D space. Although this approach lacks any 3D inductive bias, when paired with CA-1M it outperforms point-based methods, accurately recalling more than 62% of objects in 3D, and it is more robust to the noise and uncertainty in commodity-level LiDAR-derived depth maps.
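The recall figure above measures the fraction of labeled objects matched by a predicted 3D box. The sketch below shows one way such a number could be computed. It is not the paper's evaluation protocol: it assumes axis-aligned boxes (real 3D boxes are generally oriented), a hypothetical IoU threshold of 0.25, and greedy one-to-one matching, and the names `iou_3d` and `recall` are illustrative.

```python
from typing import List, Set, Tuple

# Axis-aligned box as (xmin, ymin, zmin, xmax, ymax, zmax); a simplifying assumption.
Box = Tuple[float, float, float, float, float, float]


def volume(b: Box) -> float:
    """Volume of an axis-aligned box."""
    return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (volume(a) + volume(b) - inter)


def recall(gt: List[Box], pred: List[Box], thr: float = 0.25) -> float:
    """Fraction of ground-truth boxes matched by a distinct prediction.

    Each ground-truth box greedily takes the unused prediction with the
    highest IoU, provided that IoU is at least `thr` (assumed threshold).
    """
    if not gt:
        return 0.0
    used: Set[int] = set()
    hits = 0
    for g in gt:
        best_j, best_iou = -1, thr
        for j, p in enumerate(pred):
            if j in used:
                continue
            iou = iou_3d(g, p)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0:
            used.add(best_j)
            hits += 1
    return hits / len(gt)


if __name__ == "__main__":
    gt = [(0, 0, 0, 1, 1, 1), (2, 2, 2, 3, 3, 3)]
    pred = [(0, 0, 0, 1, 1, 0.5)]  # covers half of the first box only
    print(recall(gt, pred))
```

With these toy inputs, one of the two labeled boxes is matched, so the recall is 0.5. A stricter threshold or oriented-box IoU would change which predictions count as matches.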
Overall, the paper aims to advance indoor 3D object detection by pairing a large-scale, high-precision, and diverse dataset with a Transformer-based baseline model, so that a much wider variety of objects can be detected.