RGB2Point: 3D Point Cloud Generation from Single RGB Images

Jae Joong Lee, Bedrich Benes
2024-11-01
Abstract: We introduce RGB2Point, an unposed single-view RGB image to a 3D point cloud generation based on Transformer. RGB2Point takes an input image of an object and generates a dense 3D point cloud. Contrary to prior works based on CNN layers and diffusion denoising approaches, we use pre-trained Transformer layers that are fast and generate high-quality point clouds with consistent quality over available categories. Our generated point clouds demonstrate high quality on a real-world dataset, as evidenced by improved Chamfer distance (51.15%) and Earth Mover's distance (45.96%) metrics compared to the current state-of-the-art. Additionally, our approach shows a better quality on a synthetic dataset, achieving better Chamfer distance (39.26%), Earth Mover's distance (26.95%), and F-score (47.16%). Moreover, our method produces 63.1% more consistent high-quality results across various object categories compared to prior works. Furthermore, RGB2Point is computationally efficient, requiring only 2.3GB of VRAM to reconstruct a 3D point cloud from a single RGB image, and our implementation generates the results 15,133x faster than a SOTA diffusion-based model.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the problem of generating a 3D point cloud from a single RGB image. The authors propose RGB2Point, a Transformer-based method that takes one unposed image of an object and produces a dense 3D point cloud. Compared to existing methods based on CNN layers and diffusion denoising, RGB2Point improves on three aspects:

1. **Generation Quality**: The point clouds generated by RGB2Point score better on both a real-world and a synthetic dataset.
2. **Computational Efficiency**: The method needs only 2.3 GB of VRAM to reconstruct a point cloud from one image, and the authors' implementation is 15,133 times faster than a state-of-the-art diffusion-based model.
3. **Stability**: RGB2Point produces 63.1% more consistent results across object categories than prior works.

### Main Contributions

1. **Efficiency**: RGB2Point generates high-quality 3D point clouds with only 2.3 GB of VRAM, which lowers hardware requirements.
2. **Speed**: The method generates 3D point clouds 15,133 times faster than a state-of-the-art diffusion-based model.
3. **High-Quality Reconstruction**: On the synthetic dataset, RGB2Point improves Chamfer Distance by 39.26%, Earth Mover’s Distance by 26.95%, and F-score by 47.16% over the prior state of the art. On the real-world dataset, it improves Chamfer Distance by 51.15% and Earth Mover’s Distance by 45.96%.
4. **Stability**: The method shows more consistent generation quality across object categories, with a lower standard deviation of the metrics.

### Method Overview

The architecture of RGB2Point has three parts:

1. **2D Image Feature Extraction**: A pre-trained Vision Transformer (ViT) extracts features from the input image.
2. **Context Feature Integrator (CFI)**: Multi-head attention and feed-forward layers enhance the feature representations of specific regions.
3. **Geometric Projection Module (GPM)**: The extracted features are mapped into 3D space to generate the point cloud.

### Experimental Results

1. **Quantitative Evaluation**: On the ShapeNet and Pix3D datasets, RGB2Point outperforms existing methods on multiple metrics, including Chamfer Distance, Earth Mover’s Distance, and F-score.
2. **Qualitative Evaluation**: The generated point clouds are visually more accurate and detailed, especially on complex real-world data.
3. **Ablation Studies**: Varying hyperparameters and modules shows how much each part contributes to overall performance and supports the model's effectiveness and robustness.

### Conclusion

By building on pre-trained Transformer layers, RGB2Point generates high-quality 3D point clouds from a single RGB image. The method improves generation quality while also offering clear advantages in computational efficiency and stability across object categories.
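
### Illustrative Metric Computation

The three metrics named above (Chamfer Distance, Earth Mover’s Distance, and F-score) can be sketched in a few lines of standard-library Python. This is only a minimal illustration under stated assumptions, not the authors' evaluation code: Chamfer Distance is taken here as the sum of mean squared nearest-neighbour distances in both directions, Earth Mover’s Distance as the mean cost of the best one-to-one matching between equally sized clouds (found by brute force, so usable only for a handful of points), and the F-score threshold `tau` is a hypothetical parameter. The paper may use different normalisations, and all function names are illustrative.

```python
import itertools
import math

Point = tuple[float, float, float]


def nearest_distance(p: Point, cloud: list[Point]) -> float:
    """Euclidean distance from p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(pred: list[Point], gt: list[Point]) -> float:
    """Symmetric Chamfer distance (assumed convention: mean squared
    nearest-neighbour distance, summed over both directions)."""
    pred_to_gt = sum(nearest_distance(p, gt) ** 2 for p in pred) / len(pred)
    gt_to_pred = sum(nearest_distance(g, pred) ** 2 for g in gt) / len(gt)
    return pred_to_gt + gt_to_pred


def earth_movers_distance(pred: list[Point], gt: list[Point]) -> float:
    """Mean cost of the cheapest one-to-one matching between the clouds.

    Brute force over all permutations: exact, but factorial in the number
    of points, so only suitable for tiny illustrative inputs.
    """
    if len(pred) != len(gt):
        raise ValueError("this sketch assumes equally sized point clouds")
    best = min(
        sum(math.dist(p, g) for p, g in zip(pred, perm))
        for perm in itertools.permutations(gt)
    )
    return best / len(pred)


def f_score(pred: list[Point], gt: list[Point], tau: float) -> float:
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(nearest_distance(p, gt) <= tau for p in pred) / len(pred)
    recall = sum(nearest_distance(g, pred) <= tau for g in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy clouds: the prediction matches two points exactly and misses
    # the third by 0.5 along the z axis.
    gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
    print("Chamfer:", chamfer_distance(pred, gt))
    print("EMD:", earth_movers_distance(pred, gt))
    print("F-score:", f_score(pred, gt, tau=0.1))
```

Lower is better for Chamfer Distance and Earth Mover’s Distance, while higher is better for F-score. For clouds with thousands of points, the nearest-neighbour search would normally use a spatial index and the matching an approximate optimal-transport solver, since the loops above scale quadratically and factorially.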