Abstract: In this paper we study the task of single-view image-guided point cloud completion. Existing methods have achieved promising results by fusing image information into the point cloud explicitly or implicitly. However, given that the image has global shape information and the partial point cloud has rich local details, we believe that both modalities need to be given equal attention when performing modality fusion. To this end, we propose a novel dual-channel modality fusion network for image-guided point cloud completion (named DMF-Net), which works in a coarse-to-fine manner. In the first stage, DMF-Net takes a partial point cloud and the corresponding image as input to recover a coarse point cloud. In the second stage, the coarse point cloud is upsampled twice with a shape-aware upsampling transformer to get the dense and complete point cloud. Extensive quantitative and qualitative experimental results show that DMF-Net outperforms the state-of-the-art unimodal and multimodal point cloud completion works on the ShapeNet-ViPC dataset.
What problem does this paper attempt to address?
### The Problem Addressed by the Paper
This paper addresses point cloud completion guided by a single-view image. Existing methods have achieved promising results by explicitly or implicitly integrating image information into the point cloud. However, the image carries global shape information, while the partial point cloud carries rich local details, so the authors argue that the two modalities should receive equal attention during modality fusion. To this end, they propose a Dual-Channel Modality Fusion Network (DMF-Net) for point cloud completion guided by single-view images.
### Specific Problem Description
1. **Quality Issues of Point Cloud Data**:
- Point cloud data is often incomplete and sparse due to self-occlusion, occlusion between objects, uneven lighting, and low scanning resolution.
- Such low-quality point clouds make it difficult to understand 3D shapes, which degrades the performance of many downstream tasks such as point cloud detection, point cloud reconstruction, and point cloud upsampling.
2. **Limitations of Existing Methods**:
- Most existing learning methods adopt an encoder-decoder architecture, where the encoder extracts global latent vectors from partial inputs, and the decoder generates complete point clouds based on the global latent representation.
- However, the global information extracted from incomplete point clouds may be ambiguous and misleading.
- Although some multimodal completion networks improve performance by introducing single-view images, the modality fusion process of these methods is mainly dominated by the point cloud modality, lacking reasonable utilization of the image modality.
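The encoder-decoder pattern described in item 2 can be sketched as follows. This is a minimal illustration with random weights, in the style of a max-pooling point encoder, and not the architecture of any specific method from the paper. All names (`encode`, `decode`, `LATENT_DIM`, `NUM_OUT`) and sizes are hypothetical. It shows how the whole partial input is compressed into a single global latent vector, so any ambiguity in that vector carries over to every generated point.

```python
import random

def linear(x, w):
    """Multiply a vector x (length d_in) by a weight matrix w (d_in x d_out)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]

def relu(v):
    return [max(0.0, a) for a in v]

def random_matrix(rows, cols, rng):
    # Random weights stand in for learned parameters (assumption for the sketch).
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def encode(points, w_enc):
    """Per-point features followed by max pooling: one global latent vector."""
    feats = [relu(linear(p, w_enc)) for p in points]
    return [max(col) for col in zip(*feats)]

def decode(latent, w_dec, num_out):
    """Map the global latent vector to num_out xyz coordinates."""
    flat = linear(latent, w_dec)
    return [flat[3 * i: 3 * i + 3] for i in range(num_out)]

rng = random.Random(0)
LATENT_DIM, NUM_OUT = 16, 8            # hypothetical sizes
w_enc = random_matrix(3, LATENT_DIM, rng)
w_dec = random_matrix(LATENT_DIM, 3 * NUM_OUT, rng)

# A toy "partial" point cloud with 5 points.
partial = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]
latent = encode(partial, w_enc)
completed = decode(latent, w_dec, NUM_OUT)
```

Because the encoder max-pools over points, the latent vector does not depend on the order of the input points, and the decoder sees nothing of the input except that one vector.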
### Solution
To overcome the above issues, the authors propose DMF-Net, which has the following features:
1. **Dual-Channel Modality Fusion**:
- In the encoding stage, DMF-Net uses a dual-channel modality fusion strategy that symmetrically captures complementary information from the image and the point cloud.
- In this way, the image modality not only guides the point cloud modality but also complements it, ensuring that both modalities have equal importance during modality fusion.
2. **Shape-Aware Upsampling Transformer**:
- In the second stage, DMF-Net utilizes a shape-aware upsampling transformer to upsample the coarse point cloud twice, generating a dense and complete point cloud.
- The shape-aware upsampling transformer captures local geometric details by applying self-attention among the points of a local neighborhood.
3. **Two-Stage Generation**:
- In the first stage, DMF-Net adopts an encoder-decoder architecture to generate a sparse but complete point cloud.
- In the second stage, the coarse point cloud is upsampled to generate a dense and complete point cloud.
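The three features above can be illustrated with a rough sketch. It is not the authors' implementation: the summary does not give layer details, so the code assumes plain scaled dot-product attention without learned projections, a k-nearest-neighbor neighborhood, and a hypothetical offset rule for splitting each point in two. The function names (`cross_attend`, `dual_channel_fuse`, `local_self_attention`, `upsample_once`) and all sizes are illustrative.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, context):
    """Each query gathers a softmax-weighted mix of context tokens (residual add)."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        mixed = [sum(w * c[d] for w, c in zip(weights, context)) for d in range(len(q))]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out

def dual_channel_fuse(point_feats, image_feats):
    """Symmetric fusion: both channels read from the original, unfused inputs."""
    fused_points = cross_attend(point_feats, image_feats)  # points query the image
    fused_image = cross_attend(image_feats, point_feats)   # image queries the points
    return fused_points, fused_image

def knn_indices(points, i, k):
    """Indices of the k points nearest to points[i] (including i itself)."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points)]
    return [j for _, j in sorted(dists)[:k]]

def local_self_attention(points, feats, k):
    """Self-attention restricted to each point's k nearest neighbors."""
    out = []
    for i in range(len(points)):
        neighbors = [feats[j] for j in knn_indices(points, i, k)]
        out.append(cross_attend([feats[i]], neighbors)[0])
    return out

def upsample_once(points, feats, k=4, step=0.05):
    """Double the point count: each point splits into two children.
    The offset rule (tanh of the first three feature values) is an assumption."""
    refined = local_self_attention(points, feats, k)
    new_points, new_feats = [], []
    for p, f in zip(points, refined):
        offset = [step * math.tanh(v) for v in f[:3]]
        new_points.append([a + b for a, b in zip(p, offset)])
        new_points.append([a - b for a, b in zip(p, offset)])
        new_feats.extend([f, f])
    return new_points, new_feats

rng = random.Random(0)

def rand_tokens(n, d):
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]

point_feats = rand_tokens(6, 8)   # stand-in for point cloud features
image_feats = rand_tokens(4, 8)   # stand-in for image features
fused_points, fused_image = dual_channel_fuse(point_feats, image_feats)

coarse = rand_tokens(6, 3)        # stand-in for the stage-one coarse output
dense, dense_feats = coarse, fused_points
for _ in range(2):                # stage two: upsample twice
    dense, dense_feats = upsample_once(dense, dense_feats)
```

In this sketch neither modality is treated as the sole query source, which is one way to read the "equal attention" requirement, and the two upsampling passes grow the 6 coarse points to 24.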
### Experimental Results
Extensive quantitative and qualitative experimental results show that DMF-Net outperforms existing unimodal and multimodal point cloud completion methods on the ShapeNet-ViPC dataset.
### Conclusion
With DMF-Net, the paper targets two weaknesses of single-view image-guided point cloud completion: modality fusion that underuses the image, and missing local detail. The reported experiments on ShapeNet-ViPC show better completion quality than earlier unimodal and multimodal methods.