CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding

Yunze Liu, Changxi Chen, Zifan Wang, Li Yi
2024-01-17
Abstract: This paper introduces a novel approach named CrossVideo, which aims to enhance self-supervised cross-modal contrastive learning in the field of point cloud video understanding. Traditional supervised learning methods encounter limitations due to data scarcity and challenges in label acquisition. To address these issues, we propose a self-supervised learning method that leverages the cross-modal relationship between point cloud videos and image videos to acquire meaningful feature representations. Intra-modal and cross-modal contrastive learning techniques are employed to facilitate effective comprehension of point cloud video. We also propose a multi-level contrastive approach for both modalities. Through extensive experiments, we demonstrate that our method significantly surpasses previous state-of-the-art approaches, and we conduct comprehensive ablation studies to validate the effectiveness of our proposed designs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses data scarcity and the difficulty of obtaining labels in point cloud video understanding. It proposes CrossVideo, a method that uses self-supervised cross-modal contrastive learning to improve the understanding of point cloud videos.

**The main objectives include:**

1. **Overcoming the limitations of traditional supervised learning**:
   - Because point cloud video data is scarce and labels are hard to obtain, traditional supervised learning methods have hit bottlenecks. The paper therefore proposes a self-supervised learning method that uses the cross-modal relationship between point cloud videos and image videos to obtain meaningful feature representations.
2. **Introducing cross-modal contrastive learning**:
   - Combining intra-modal and cross-modal contrastive learning to promote effective understanding of point cloud videos, and proposing a multi-level contrastive learning method for both modalities.
3. **Validating the effectiveness of the method**:
   - Extensive experiments show that the method significantly outperforms existing state-of-the-art methods on multiple benchmark tasks, and comprehensive ablation studies validate the effectiveness of the design.

**Specifically, the main contributions of the paper are as follows:**

1. Proposing the first 4D self-supervised cross-modal representation learning method that utilizes collaborative learning between image videos and point cloud videos.
2. Utilizing intra-modal and cross-modal contrastive learning to promote effective understanding of point cloud videos.
3. Comparing the features of the two modalities at different levels to further enhance representation capabilities.
4. Showing experimentally that the method achieves significant improvements on multiple downstream tasks, with detailed ablation studies validating the effectiveness of the design.

Through these methods, the paper aims to improve the accuracy and robustness of point cloud video understanding, with broad application prospects in fields such as autonomous driving, robotic navigation, and augmented reality.
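To make the combination of intra-modal and cross-modal contrastive learning concrete, here is a minimal sketch using the generic InfoNCE objective in plain Python. The summary does not give the paper's exact loss, encoders, or hyperparameters, so everything here is an assumption for illustration: the function names `info_nce` and `combined_loss`, the `temperature` and `weight_cross` values, and the use of plain lists as stand-ins for encoder outputs are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] is pulled towards positives[i]; every other row of
    `positives` serves as a negative for that anchor.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def combined_loss(pc_view1, pc_view2, img_feats, weight_cross=1.0, temperature=0.1):
    """Hypothetical combination of the two terms (assumed weighting, not from the paper).

    intra-modal: two augmented views of the same point cloud video should agree.
    cross-modal: a point cloud video should agree with its paired image video.
    """
    intra = info_nce(pc_view1, pc_view2, temperature)
    cross = info_nce(pc_view1, img_feats, temperature)
    return intra + weight_cross * cross
```

With toy features, correctly paired image features yield a lower loss than mismatched ones, for example `combined_loss(pc, pc, img_aligned) < combined_loss(pc, pc, img_shuffled)`. A multi-level variant would, under the same assumptions, apply `info_nce` to features taken at more than one level of the encoders and sum the resulting terms.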