Abstract: This paper introduces a novel approach named CrossVideo, which aims to enhance self-supervised cross-modal contrastive learning in the field of point cloud video understanding. Traditional supervised learning methods encounter limitations due to data scarcity and challenges in label acquisition. To address these issues, we propose a self-supervised learning method that leverages the cross-modal relationship between point cloud videos and image videos to acquire meaningful feature representations. Intra-modal and cross-modal contrastive learning techniques are employed to facilitate effective comprehension of point cloud video. We also propose a multi-level contrastive approach for both modalities. Through extensive experiments, we demonstrate that our method significantly surpasses previous state-of-the-art approaches, and we conduct comprehensive ablation studies to validate the effectiveness of our proposed designs.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses data scarcity and the difficulty of obtaining labels in point cloud video understanding. It proposes a new method called CrossVideo, which uses self-supervised cross-modal contrastive learning to improve the understanding of point cloud videos.

**The main objectives include:**

1. **Overcoming the limitations of traditional supervised learning**:
   - Because point cloud video data is scarce and labels are hard to obtain, traditional supervised learning methods have hit a bottleneck. The paper instead proposes a self-supervised method that exploits the cross-modal relationship between point cloud videos and image videos to obtain meaningful feature representations.
2. **Introducing cross-modal contrastive learning**:
   - Using both intra-modal and cross-modal contrastive learning to promote effective understanding of point cloud videos, and proposing a multi-level contrastive learning method.
3. **Validating the effectiveness of the method**:
   - Extensive experiments show that the method significantly outperforms existing state-of-the-art methods on multiple benchmark tasks, and comprehensive ablation studies validate the effectiveness of the design.

**Specifically, the main contributions of the paper are as follows:**

1. Proposing the first 4D self-supervised cross-modal representation learning method that utilizes collaborative learning between image videos and point cloud videos.
2. Utilizing intra-modal and cross-modal contrastive learning to promote effective understanding of point cloud videos.
3. Comparing the features of the two modalities at different levels to further enhance representation capabilities.
4. Showing experimentally that the method achieves significant improvements on multiple downstream tasks, with detailed ablation studies that validate the effectiveness of the design.

Through these methods, the paper aims to improve the accuracy and robustness of point cloud video understanding, with broad application prospects in fields such as autonomous driving, robotic navigation, and augmented reality.
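
The summary above names intra-modal and cross-modal contrastive learning but does not state the loss function. The following is a minimal sketch of how the two terms could be combined, assuming an InfoNCE-style objective, which is a common choice for contrastive learning and not confirmed by the text as the paper's own. The function names, the `temperature` value, and the `cross_weight` parameter are all hypothetical, and features are plain Python lists standing in for encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products act as cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against zero vectors
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (assumed, not taken from the paper).

    anchors[i] should be similar to positives[i] and dissimilar to every
    other row of positives, which serve as negatives.
    """
    anchors = [l2_normalize(a) for a in anchors]
    positives = [l2_normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [
            sum(x * y for x, y in zip(anchor, pos)) / temperature
            for pos in positives
        ]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def cross_video_style_loss(point_view_a, point_view_b, image_feats, cross_weight=1.0):
    """Hypothetical combination of the two terms described in the summary.

    intra-modal: two augmented views of the same point cloud video.
    cross-modal: a point cloud video feature and its paired image video feature.
    """
    intra = info_nce(point_view_a, point_view_b)
    cross = info_nce(point_view_a, image_feats)
    return intra + cross_weight * cross


if __name__ == "__main__":
    # Toy features: three clips, three dimensions each.
    points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    images = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
    print(cross_video_style_loss(points, points, images))
```

Under the same assumptions, a multi-level variant could evaluate `cross_video_style_loss` once per feature level and sum the results; the summary does not say how the levels are defined or weighted.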

CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos

Self-Supervised Intra-Modal and Cross-Modal Contrastive Learning for Point Cloud Understanding

Cross-Modal Self-Supervised Learning with Effective Contrastive Units for LiDAR Point Clouds

Cross-view motion consistent self-supervised video inter-intra contrastive for action representation understanding

CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding

Contrastive Learning of Image Representations with Cross-Video Cycle-Consistency

PointCMP: Contrastive Mask Prediction for Self-supervised Learning on Point Cloud Videos

Cross-Architecture Self-supervised Video Representation Learning

Learning multi-view visual correspondences with self-supervision

GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding

SegContrast: 3D Point Cloud Feature Representation Learning Through Self-Supervised Segment Discrimination

Unsupervised Cross-view Subspace Clustering via Adaptive Contrastive Learning

Cross-Modal Contrastive Learning for Domain Adaptation in 3D Semantic Segmentation

Video Contrastive Learning with Global Context

Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning

Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences

P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding

TriCI: Triple Cross-Intra Branch Contrastive Learning for Point Cloud Analysis

Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting

MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding