Abstract: Recent self-supervised clustering-based pre-training techniques like DINO and CrIBo have shown impressive results for downstream detection and segmentation tasks. However, real-world applications such as autonomous driving face challenges with imbalanced object class and size distributions and complex scene geometries. In this paper, we propose S3PT, a novel scene semantics and structure guided clustering to provide more scene-consistent objectives for self-supervised training. Specifically, our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals. Second, we introduce object diversity consistent spatial clustering to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs. Third, we propose a depth-guided spatial clustering to regularize learning based on geometric information of the scene, thus further refining region separation on the feature level. Our learned representations significantly improve performance in downstream semantic segmentation and 3D object detection tasks on the nuScenes, nuImages, and Cityscapes datasets and show promising domain translation properties.

What problem does this paper attempt to address?

The paper addresses the poor performance of existing self-supervised pre-training methods on autonomous driving datasets, in particular with imbalanced object class and size distributions and with complex scene geometry. Specifically:

1. **Imbalanced object classes and sizes**: In autonomous driving scenes, object classes and sizes follow a long-tailed distribution (for example, small objects such as motorcycles and pedestrians appear less frequently). Existing self-supervised pre-training methods such as DINO and CrIBo assume that these distributions are uniform, which leads to unsatisfactory results on such data.
2. **Complex scene geometry**: Autonomous driving scenes contain large background areas alongside objects of very different sizes, which places higher demands on the model's spatial clustering ability. Existing methods do not fully exploit the geometric information of the scene (such as depth), which limits the model's ability to recognize small and occluded objects.

To solve these problems, the authors propose S3PT (Scene Semantics and Structure Guided Clustering), a scene semantics and structure guided clustering method that provides more scene-consistent objectives for self-supervised pre-training. Its specific contributions are:

- **Semantic distribution consistent clustering**: Using von Mises-Fisher (vMF) normalization, it encourages better representation of rare classes (such as motorcycles or animals) and adapts to long-tailed data. The probability of assigning a feature \(z\) to cluster \(k\) is

\[ P_k(z)=\frac{C\left(\frac{\|W_k\|}{\tau}\right)\exp\left(\frac{\langle W_k, z\rangle}{\tau}\right)}{\sum_{j = 1}^K C\left(\frac{\|W_j\|}{\tau}\right)\exp\left(\frac{\langle W_j, z\rangle}{\tau}\right)} \]

  where \(W_k\) is the prototype of cluster \(k\), \(\tau\) is the temperature, \(K\) is the number of clusters, and \(C(\cdot)\) is the vMF normalization constant.
- **Object diversity consistent spatial clustering**: By relaxing the assumption of uniform cluster sizes and increasing the number of clusters, it handles imbalanced and diverse object sizes, from large background areas to small objects (such as pedestrians and traffic signs).
- **Depth-guided spatial clustering**: Depth information is used to regularize learning. Based on the geometric information of the scene, region separation is further refined at the feature level, which helps with 3D perception tasks and with accurate segmentation of occluded objects.

With these improvements, S3PT significantly improves performance on downstream semantic segmentation and 3D object detection tasks and shows promising domain transfer ability.
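The vMF-normalized assignment probability \(P_k(z)\) above can be illustrated with a small, standard-library-only Python sketch. This is not the authors' implementation. It assumes that \(C(\kappa)\) is the usual normalization constant of a \(d\)-dimensional von Mises-Fisher distribution, \(C_d(\kappa)=\kappa^{d/2-1}/\big((2\pi)^{d/2} I_{d/2-1}(\kappa)\big)\), and it evaluates the Bessel function \(I_\nu\) with a truncated power series, which is only adequate for moderate \(\kappa\). The function names (`log_bessel_i`, `log_vmf_normalizer`, `vmf_assignment`) and the toy inputs are hypothetical.

```python
import math


def log_bessel_i(nu, x, terms=200):
    """Log of the modified Bessel function I_nu(x) for x > 0 (truncated power series)."""
    # Series: I_nu(x) = sum_m (x/2)^(2m+nu) / (m! * Gamma(m+nu+1)), summed in log space.
    logs = [
        (2 * m + nu) * math.log(x / 2.0) - math.lgamma(m + 1) - math.lgamma(m + nu + 1)
        for m in range(terms)
    ]
    peak = max(logs)
    return peak + math.log(sum(math.exp(t - peak) for t in logs))


def log_vmf_normalizer(kappa, dim):
    """log C_d(kappa), assuming the standard vMF normalization constant (kappa > 0)."""
    nu = dim / 2.0 - 1.0
    return nu * math.log(kappa) - (dim / 2.0) * math.log(2.0 * math.pi) - log_bessel_i(nu, kappa)


def vmf_assignment(z, prototypes, tau):
    """P_k(z) for every prototype W_k: a softmax over log C(||W_k||/tau) + <W_k, z>/tau."""
    dim = len(z)
    logits = []
    for w in prototypes:
        norm = math.sqrt(sum(c * c for c in w))  # ||W_k||, assumed non-zero
        dot = sum(a * b for a, b in zip(w, z))   # <W_k, z>
        logits.append(log_vmf_normalizer(norm / tau, dim) + dot / tau)
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: three prototypes with different norms in a 4-dimensional feature space.
probs = vmf_assignment(
    z=[1.0, 0.0, 0.0, 0.0],
    prototypes=[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]],
    tau=0.5,
)
print(probs)
```

When all prototypes have the same norm, the \(C(\cdot)\) terms cancel and the expression reduces to an ordinary temperature-scaled softmax. The normalization only changes the assignment when prototype norms differ.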

S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving

UniScene: Multi-Camera Unified Pre-training via 3D Scene Reconstruction for Autonomous Driving

Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving

Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data

PiCIE: Unsupervised Semantic Segmentation using Invariance and Equivariance in Clustering

Mutual Information-Driven Self-Supervised Point Cloud Pre-Training

Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP

Sample, Crop, Track: Self-Supervised Mobile 3D Object Detection for Urban Driving LiDAR

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene

Improving Point Cloud Semantic Segmentation by Learning 3D Object Detection

Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data

RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection

Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection

SGRec3D: Self-Supervised 3D Scene Graph Learning via Object-Level Scene Reconstruction

Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection

Towards 3D Semantic Scene Completion for Autonomous Driving: A Meta-Learning Framework Empowered by Deformable Large-Kernel Attention and Mamba Model

Plugging Self-Supervised Monocular Depth into Unsupervised Domain Adaptation for Semantic Segmentation

Multi-Object RANSAC: Efficient Plane Clustering Method in a Clutter