S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving

Maciej K. Wozniak, Hariprasath Govindarajan, Marvin Klingner, Camille Maurice, Ravi Kiran, Senthil Yogamani
2024-10-30
Abstract: Recent self-supervised clustering-based pre-training techniques like DINO and CrIBo have shown impressive results for downstream detection and segmentation tasks. However, real-world applications such as autonomous driving face challenges with imbalanced object class and size distributions and complex scene geometries. In this paper, we propose S3PT, a novel scene semantics and structure guided clustering to provide more scene-consistent objectives for self-supervised training. Specifically, our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals. Second, we introduce object diversity consistent spatial clustering to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs. Third, we propose a depth-guided spatial clustering to regularize learning based on geometric information of the scene, thus further refining region separation on the feature level. Our learned representations significantly improve performance in downstream semantic segmentation and 3D object detection tasks on the nuScenes, nuImages, and Cityscapes datasets and show promising domain translation properties.
Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
The paper addresses the weak performance of existing self-supervised pre-training methods on autonomous driving datasets, which combine imbalanced object class and size distributions with complex scene geometry. Specifically:

1. **Imbalanced object class and size distributions**: In driving scenes, object classes and sizes follow a long-tailed distribution; for example, small objects such as motorcycles and pedestrians appear less frequently. Existing self-supervised pre-training methods such as DINO and CrIBo assume that class and size distributions are uniform, so they perform poorly on such data.
2. **Complex scene geometry**: Driving scenes contain large background areas alongside objects of widely varying sizes, which places higher demands on the model's spatial clustering ability. Existing methods do not fully exploit geometric information about the scene, such as depth, which hurts their ability to recognize small and occluded objects.

To address these problems, the authors propose S3PT, a scene semantics and structure guided clustering method that aims to provide more scene-consistent objectives for self-supervised pre-training. Its specific contributions are:

- **Semantic distribution consistent clustering**: Using the vMF (von Mises-Fisher) normalization formula below, it encourages better representation of rare classes (such as motorcycles or animals) and adapts to long-tailed data.
  \[ P_k(z) = \frac{C\left(\frac{\|W_k\|}{\tau}\right)\exp\left(\frac{\langle W_k, z\rangle}{\tau}\right)}{\sum_{j=1}^{K} C\left(\frac{\|W_j\|}{\tau}\right)\exp\left(\frac{\langle W_j, z\rangle}{\tau}\right)} \]
- **Object diversity consistent spatial clustering**: By relaxing the assumption of clustering uniformity and increasing the number of clusters, it deals with imbalanced and diverse object sizes, from large background areas to small objects (such as pedestrians and traffic signs).
- **Depth-guided spatial clustering**: Depth information is introduced to regularize learning, and the feature-level region separation is further refined based on the geometric information of the scene, so as to better handle 3D perception tasks and accurate segmentation of occluded objects.

Through these improvements, S3PT significantly improves the performance of downstream semantic segmentation and 3D object detection tasks and shows good domain transfer ability.
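To make the vMF-normalized assignment probability \(P_k(z)\) from the first contribution more concrete, here is a minimal pure-Python sketch of how it could be evaluated. It is not the authors' implementation. It assumes that \(C(\cdot)\) is the standard von Mises-Fisher normalization constant \(C_p(\kappa) = \kappa^{p/2-1} / \big((2\pi)^{p/2} I_{p/2-1}(\kappa)\big)\), that \(z\) is a unit-norm feature of dimension \(p\), that \(W_k\) are unnormalized prototypes with non-zero norm, and that \(\tau\) is a temperature. The function names (`log_bessel_i`, `log_vmf_normalizer`, `vmf_assignment_probs`) and the toy inputs are hypothetical and chosen only for illustration.

```python
import math


def log_bessel_i(v, x, terms=200):
    """log of the modified Bessel function I_v(x), for x > 0.

    Uses the power series I_v(x) = sum_m (x/2)^(2m+v) / (m! * Gamma(m+v+1)),
    summed in log-space. Adequate for the small arguments used here; this is
    an illustrative helper, not a general-purpose Bessel routine.
    """
    log_half_x = math.log(x / 2.0)
    log_terms = [
        (2 * m + v) * log_half_x - math.lgamma(m + 1) - math.lgamma(m + v + 1)
        for m in range(terms)
    ]
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def log_vmf_normalizer(kappa, dim):
    """log C_p(kappa) for a vMF distribution on the unit sphere in R^dim."""
    v = dim / 2.0 - 1.0
    return (
        v * math.log(kappa)
        - (dim / 2.0) * math.log(2.0 * math.pi)
        - log_bessel_i(v, kappa)
    )


def vmf_assignment_probs(z, prototypes, tau):
    """Return [P_1(z), ..., P_K(z)] under the vMF-normalized formula.

    Assumption: each prototype has non-zero norm, so kappa = ||W_k|| / tau > 0.
    """
    dim = len(z)
    logits = []
    for w in prototypes:
        norm_w = math.sqrt(sum(c * c for c in w))
        dot = sum(a * b for a, b in zip(w, z))
        # log of the numerator: log C(||W_k|| / tau) + <W_k, z> / tau
        logits.append(log_vmf_normalizer(norm_w / tau, dim) + dot / tau)
    # Numerically stable normalization over the K prototypes.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    z = [1.0, 0.0, 0.0, 0.0]            # unit-norm toy feature
    prototypes = [
        [2.0, 0.0, 0.0, 0.0],           # larger-norm prototype
        [0.0, 1.0, 0.0, 0.0],
        [0.5, 0.5, 0.0, 0.0],
    ]
    print(vmf_assignment_probs(z, prototypes, tau=0.5))
```

When all prototypes share the same norm, the normalizer terms cancel and the expression reduces to an ordinary softmax over \(\langle W_k, z\rangle / \tau\); the vMF term only matters when prototype norms differ, which is what lets the formulation account for clusters of different sizes.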