Abstract: Spatial scene similarity plays a crucial role in spatial cognition because it enables us to understand and compare spatial scenes and their relationships. However, understanding spatial scenes is a complex task. Existing work on spatial scene representation learning focuses primarily on the spatial relationships among objects and often neglects their semantic features. Furthermore, scene representation learning methods that can seamlessly handle different types of spatial objects (e.g., points, polylines, and polygons) in a single scene are lacking. Moreover, because annotating spatial scenes requires expert knowledge, publicly available high-quality annotated data is limited in size, which usually leads to suboptimal results. To address these issues, we propose a novel multi-scale spatial scene encoding model called SpatialScene2Vec. SpatialScene2Vec uses a point location encoder to seamlessly encode the spatial information of different types of spatial objects, and a point feature encoder to encode the semantic features of these objects. A spatial scene embedding is then generated by integrating the spatial embeddings and feature embeddings of the spatial objects within the scene. To address the limited labeled data problem, we propose a self-supervised learning framework that trains the SpatialScene2Vec model with a contrastive loss for spatial scene similarity evaluation. In addition, we introduce a novel spatial scene data augmentation method that generates positive scene augmentations by leveraging the unique characteristics of spatial scenes and randomly sampling points based on the shapes of the polyline and polygon objects within the current scene. We conduct spatial scene retrieval experiments on real-world datasets that contain vector data of points, polylines, and polygons. Results show that SpatialScene2Vec outperforms well-established encoding methods such as Space2Vec, with significant improvements and robustness, owing to the integrated multi-scale representations and the proposed spatial scene data augmentation method.
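To make the augmentation and training ideas in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a positive view of a polyline can be produced by drawing random points along its length, and it uses an InfoNCE-style loss as one common choice of contrastive loss; the abstract does not state which loss is used. The function names `sample_points_on_polyline` and `info_nce_loss`, the `temperature` value, and the NumPy-only setup are all illustrative assumptions. Polygon sampling is omitted.

```python
import numpy as np


def sample_points_on_polyline(vertices, n_points, rng):
    """Draw n_points uniformly by arc length along a polyline.

    Hypothetical helper: assumes `vertices` is an (m, 2) sequence with
    m >= 2 and a non-zero total length.
    """
    vertices = np.asarray(vertices, dtype=float)
    seg = np.diff(vertices, axis=0)                # segment vectors, (m-1, 2)
    seg_len = np.linalg.norm(seg, axis=1)          # segment lengths
    cum = np.concatenate([[0.0], np.cumsum(seg_len)])
    # Random positions measured as distance travelled along the polyline.
    dist = rng.uniform(0.0, cum[-1], size=n_points)
    # Index of the segment that contains each sampled distance.
    idx = np.searchsorted(cum, dist, side="right") - 1
    idx = np.clip(idx, 0, len(seg_len) - 1)
    t = (dist - cum[idx]) / seg_len[idx]           # fraction along the segment
    return vertices[idx] + t[:, None] * seg[idx]


def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss: row i of z_a should match row i of z_b.

    Assumed stand-in for the contrastive loss mentioned in the abstract.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature             # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


rng = np.random.default_rng(0)
road = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]
# Two independent samplings of the same polyline act as two "views" of it.
view_1 = sample_points_on_polyline(road, 16, rng)
view_2 = sample_points_on_polyline(road, 16, rng)

# Toy scene embeddings, only to exercise the loss function.
emb = rng.normal(size=(8, 4))
loss_matched = info_nce_loss(emb, emb)
loss_shuffled = info_nce_loss(emb, np.roll(emb, 1, axis=0))
```

In this sketch the two sampled views would be passed through a scene encoder to obtain the embeddings; random vectors stand in for that step here. The loss is lower when each row is paired with its own counterpart than when the pairing is shifted, which is the behaviour a contrastive objective relies on.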

VLM2Scene: Self-Supervised Image-Text-LiDAR Learning with Foundation Models for Autonomous Driving Scene Understanding

Semi-Supervised Learning for Visual Bird's Eye View Semantic Segmentation

Cross-Modal Self-Supervised Learning with Effective Contrastive Units for LiDAR Point Clouds

SegContrast: 3D Point Cloud Feature Representation Learning Through Self-Supervised Segment Discrimination

Self-Supervised 3-D Semantic Representation Learning for Vision-and-Language Navigation

Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving

ContrastMotion: Self-supervised Scene Motion Learning for Large-Scale LiDAR Point Clouds

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

Enhancing scene understanding based on deep learning for end-to-end autonomous driving

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

VLM-Auto: VLM-based Autonomous Driving Assistant with Human-like Behavior and Understanding for Complex Road Scenes

Prompting Multi-Modal Tokens to Enhance End-to-End Autonomous Driving Imitation Learning with LLMs

SpatialScene2Vec: A self-supervised contrastive representation learning method for spatial scene similarity evaluation

GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding

VLP: Vision Language Planning for Autonomous Driving

Open 3D World in Autonomous Driving

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving