BEVPose: Unveiling Scene Semantics through Pose-Guided Multi-Modal BEV Alignment

Mehdi Hosseinzadeh, Ian Reid
2024-10-28
Abstract: In the field of autonomous driving and mobile robotics, there has been a significant shift in the methods used to create Bird's Eye View (BEV) representations. This shift is characterised by using transformers and learning to fuse measurements from disparate vision sensors, mainly lidar and cameras, into a 2D planar ground-based representation. However, these learning-based methods for creating such maps often rely heavily on extensive annotated data, presenting notable challenges, particularly in diverse or non-urban environments where large-scale datasets are scarce. In this work, we present BEVPose, a framework that integrates BEV representations from camera and lidar data, using sensor pose as a guiding supervisory signal. This method notably reduces the dependence on costly annotated data. By leveraging pose information, we align and fuse multi-modal sensory inputs, facilitating the learning of latent BEV embeddings that capture both geometric and semantic aspects of the environment. Our pretraining approach demonstrates promising performance in BEV map segmentation tasks, outperforming fully-supervised state-of-the-art methods, while necessitating only a minimal amount of annotated data. This development not only confronts the challenge of data efficiency in BEV representation learning but also broadens the potential for such techniques in a variety of domains, including off-road and indoor environments.
Robotics, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the reliance on labeled data when creating Bird's Eye View (BEV) representations for autonomous driving and mobile robotics. Existing learning-based methods for generating BEV maps depend heavily on large amounts of annotated data, a limitation that is most acute in diverse or non-urban environments, where large-scale datasets are scarce. To address this, the paper proposes BEVPose, a framework that integrates BEV representations from camera and lidar data and uses sensor pose as the supervisory signal, which notably reduces the dependence on costly annotations. With this pretraining, BEVPose outperforms fully supervised state-of-the-art methods on BEV map segmentation while requiring only a minimal amount of labeled data. Beyond improving data efficiency in BEV representation learning, the approach broadens the potential of such techniques in other domains, including off-road and indoor environments.