Robust Bird's Eye View Segmentation by Adapting DINOv2

Merve Rabia Barın, Görkay Aydemir, Fatma Güney
2024-09-16
Abstract: Extracting a Bird's Eye View (BEV) representation from multiple camera images offers a cost-effective, scalable alternative to LIDAR-based solutions in autonomous driving. However, the performance of existing BEV methods drops significantly under various corruptions such as brightness and weather changes or camera failures. To improve the robustness of BEV perception, we propose to adapt a large vision foundational model, DINOv2, to BEV estimation using Low Rank Adaptation (LoRA). Our approach builds on the strong representation space of DINOv2 by adapting it to the BEV task in a state-of-the-art framework, SimpleBEV. Our experiments show increased robustness of BEV perception under various corruptions, with increasing gains from scaling up the model and the input resolution. We also showcase the effectiveness of the adapted representations in terms of fewer learnable parameters and faster convergence during training.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the robustness of Bird's Eye View (BEV) segmentation from multi-camera images in autonomous driving. Existing BEV methods degrade significantly under corruptions such as brightness changes, weather variations, or camera failures. To improve robustness, the authors adapt the large vision foundation model DINOv2 to BEV estimation with Low-Rank Adaptation (LoRA), integrating it into the SimpleBEV framework. Their experiments show increased robustness under various corruptions, with gains that grow as the model and the input resolution are scaled up. The adapted representations also require fewer learnable parameters and converge faster during training. Overall, the study supports the use of large general-purpose vision models for BEV segmentation.
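
To make the idea of low-rank adaptation concrete, the following is a minimal, dependency-free sketch of a single LoRA-style linear layer. It is not the authors' implementation: the class name `LoRALinear`, the helper `lora_param_count`, the rank of 8, and the scaling factor `alpha` are illustrative assumptions, and the 768-dimensional example simply mirrors the embedding width of a ViT-B backbone. The pretrained weight stays frozen, and only two small matrices, `A` and `B`, would be trained.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_param_count(d_in, d_out, rank):
    """Number of trainable parameters in a rank-`rank` update B @ A."""
    return rank * (d_in + d_out)


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a low-rank update scale * B @ A.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight            # pretrained weight, kept frozen
        self.scale = alpha / rank       # common LoRA scaling convention
        # A starts small and random, B starts at zero, so the adapted
        # layer initially reproduces the pretrained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to one input vector of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x))
                for row in self.effective_weight()]

    def trainable_params(self):
        """Count the entries of A and B, the only trained parameters."""
        return (len(self.A) * len(self.A[0])
                + len(self.B) * len(self.B[0]))


# Assumed sizes: a 768 x 768 projection adapted with rank 8.
full = 768 * 768                          # 589824 frozen weights
low_rank = lora_param_count(768, 768, 8)  # 12288 trainable weights
```

Under these assumed sizes, the low-rank update trains 12,288 values instead of 589,824, which illustrates why adapting a frozen backbone this way needs far fewer learnable parameters than full fine-tuning.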