Abstract: This dissertation is a multifaceted contribution to the advancement of vision-based 3D perception technologies. In the first segment, the thesis introduces structural enhancements to both monocular and stereo 3D object detection algorithms. By integrating ground-referenced geometric priors into monocular detection models, this research achieves unparalleled accuracy in benchmark evaluations for monocular 3D detection. Concurrently, the work refines stereo 3D detection paradigms by incorporating insights and inferential structures gleaned from monocular networks, thereby augmenting the operational efficiency of stereo detection systems. The second segment is devoted to data-driven strategies and their real-world applications in 3D vision detection. A novel training regimen is introduced that amalgamates datasets annotated with either 2D or 3D labels. This approach not only augments the detection models through the utilization of a substantially expanded dataset but also facilitates economical model deployment in real-world scenarios where only 2D annotations are readily available. Lastly, the dissertation presents an innovative pipeline tailored for unsupervised depth estimation in autonomous driving contexts. Extensive empirical analyses affirm the robustness and efficacy of this newly proposed pipeline. Collectively, these contributions lay a robust foundation for the widespread adoption of vision-based 3D perception technologies in autonomous driving applications.

What problem does this paper attempt to address?

The main problems that this paper attempts to solve are **improving the performance and scalability of vision-based 3D object detection and monocular depth estimation in autonomous driving**. Specifically, the author addresses these issues in the following aspects:

1. **Improving the accuracy of vision-based 3D object detection**:
   - **Monocular 3D object detection**: A new network module design is introduced, especially by combining geometric prior information such as Ground-Aware Convolution to improve the detection accuracy.
   - **Binocular 3D object detection**: The multi-scale stereo feature extraction and fusion structure is improved, which enhances the efficiency and accuracy of stereo matching.
2. **Solving the problem of data annotation dependence**:
   - A joint training method is proposed, which allows the use of datasets with only 2D annotations to train 3D detection models, thereby reducing the dependence on expensive LiDAR-annotated data and enabling effective 3D detection on new datasets with only 2D annotations.
3. **Achieving unsupervised monocular depth estimation**:
   - An innovative unsupervised monocular depth prediction framework is proposed, which uses techniques such as self-distillation to perform depth estimation without depth labels, further reducing the need for annotated data.

### Specific problem description

- **Accuracy problem**: Due to the lack of direct depth measurement (such as that provided by LiDAR), traditional multi-view triangulation techniques are not reliable enough in dynamic driving environments. Meanwhile, deep-learning models need to perform semantic reasoning and geometric interpretation simultaneously, which poses a challenge to the accuracy of the system.

  \[ \text{Depth Prediction} = f(\text{Image Input}) \]

- **Real-time deployment problem**: To ensure that these complex deep neural networks can run on resource-constrained mobile platforms, the model architecture must be simplified and the computational efficiency optimized.
- **Data utilization problem**: Existing deep-learning frameworks usually rely on datasets containing LiDAR and camera inputs, which makes it difficult to train or fine-tune using only the data obtained from camera platforms. If the pre-trained model cannot generalize well to new environments, its practical application will be limited.

Through the above methods, the paper aims to lay a solid foundation for the wide application of vision-based 3D perception technology in autonomous driving.
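To make the idea of a ground-referenced geometric prior more concrete, the sketch below shows the standard pinhole-camera relation that such priors typically build on. It is not the thesis's Ground-Aware Convolution. It assumes a flat road, a camera whose optical axis is parallel to the ground, and a known camera height. The function name `ground_depth`, its parameters, and the numeric values are hypothetical and chosen only for illustration.

```python
from typing import Optional


def ground_depth(v: float, fy: float, cy: float, cam_height: float) -> Optional[float]:
    """Depth (metres) of a flat-ground point that projects to image row v.

    Assumptions (illustrative, not taken from the thesis):
      - pinhole camera with vertical focal length fy (pixels)
        and principal-point row cy (pixels)
      - optical axis parallel to a flat ground plane
      - camera mounted cam_height metres above the ground
    Under these assumptions: z = fy * cam_height / (v - cy).
    """
    dv = v - cy
    if dv <= 0:
        # Rows at or above the horizon never intersect the ground plane.
        return None
    return fy * cam_height / dv


if __name__ == "__main__":
    # Hypothetical intrinsics and mounting height.
    fy, cy, cam_height = 700.0, 172.5, 1.65
    for row in (150.0, 222.5, 272.5, 372.5):
        print(row, ground_depth(row, fy, cy, cam_height))
```

With these hypothetical values, a pixel 100 rows below the principal point maps to a depth of about 11.55 m, and rows closer to the horizon map to rapidly increasing depths. This is why a per-pixel ground prior can give a monocular network a useful depth cue for objects that touch the road.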

Scalable Vision-Based 3D Object Detection and Monocular Depth Estimation for Autonomous Driving

Ground-aware Monocular 3D Object Detection for Autonomous Driving

3D Object Detection for Autonomous Driving: A Survey

MonoAux: Fully Exploiting Auxiliary Information and Uncertainty for Monocular 3D Object Detection

Vision-Based Environmental Perception for Autonomous Driving

A survey on 3D object detection in real time for autonomous driving

MP-Mono: Monocular 3D Detection Using Multiple Priors for Autonomous Driving

Pseudo-Mono for Monocular 3D Object Detection in Autonomous Driving

Depth-Vision-Decoupled Transformer With Cascaded Group Convolutional Attention for Monocular 3-D Object Detection

Monocular 3D Object Detection With Sequential Feature Association and Depth Hint Augmentation

Accurate Monocular Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving

3D Object Detection from Images for Autonomous Driving: A Survey

3D Object Detection for Autonomous Driving: A Comprehensive Survey

Pseudo-Stereo for Monocular 3D Object Detection in Autonomous Driving

A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions

Vision-Language Guidance for LiDAR-based Unsupervised 3D Object Detection

SGM3D: Stereo Guided Monocular 3D Object Detection

Monocular 3D Object Detection: An Extrinsic Parameter Free Approach

Monocular Visual Object 3D Localization in Road Scenes

Monocular 3D lane detection for Autonomous Driving: Recent Achievements, Challenges, and Outlooks