Abstract: Monocular geometric scene understanding combines panoptic segmentation and self-supervised depth estimation, focusing on real-time application in autonomous vehicles. We introduce MGNiceNet, a unified approach that uses a linked kernel formulation for panoptic segmentation and self-supervised depth estimation. MGNiceNet is based on the state-of-the-art real-time panoptic segmentation method RT-K-Net and extends the architecture to cover both panoptic segmentation and self-supervised monocular depth estimation. To this end, we introduce a tightly coupled self-supervised depth estimation predictor that explicitly uses information from the panoptic path for depth prediction. Furthermore, we introduce a panoptic-guided motion masking method to improve depth estimation without relying on video panoptic segmentation annotations. We evaluate our method on two popular autonomous driving datasets, Cityscapes and KITTI. Our model shows state-of-the-art results compared to other real-time methods and closes the gap to computationally more demanding methods. Source code and trained models are available at https://github.com/markusschoen/MGNiceNet.

What problem does this paper attempt to address?

This paper addresses how to combine panoptic segmentation and self-supervised monocular depth estimation in a single model that runs in real time on autonomous vehicles. Specifically, the authors propose MGNiceNet, a unified method aimed at overcoming two major limitations of existing methods in autonomous driving systems:

1. **Supervised training problems in depth estimation**: Most existing depth estimation methods rely on stereo disparity or projected lidar points as ground-truth labels for supervised training. This approach generalizes poorly, and acquiring large-scale depth datasets is time-consuming and requires good sensor calibration.
2. **Trade-off between inference speed and accuracy**: Many existing methods prioritize accuracy over inference speed, which is unacceptable for autonomous driving systems that need to process incoming images with low latency.

To solve these problems, MGNiceNet uses a linked kernel formulation for panoptic segmentation and self-supervised depth estimation. In addition, a panoptic-guided motion masking technique is introduced to improve the accuracy of depth estimation without relying on video panoptic segmentation annotations. With these changes, MGNiceNet achieves real-time processing while maintaining high accuracy, making it suitable for autonomous driving scenarios.

### Formula summary

- **Self-supervised loss function for depth estimation**: \[ L_d=\lambda_{\text{phot}}L_{\text{phot}}+\lambda_{\text{smooth}}L_{\text{smooth}} \] where \(L_{\text{phot}}\) is the photometric loss, \(L_{\text{smooth}}\) is the smoothness loss, and \(\lambda_{\text{phot}}\) and \(\lambda_{\text{smooth}}\) are their weights.
- **Photometric loss calculation**: \[ L_{\text{phot}}(I(t),\hat{I}(t)_i)=\alpha\,\frac{1 - \text{SSIM}(I(t),\hat{I}(t)_i)}{2}+(1 - \alpha)\,|I(t)-\hat{I}(t)_i| \] where \(I(t)\) is the reference frame, \(\hat{I}(t)_i\) is the \(i\)-th context frame warped into the reference view, and \(\alpha\) balances the two terms. The structural similarity index (SSIM) measures the similarity between the reference frame and the warped context frame.
- **Final loss function**: \[ L = L_p+\lambda_{\text{depth}}L_d \] where \(L_p\) is the panoptic segmentation loss and \(\lambda_{\text{depth}}\) is the balancing parameter.

Together, these losses allow MGNiceNet to train depth estimation without ground-truth depth labels while jointly learning panoptic segmentation.
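
To make the loss terms above concrete, here is a minimal NumPy sketch for single-channel images with values in [0, 1]. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 3x3 box-filter SSIM, the (1 - SSIM)/2 scaling commonly used in the self-supervised depth literature, and all default weights (`alpha`, `lam_phot`, `lam_smooth`, `lam_depth`) are assumptions chosen for the example.

```python
import numpy as np


def box_filter_3x3(x):
    """Mean over a 3x3 window with reflect padding (output has the input's shape)."""
    p = np.pad(x, 1, mode="reflect")
    h, w = x.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += p[dy:dy + h, dx:dx + w]
    return out / 9.0


def ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM map from local 3x3 statistics (assumed window size)."""
    mu_a, mu_b = box_filter_3x3(a), box_filter_3x3(b)
    var_a = box_filter_3x3(a * a) - mu_a ** 2
    var_b = box_filter_3x3(b * b) - mu_b ** 2
    cov = box_filter_3x3(a * b) - mu_a * mu_b
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def photometric_loss(target, warped, alpha=0.85):
    """L_phot: weighted sum of an SSIM term and an L1 term, averaged over pixels."""
    ssim_term = np.clip((1.0 - ssim(target, warped)) / 2.0, 0.0, 1.0)
    l1_term = np.abs(target - warped)
    return float(np.mean(alpha * ssim_term + (1.0 - alpha) * l1_term))


def depth_loss(l_phot, l_smooth, lam_phot=1.0, lam_smooth=1e-3):
    """L_d = lam_phot * L_phot + lam_smooth * L_smooth (weights are placeholders)."""
    return lam_phot * l_phot + lam_smooth * l_smooth


def total_loss(l_panoptic, l_depth, lam_depth=1.0):
    """L = L_p + lam_depth * L_d."""
    return l_panoptic + lam_depth * l_depth


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.random((16, 16))
    warped = np.clip(reference + 0.05 * rng.standard_normal((16, 16)), 0.0, 1.0)
    l_phot = photometric_loss(reference, warped)
    # The smoothness and panoptic losses are passed in as plain numbers here,
    # since their definitions are not given in the summary above.
    print(total_loss(l_panoptic=1.2, l_depth=depth_loss(l_phot, l_smooth=0.3)))
```

In this sketch a perfectly warped context frame gives a photometric loss of zero, and any mismatch raises both the SSIM and L1 terms, which is the signal that drives depth learning without ground-truth depth.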

MGNiceNet: Unified Monocular Geometric Scene Understanding

A Robust Monocular Depth Estimation Framework Based on Light-Weight ERF-Pspnet for Day-Night Driving Scenes

PADENet: an Efficient and Robust Panoramic Monocular Depth Estimation Network for Outdoor Scenes

MonoPP: Metric-Scaled Self-Supervised Monocular Depth Estimation by Planar-Parallax Geometry in Automotive Applications

GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose

MDS-Net: Multi-Scale Depth Stratification 3D Object Detection from Monocular Images

FSNet: Redesign Self-Supervised MonoDepth for Full-Scale Depth Prediction for Autonomous Driving

PanDepth: Joint Panoptic Segmentation and Depth Completion

NeRFmentation: NeRF-based Augmentation for Monocular Depth Estimation

Real-Time Monocular Joint Perception Network for Autonomous Driving

MBUDepthNet: Real-Time Unsupervised Monocular Depth Estimation Method for Outdoor Scenes

MonoDVPS: A Self-Supervised Monocular Depth Estimation Approach to Depth-aware Video Panoptic Segmentation

Ground-aware Monocular 3D Object Detection for Autonomous Driving

On Deep Learning Techniques to Boost Monocular Depth Estimation for Autonomous Navigation

NDNet: Spacewise Multiscale Representation Learning via Neighbor Decoupling for Real-Time Driving Scene Parsing

Self-supervised monocular depth estimation via joint attention and intelligent mask loss

Unified Perception: Efficient Depth-Aware Video Panoptic Segmentation with Minimal Annotation Costs

PanoSSC: Exploring Monocular Panoptic 3D Scene Reconstruction for Autonomous Driving

AggNet for Self-supervised Monocular Depth Estimation: Go an Aggressive Step Further

MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection