MGNiceNet: Unified Monocular Geometric Scene Understanding

Markus Schön, Michael Buchholz, Klaus Dietmayer
2024-11-18
Abstract: Monocular geometric scene understanding combines panoptic segmentation and self-supervised depth estimation, focusing on real-time application in autonomous vehicles. We introduce MGNiceNet, a unified approach that uses a linked kernel formulation for panoptic segmentation and self-supervised depth estimation. MGNiceNet is based on the state-of-the-art real-time panoptic segmentation method RT-K-Net and extends the architecture to cover both panoptic segmentation and self-supervised monocular depth estimation. To this end, we introduce a tightly coupled self-supervised depth estimation predictor that explicitly uses information from the panoptic path for depth prediction. Furthermore, we introduce a panoptic-guided motion masking method to improve depth estimation without relying on video panoptic segmentation annotations. We evaluate our method on two popular autonomous driving datasets, Cityscapes and KITTI. Our model shows state-of-the-art results compared to other real-time methods and closes the gap to computationally more demanding methods. Source code and trained models are available at https://github.com/markusschoen/MGNiceNet.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to combine panoptic segmentation and self-supervised monocular depth estimation in a single model that runs in real time on autonomous vehicles. Specifically, the authors propose MGNiceNet, a unified method aimed at overcoming two major limitations of existing methods in autonomous driving systems:

1. **Reliance on supervised training for depth estimation**: Most existing depth estimation methods rely on stereo disparity or projected lidar points as ground-truth labels for supervised training. This approach generalizes poorly, and acquiring large-scale depth datasets is time-consuming and requires well-calibrated sensors.
2. **Trade-off between inference speed and accuracy**: Many existing methods prioritize accuracy over inference speed, which is unacceptable for autonomous driving systems that must process incoming images with low latency.

To solve these problems, MGNiceNet uses a linked kernel formulation for panoptic segmentation and self-supervised depth estimation. In addition, a panoptic-guided motion masking technique improves depth estimation without relying on video panoptic segmentation annotations. With these changes, MGNiceNet achieves state-of-the-art results among real-time methods on Cityscapes and KITTI and narrows the gap to computationally more demanding methods.

### Formula summary

- **Self-supervised loss function for depth estimation**:
  \[ L_d = \lambda_{\text{phot}} L_{\text{phot}} + \lambda_{\text{smooth}} L_{\text{smooth}} \]
  where \(L_{\text{phot}}\) is the photometric loss, \(L_{\text{smooth}}\) is the smoothness loss, and \(\lambda_{\text{phot}}\) and \(\lambda_{\text{smooth}}\) are their weights.
- **Photometric loss calculation**:
  \[ L_{\text{phot}}(I(t), \hat{I}(t)_i) = \alpha \left(1 - \text{SSIM}(I(t), \hat{I}(t)_i)\right)^2 + (1 - \alpha) \left| I(t) - \hat{I}(t)_i \right| \]
  where the structural similarity index (SSIM) measures the similarity between the reference frame \(I(t)\) and the warped context frame \(\hat{I}(t)_i\), and \(\alpha\) weights the SSIM term against the absolute-difference term.
- **Final loss function**:
  \[ L = L_p + \lambda_{\text{depth}} L_d \]
  where \(L_p\) is the panoptic segmentation loss and \(\lambda_{\text{depth}}\) is the balancing parameter.

The final loss trains both tasks jointly, with the depth branch supervised only through the self-supervised terms above rather than through ground-truth depth labels.
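
The way the loss terms in the formula summary combine can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the inputs are plain scalars rather than image tensors, the SSIM value and the absolute difference are assumed to be computed elsewhere, and the photometric term is transcribed exactly as the formula is written in this summary. The numeric values in the usage lines are arbitrary and are not taken from the paper.

```python
def photometric_term(ssim: float, abs_diff: float, alpha: float) -> float:
    """Photometric loss for one pixel, as written in the formula summary.

    ssim     -- SSIM between reference frame and warped context frame
    abs_diff -- absolute intensity difference |I(t) - I_hat(t)_i|
    alpha    -- weight of the SSIM term against the absolute-difference term
    """
    return alpha * (1.0 - ssim) ** 2 + (1.0 - alpha) * abs_diff


def depth_loss(l_phot: float, l_smooth: float,
               lambda_phot: float, lambda_smooth: float) -> float:
    """L_d = lambda_phot * L_phot + lambda_smooth * L_smooth."""
    return lambda_phot * l_phot + lambda_smooth * l_smooth


def total_loss(l_panoptic: float, l_depth: float, lambda_depth: float) -> float:
    """L = L_p + lambda_depth * L_d."""
    return l_panoptic + lambda_depth * l_depth


# Arbitrary illustrative values (assumptions, not settings from the paper).
l_phot = photometric_term(ssim=0.9, abs_diff=0.05, alpha=0.85)
l_d = depth_loss(l_phot, l_smooth=0.02, lambda_phot=1.0, lambda_smooth=0.001)
loss = total_loss(l_panoptic=1.2, l_depth=l_d, lambda_depth=1.0)
print(loss)
```

Keeping the three weights as explicit arguments mirrors the role of \(\lambda_{\text{phot}}\), \(\lambda_{\text{smooth}}\), and \(\lambda_{\text{depth}}\) in the formulas: each one can be tuned independently to balance the depth terms against each other and against the panoptic loss.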