Rethinking Training Objective for Self-Supervised Monocular Depth Estimation - Semantic Cues to Rescue.

Keyao Li, Ge Li, Thomas Li
DOI: https://doi.org/10.1109/icip42928.2021.9506744
2021-01-01
Abstract: Monocular depth estimation has a wide range of applications in modeling 3D scenes. Because ground-truth labels are expensive to collect, many works train in a self-supervised manner. A common practice is to train the network by optimizing a photometric objective (i.e., view synthesis), owing to its effectiveness. However, this training objective is sensitive to optical changes and does not consider object-level cues, which leads to sub-optimal results in some cases, e.g., artifacts in complex regions and depth discontinuities around thin structures. We summarize these failures as depth ambiguities. In this paper, we propose a simple yet effective architecture that introduces semantic cues into supervision to address the problems above. First, by studying these problems, we find that they stem from the limitations of the commonly applied photometric reconstruction training objective. We then propose a method that uses semantic cues to encode the geometric constraint behind view synthesis. The proposed objective is more reliable for confusing pixels and also incorporates object-level perception. Experiments show that, without introducing extra inference complexity, our method greatly alleviates depth ambiguities and performs comparably with state-of-the-art methods on the KITTI benchmark.
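The sensitivity of a photometric objective to optical changes can be illustrated with a toy example. The sketch below is not the paper's loss, which the abstract does not specify; it assumes a plain per-pixel L1 error on tiny grayscale images, and the names `photometric_l1`, `target`, `perfect`, and `brighter` are hypothetical. It shows that even when the synthesized view is geometrically perfect, a uniform brightness shift between views yields a non-zero error that the network may try to explain with wrong depth.

```python
def photometric_l1(target, synthesized):
    """Mean absolute intensity difference between two equally sized grayscale images."""
    total = 0.0
    count = 0
    for row_t, row_s in zip(target, synthesized):
        for t, s in zip(row_t, row_s):
            total += abs(t - s)
            count += 1
    return total / count


# Toy 2x3 grayscale target view, intensities in [0, 1] (illustrative values only).
target = [[0.2, 0.4, 0.6],
          [0.1, 0.3, 0.5]]

# Case 1: perfect geometry and identical lighting, so the synthesized view equals the target.
perfect = [row[:] for row in target]

# Case 2: perfect geometry, but the source view was captured 0.1 brighter (assumed shift).
brighter = [[min(1.0, v + 0.1) for v in row] for row in target]

loss_perfect = photometric_l1(target, perfect)
loss_brighter = photometric_l1(target, brighter)

print("same lighting:", loss_perfect)
print("brighter source:", loss_brighter)
```

In this toy setting the first error is zero and the second is roughly the size of the brightness shift, although the underlying depth is equally correct in both cases. A supervision signal derived from semantic cues, as the abstract proposes, would not depend on raw intensities in this way.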