Abstract: As a flexible means of passive 3D sensing, unsupervised learning of depth from monocular videos is becoming an important research topic. It uses the photometric errors between the target view and the views synthesized from its adjacent source views as the loss, instead of the difference from the ground truth. Despite significant recent progress, occlusion and scene dynamics in real-world scenes still adversely affect the learning. In this paper, we show that deliberately manipulating photometric errors can deal with these difficulties efficiently. We first propose an outlier masking technique that treats occluded or dynamic pixels as statistical outliers in the photometric error map. With outlier masking, the network learns the depth of objects that move in the opposite direction to the camera more accurately. To the best of our knowledge, such cases have not been seriously considered in previous works, even though they pose a high risk in applications like autonomous driving. We also propose an efficient weighted multi-scale scheme to reduce artifacts in the predicted depth maps. Extensive experiments on the KITTI dataset and additional experiments on the Cityscapes dataset verify the effectiveness of the proposed approach on depth or ego-motion estimation. Furthermore, for the first time, we evaluate the predicted depth on the regions of dynamic objects and the static background separately, for both supervised and unsupervised methods. The evaluation further verifies the effectiveness of our proposed approach and provides some interesting observations that might inspire future research in this direction.
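
The outlier masking idea described in the abstract can be illustrated with a short sketch. The abstract does not state how outliers are identified, so the percentile threshold below is purely an assumption, and the function name `masked_photometric_loss` and the parameter `upper_pct` are hypothetical. The sketch computes a per-pixel L1 photometric error between a target view and a synthesized view, flags the largest errors as outliers, and averages the loss over the remaining pixels only.

```python
import numpy as np


def masked_photometric_loss(target, synthesized, upper_pct=90.0):
    """Mean L1 photometric error over pixels not flagged as outliers.

    target, synthesized: float arrays of shape (H, W) or (H, W, C).
    upper_pct: hypothetical percentile threshold (an assumption, not
    taken from the paper); errors above it are treated as outliers,
    e.g. occluded or moving pixels.
    """
    err = np.abs(target - synthesized)
    if err.ndim == 3:
        err = err.mean(axis=-1)  # average the error over colour channels
    threshold = np.percentile(err, upper_pct)
    mask = err <= threshold  # True for pixels kept in the loss
    return err[mask].mean(), mask


# Usage: a single badly reconstructed pixel is excluded from the loss.
target = np.zeros((10, 10))
synthesized = np.zeros((10, 10))
synthesized[3, 4] = 1.0
loss, mask = masked_photometric_loss(target, synthesized)
```

In a training pipeline the same mask would be applied to a differentiable loss; NumPy is used here only to keep the illustration self-contained.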

Learning Features by Watching Objects Move

Motion = Video - Content: Towards Unsupervised Learning of Motion Representation from Videos

Discovering Objects that Can Move

Unsupervised Visual Representation Learning by Tracking Patches in Video

Unsupervised Learning of Object Keypoints for Perception and Control

Unsupervised Multi-object Segmentation by Predicting Probable Motion Patterns

Learning Visual Features Under Motion Invariance

Online Unsupervised Feature Learning for Visual Tracking

Learning 3D object-centric representation through prediction

Learning a Deep Compact Image Representation for Visual Tracking

Tracking Without Re-recognition in Humans and Machines

Learning Video Object Segmentation with Visual Memory

Unsupervised Monocular Depth Perception: Focusing on Moving Objects

Motion-inductive Self-supervised Object Discovery in Videos

Exploiting Spatial Invariance for Scalable Unsupervised Object Tracking

Unsupervised Learning of Depth Estimation, Camera Motion Prediction and Dynamic Object Localization from Video

Learning to Track Objects from Unlabeled Videos

Learning for Scalable Multimedia Representation

Discovering objects and their location in videos using spatial-temporal context words

Unsupervised Learning of Long-Term Motion Dynamics for Videos

Unsupervised Spatio-temporal Latent Feature Clustering for Multiple-object Tracking and Segmentation