Abstract: Indoor robots rely on depth to perform tasks like navigation or obstacle detection, and single-image depth estimation is widely used to assist perception. Most work on indoor single-image depth prediction pays little attention to model generalizability to unseen datasets, which matters for in-the-wild robustness in system deployment. This work leverages gradient-based meta-learning to gain higher generalizability on zero-shot cross-dataset inference. Unlike the most-studied setting of meta-learning, image classification with explicit class labels, continuous depth values have no explicit task boundaries, because indoor environments vary widely in object arrangement and scene composition. We propose a fine-grained task that treats each RGB-D mini-batch as a task in our meta-learning formulation. We first show that our method on limited data induces a much better prior (up to 27.8% improvement in RMSE). Then, fine-tuning from the meta-learned initialization consistently outperforms baselines without the meta approach. Aiming at generalization, we propose zero-shot cross-dataset protocols and validate the higher generalizability induced by our meta-initialization, which serves as a simple and useful plugin for many existing depth estimation methods. This work at the intersection of depth and meta-learning potentially moves both research areas closer to practical use in robotics and machine perception.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is **to improve the generalization ability of single-image indoor depth estimation models on unseen datasets**. Specifically, the author focuses on how to make the model perform better in zero-shot cross-dataset inference, especially when facing complex and changeable indoor scenes.

### Main problem background

1. **Indoor robot task requirements**: Indoor robots rely on depth information to perform tasks such as navigation and obstacle avoidance, and single-image depth estimation is an important means to achieve these tasks.
2. **Limitations of existing methods**: Most existing single-image depth prediction methods are trained and tested on specific datasets and lack robustness and generalization ability to unseen datasets. Especially for indoor scenes, due to the large changes in object arrangement and scene composition, it is difficult for the model to maintain high performance in new environments.

### Core contributions of the paper

To solve the above problems, the author proposes the following innovations:

1. **Apply meta-learning to pure single-image depth prediction for the first time**: By introducing meta-initialization, the model can obtain better generalization performance with limited data and resources, without the need for multiple training datasets, auxiliary information or other pre-trained networks.
2. **Fine-grained task concept**: In view of the challenge that there are no clear task boundaries in the pure single-image setting, it is proposed to regard each mini-batch as a fine-grained task, thus overcoming the task definition problem in traditional meta-learning.
3. **Zero-shot cross-dataset evaluation protocol**: A new set of evaluation protocols is designed to verify the generalization ability of the model between different datasets, ensuring the robustness and reliability of the model in practical applications.
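To make the zero-shot cross-dataset idea concrete, here is a minimal sketch of how such an evaluation could be scored with RMSE, the metric quoted in the abstract. It is an illustration under stated assumptions, not the paper's protocol: the function names `rmse` and `zero_shot_eval`, the valid-depth threshold, and the placeholder dataset name are all hypothetical. The one property it tries to capture is that the predictor is frozen and is only ever scored on data it was not trained on.

```python
import numpy as np

def rmse(pred, gt, valid_min=1e-3):
    """Root-mean-square error over pixels with valid ground-truth depth.

    Assumption: missing depth is encoded as a value <= valid_min.
    """
    mask = gt > valid_min  # ignore pixels without a depth measurement
    return float(np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2)))

def zero_shot_eval(predict, test_sets):
    """Score a frozen predictor on datasets it never saw during training.

    predict   : callable mapping an image array to a depth map
    test_sets : dict of name -> (list of images, list of depth maps)
    No fine-tuning happens here; the model is only queried.
    """
    results = {}
    for name, (images, depths) in test_sets.items():
        errors = [rmse(predict(img), d) for img, d in zip(images, depths)]
        results[name] = float(np.mean(errors))
    return results

# Toy usage with a hypothetical constant-depth "model" and random data.
rng = np.random.default_rng(0)
images = [rng.random((4, 4, 3)) for _ in range(2)]
depths = [rng.random((4, 4)) + 1.0 for _ in range(2)]
scores = zero_shot_eval(lambda img: np.full(img.shape[:2], 1.5),
                        {"unseen_set_a": (images, depths)})
print(scores)
```

A lower per-dataset RMSE on unseen data is what "higher generalizability" would look like under this kind of scoring.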
### Method overview

The author adopts gradient-based meta-learning, combining a meta-optimizer and a base optimizer. The improvement is achieved through the following steps:

- **Meta-learning stage**: In each meta-iteration, a mini-batch is sampled from the entire training set as a fine-grained task, and the inner loop takes several gradient steps on that mini-batch to minimize the regression loss. At the same time, online augmentation and task augmentation techniques (such as mix-up and channel shuffle) are used to prevent overfitting and improve generalization ability.
- **Supervised learning stage**: The prior weights obtained by meta-learning are used to initialize the subsequent supervised learning, which further optimizes the model parameters and yields more accurate depth estimation results.

### Experimental results

Experiments show that the meta-initialization method improves model performance both on datasets with limited scene diversity and in cross-dataset inference, especially when dealing with complex indoor structures. In addition, the method can be used as a plug-in for a variety of existing depth estimation frameworks, which broadens its applicability.

In conclusion, this paper addresses the generalization problem of single-image depth estimation on unseen datasets by introducing meta-learning and fine-grained tasks, providing support for indoor robot perception and related fields.
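The two-stage procedure described in the method overview (meta-learning a prior, then supervised learning from that prior) can be sketched on a toy problem. The code below is only an illustration under several assumptions: it uses a Reptile-style first-order meta-update as a stand-in, since the summary says only "gradient-based meta-learning"; it replaces the depth network with a linear regressor; and it omits online and task augmentation. All function names and hyperparameters are hypothetical.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of the mean squared error for a linear model y ~ x @ w."""
    return 2.0 * x.T @ (x @ w - y) / len(x)

def meta_initialize(x_train, y_train, w0, meta_iters=200, inner_steps=4,
                    batch_size=16, inner_lr=0.05, meta_lr=0.5, seed=0):
    """Stage 1: learn a prior by treating each mini-batch as one task."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(meta_iters):
        # Sample one mini-batch from the whole training set: the "fine-grained task".
        idx = rng.choice(len(x_train), size=batch_size, replace=False)
        xb, yb = x_train[idx], y_train[idx]
        # Inner loop (base optimizer): several gradient steps on this task.
        w_task = w.copy()
        for _ in range(inner_steps):
            w_task -= inner_lr * mse_grad(w_task, xb, yb)
        # Outer step (meta-optimizer, Reptile-style assumption):
        # move the prior part of the way toward the task-adapted weights.
        w += meta_lr * (w_task - w)
    return w

def supervised_finetune(x_train, y_train, w_init, epochs=50, lr=0.05):
    """Stage 2: ordinary full-batch gradient descent starting from the prior."""
    w = w_init.copy()
    for _ in range(epochs):
        w -= lr * mse_grad(w, x_train, y_train)
    return w

# Toy usage: recover a known linear map from noisy samples.
rng = np.random.default_rng(1)
x = rng.normal(size=(256, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = x @ true_w + 0.01 * rng.normal(size=256)
w_prior = meta_initialize(x, y, np.zeros(3))
w_final = supervised_finetune(x, y, w_prior)
print(w_prior, w_final)
```

In a real depth model the weights would be network parameters and the loss a depth regression loss; the structure of the loops is the part this sketch is meant to show.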

Boosting Generalizability towards Zero-Shot Cross-Dataset Single-Image Indoor Depth by Meta-Initialization

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Monocular Depth Estimation Based on Unsupervised Learning

MetaComp: Learning to Adapt for Online Depth Completion

Towards Zero-Shot Scale-Aware Monocular Depth Estimation

Meta-Transfer Networks for Zero-Shot Learning

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer

Lifelong-MonoDepth: Lifelong Learning for Multi-Domain Monocular Metric Depth Estimation

Learn to Adapt for Self-Supervised Monocular Depth Estimation

GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion

BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Learning Domain Invariant Features for Unsupervised Indoor Depth Estimation Adaptation

Do More With What You Have: Transferring Depth-Scale from Labeled to Unlabeled Domains

GenDepth: Generalizing Monocular Depth Estimation for Arbitrary Camera Parameters via Ground Plane Embedding

Boosting Monocular Depth Estimation with Sparse Guided Points

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

G2-MonoDepth: A General Framework of Generalized Depth Inference from Monocular RGB+X Data

EndoOmni: Zero-Shot Cross-Dataset Depth Estimation in Endoscopy by Robust Self-Learning from Noisy Labels

Towards Robust Monocular Depth Estimation: A New Baseline and Benchmark

Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image