Any360D: Towards 360 Depth Anything with Unlabeled 360 Data and Möbius Spatial Augmentation

Zidong Cao, Jinjing Zhu, Weiming Zhang, Lin Wang
2024-06-19
Abstract: Recently, Depth Anything Model (DAM) - a type of depth foundation model - reveals impressive zero-shot capacity for diverse perspective images. Despite its success, it remains an open question regarding DAM's performance on 360 images that enjoy a large field-of-view (180x360) but suffer from spherical distortions. To this end, we establish, to our knowledge, the first benchmark that aims to 1) evaluate the performance of DAM on 360 images and 2) develop a powerful 360 DAM for the benefit of the community. For this, we conduct a large suite of experiments that consider the key properties of 360 images, e.g., different 360 representations, various spatial transformations, and diverse indoor and outdoor scenes. This way, our benchmark unveils some key findings, e.g., DAM is less effective for diverse 360 scenes and sensitive to spatial transformations. To address these challenges, we first collect a large-scale unlabeled dataset including diverse indoor and outdoor scenes. We then propose a semi-supervised learning (SSL) framework to learn a 360 DAM, dubbed Any360D. Under the umbrella of SSL, Any360D first learns a teacher model by fine-tuning DAM via metric depth supervision. Then, we train the student model by uncovering the potential of large-scale unlabeled data with pseudo labels from the teacher model. Möbius transformation-based spatial augmentation (MTSA) is proposed to impose consistency regularization between the unlabeled data and spatially transformed ones. This subtly improves the student model's robustness to various spatial transformations even under severe distortions. Extensive experiments demonstrate that Any360D outperforms DAM and many prior data-specific models, e.g., PanoFormer, across diverse scenes, showing impressive zero-shot capacity for being a 360 depth foundation model.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the poor performance of existing depth estimation models, such as the Depth Anything Model (DAM), on 360-degree images. These images have a large field of view (180°×360°) but suffer from spherical distortion, which existing models handle poorly. In addition, 360-degree datasets are usually limited to specific scenes (such as indoor scenes) and lack both diversity and large-scale labeled data, so models trained on them generalize poorly.

To address these problems, the paper sets two objectives:

1. **Establish a benchmark**: Evaluate the performance of DAM on 360-degree images, covering different 360-degree representations, spatial transformations, and diverse indoor and outdoor scenes.
2. **Develop a powerful 360-degree depth estimation model**: Collect a large-scale unlabeled dataset and propose a semi-supervised learning framework (Any360D) to improve the model's generalization ability and robustness.

### Specific problems and solutions

#### 1. Representation of 360-degree images

- **Problem**: The choice of representation has a significant impact on model performance. Equirectangular projection (ERP), cube maps, tangent patches, and other representations each have their own advantages and disadvantages.
- **Solution**: Experiments show that the ERP representation performs best without post-processing, while other representations can recover local details but require additional post-processing steps.

#### 2. Robustness to spatial transformations

- **Problem**: DAM is sensitive to spatial transformations (such as vertical rotation and scaling), and its performance drops sharply under scaling in particular.
- **Solution**: Introduce Möbius-transformed spatial augmentation (MTSA), which improves the model's robustness to various spatial transformations through consistency regularization.

#### 3. Diverse scenes

- **Problem**: Existing 360-degree datasets concentrate mainly on indoor scenes, so models perform poorly in outdoor scenes.
- **Solution**: Collect a large-scale unlabeled dataset containing diverse indoor and outdoor scenes, and train on it with a semi-supervised learning framework to improve generalization across scenes.

#### 4. Choice of optimization space

- **Problem**: Traditional disparity-supervised methods are not effective in far-distance areas and on small objects.
- **Solution**: Adopt metric depth supervision, which particularly improves structural details in the equatorial region.

#### 5. Influence of model size

- **Problem**: Although larger backbones (such as ViT-L) perform better in some scenes, structural details in the equatorial region can still be missing or blurry.
- **Solution**: Fine-tune the DAM encoder through low-rank adaptation (LoRA) and combine it with the semi-supervised learning framework to improve overall performance.

### Summary

The main contributions of the paper include:

1. Establishing the first comprehensive benchmark for evaluating the performance of DAM on 360-degree images.
2. Proposing a semi-supervised learning framework, Any360D, which uses large-scale unlabeled data and Möbius-transformed spatial augmentation to improve the model's generalization ability and robustness.
3. Showing experimentally that Any360D performs well under various spatial transformations and in diverse scenes, with impressive zero-shot ability.
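
To make the Möbius-transformed spatial augmentation mentioned above more concrete, here is a minimal sketch of how a Möbius transformation can move a point on the viewing sphere. It assumes the standard construction: project the sphere onto the complex plane stereographically, apply f(w) = (a·w + b) / (c·w + d), and project back. The function names and the choice of parameters are hypothetical; this summary does not state how the paper parameterizes or samples its transformations.

```python
import math

def sphere_to_plane(lon, lat):
    """Stereographic projection from the north pole (returns None for the pole)."""
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    if 1.0 - z < 1e-12:
        return None  # the north pole maps to the point at infinity
    return complex(x, y) / (1.0 - z)

def plane_to_sphere(w):
    """Inverse stereographic projection; None (infinity) maps to the north pole."""
    if w is None:
        return 0.0, math.pi / 2
    r2 = abs(w) ** 2
    x = 2 * w.real / (r2 + 1)
    y = 2 * w.imag / (r2 + 1)
    z = (r2 - 1) / (r2 + 1)
    return math.atan2(y, x), math.asin(max(-1.0, min(1.0, z)))

def mobius(w, a, b, c, d):
    """Apply f(w) = (a*w + b) / (c*w + d), treating None as infinity."""
    if a * d - b * c == 0:
        raise ValueError("degenerate transformation: a*d - b*c must be non-zero")
    if w is None:
        return None if c == 0 else complex(a / c)
    denom = c * w + d
    if denom == 0:
        return None
    return (a * w + b) / denom

def mobius_on_sphere(lon, lat, a=1, b=0, c=0, d=1):
    """Move one (longitude, latitude) point, in radians, by a Mobius transformation."""
    return plane_to_sphere(mobius(sphere_to_plane(lon, lat), a, b, c, d))

# Zoom: a real a > 1 pushes points away from the south pole toward the north pole.
zoom_lon, zoom_lat = mobius_on_sphere(0.3, 0.0, a=2.0)

# Rotation about the vertical axis: a unit-modulus a shifts longitude only.
rot_lon, rot_lat = mobius_on_sphere(0.3, 0.0, a=complex(math.cos(0.5), math.sin(0.5)))
```

To warp a whole ERP image, one would typically evaluate the inverse transformation at every output pixel's (longitude, latitude) and resample the source image there. A consistency loss can then compare the prediction on the warped image with the identically warped prediction on the original.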
Through these improvements, the paper addresses the main challenges that existing depth estimation models face on 360-degree images and provides new research directions and technical means for the field of 360-degree depth estimation.
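
The preference for metric depth over disparity supervision, noted under "Choice of optimization space", can be illustrated with a small piece of arithmetic. The sketch below assumes disparity is simply inverse depth and uses a hypothetical helper, `disparity_error`. It shows one common intuition and is not taken from the paper's own analysis: the same depth error in metres produces a much smaller disparity error for far points than for near ones, so a loss defined on disparity penalizes far-range mistakes only weakly.

```python
def disparity_error(true_depth_m, depth_error_m):
    """Absolute inverse-depth error when depth is over-estimated by depth_error_m."""
    return abs(1.0 / true_depth_m - 1.0 / (true_depth_m + depth_error_m))

# The same 0.5 m mistake, once on a near point and once on a far point.
near = disparity_error(1.0, 0.5)    # 1 m predicted as 1.5 m
far = disparity_error(10.0, 0.5)    # 10 m predicted as 10.5 m

# How much more strongly a disparity loss reacts to the near mistake.
ratio = near / far
print(f"near: {near:.4f}, far: {far:.4f}, ratio: {ratio:.1f}")
```

Under these assumptions the near-range mistake is weighted about 70 times more heavily than the far-range one, whereas a loss defined on metric depth would weight both equally.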