Abstract: Recently, the Depth Anything Model (DAM), a type of depth foundation model, has shown impressive zero-shot capacity for diverse perspective images. Despite this success, DAM's performance on 360 images, which enjoy a large field of view (180°×360°) but suffer from spherical distortions, remains an open question. To this end, we establish, to our knowledge, the first benchmark that aims to 1) evaluate the performance of DAM on 360 images and 2) develop a powerful 360 DAM for the benefit of the community. For this, we conduct a large suite of experiments that consider the key properties of 360 images, e.g., different 360 representations, various spatial transformations, and diverse indoor and outdoor scenes. Our benchmark unveils some key findings, e.g., DAM is less effective for diverse 360 scenes and is sensitive to spatial transformations. To address these challenges, we first collect a large-scale unlabeled dataset including diverse indoor and outdoor scenes. We then propose a semi-supervised learning (SSL) framework to learn a 360 DAM, dubbed Any360D. Under the umbrella of SSL, Any360D first learns a teacher model by fine-tuning DAM via metric depth supervision. Then, we train the student model by uncovering the potential of large-scale unlabeled data with pseudo labels from the teacher model. Möbius transformation-based spatial augmentation (MTSA) is proposed to impose consistency regularization between the unlabeled data and their spatially transformed counterparts. This subtly improves the student model's robustness to various spatial transformations, even under severe distortions. Extensive experiments demonstrate that Any360D outperforms DAM and many prior data-specific models, e.g., PanoFormer, across diverse scenes, showing impressive zero-shot capacity as a 360 depth foundation model.
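The teacher-student scheme described in the abstract can be illustrated with a toy objective. The following is only a minimal sketch under stated assumptions, not the paper's implementation: the L1 loss, the way the three terms are summed, the `weight` parameter, and all function names (`l1_loss`, `student_objective`) are hypothetical, and depth maps are represented as flat Python lists for brevity.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equally sized flat depth maps."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def student_objective(student, teacher, transform, labeled, unlabeled, weight=1.0):
    """Toy semi-supervised objective (assumed form, not taken from the paper).

    student, teacher: callables mapping an image to a flat depth map.
    transform: spatial augmentation applied to images and depth maps alike
               (a stand-in for a Moebius-style augmentation).
    labeled: list of (image, depth) pairs; unlabeled: list of images.
    """
    # Supervised term on labeled data (metric depth supervision).
    sup = sum(l1_loss(student(img), depth) for img, depth in labeled)
    sup /= max(len(labeled), 1)

    pseudo_term = 0.0
    cons_term = 0.0
    for img in unlabeled:
        pseudo = teacher(img)  # pseudo label from the frozen teacher
        pseudo_term += l1_loss(student(img), pseudo)
        # Consistency: the prediction on the transformed image should match
        # the transformed pseudo label.
        cons_term += l1_loss(student(transform(img)), transform(pseudo))
    n = max(len(unlabeled), 1)
    return sup + pseudo_term / n + weight * cons_term / n


# Toy usage: "images" are short lists and the transform is a horizontal flip.
teacher = lambda img: [2.0 * v for v in img]
student = lambda img: [2.0 * v + 0.5 for v in img]
flip = lambda values: list(reversed(values))
total = student_objective(student, teacher, flip,
                          labeled=[([1.0, 2.0], [2.0, 4.0])],
                          unlabeled=[[3.0, 4.0]])
```

In this toy setup the student is off by a constant 0.5 everywhere, so each of the three terms contributes 0.5 to the total.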

What problem does this paper attempt to address?

The paper addresses the poor performance of existing depth estimation models, such as the Depth Anything Model (DAM), on 360-degree images. These images have a large field of view (180°×360°) but suffer from spherical distortion, which makes them difficult for existing models to process effectively. In addition, 360-degree datasets are usually limited to specific scenes (such as indoor scenes) and lack diversity and large-scale labeled data, so models trained on them generalize poorly. To address these problems, the paper sets two objectives:

1. **Establish a benchmark**: Evaluate the performance of DAM on 360-degree images, covering different 360-degree representations, spatial transformations, and diverse indoor and outdoor scenes.
2. **Develop a powerful 360-degree depth estimation model**: Collect a large-scale unlabeled dataset and propose a semi-supervised learning framework (Any360D) to improve the model's generalization ability and robustness.

### Specific problems and solutions

#### 1. Representation of 360-degree images

- **Problem**: The choice of representation has a significant impact on model performance. Equirectangular projection (ERP), cube maps, tangent patches, and other representations each have their own advantages and disadvantages.
- **Solution**: Experiments show that the ERP representation performs best without post-processing, while the other representations can recover local details but require additional post-processing steps.

#### 2. Robustness to spatial transformations

- **Problem**: DAM is sensitive to spatial transformations (such as vertical rotation and scaling), and its performance drops sharply under scaling in particular.
- **Solution**: Introduce Möbius-transformed spatial augmentation (MTSA), which uses consistency regularization to make the model more robust to various spatial transformations.

#### 3. Diverse scenes

- **Problem**: Existing 360-degree datasets focus mainly on indoor scenes, so models perform poorly in outdoor scenes.
- **Solution**: Collect a large-scale unlabeled dataset containing diverse indoor and outdoor scenes and train with a semi-supervised learning framework to improve generalization across scenes.

#### 4. Choice of optimization space

- **Problem**: Traditional disparity-supervised methods are not effective in far-distance areas or on small objects.
- **Solution**: Adopt metric-depth-based supervision, which particularly improves structural details in the equatorial region.

#### 5. Influence of model size

- **Problem**: Although larger backbones (such as ViT-L) perform better in some scenes, structural details in the equatorial region are still missing or blurry.
- **Solution**: Fine-tune the DAM encoder through low-rank adaptation (LoRA) and combine it with the semi-supervised learning framework to improve overall performance.

### Summary

The main contributions of the paper are:

1. Establishing the first comprehensive benchmark for evaluating the performance of DAM on 360-degree images.
2. Proposing the semi-supervised learning framework Any360D, which uses large-scale unlabeled data and Möbius-transformed spatial augmentation to improve the model's generalization ability and robustness.
3. Showing experimentally that Any360D performs well under various spatial transformations and across diverse scenes, with impressive zero-shot ability.

Through these contributions, the paper addresses the challenges that existing depth estimation models face on 360-degree images and offers new research directions and techniques for 360-degree depth estimation.
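To make the spatial transformations mentioned above more concrete, here is a minimal sketch of a Möbius transformation acting on spherical coordinates, using the standard textbook construction via stereographic projection. It is not taken from the paper: the function names and the choice of projecting from the north pole are assumptions, and the parameterization used by MTSA may differ. Rotations of the sphere and zoom-like scalings are both special cases of the map f(w) = (a·w + b) / (c·w + d).

```python
import cmath
import math


def sphere_to_plane(lon, lat):
    """Stereographic projection from the north pole onto the complex plane.

    lon in (-pi, pi], lat in [-pi/2, pi/2); undefined exactly at the north pole.
    """
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return complex(x, y) / (1.0 - z)


def plane_to_sphere(w):
    """Inverse stereographic projection back to (lon, lat)."""
    r2 = abs(w) ** 2
    z = (r2 - 1.0) / (r2 + 1.0)
    return cmath.phase(w), math.asin(z)


def mobius(lon, lat, a, b, c, d):
    """Apply f(w) = (a*w + b) / (c*w + d) to a point on the sphere.

    Requires a*d - b*c != 0. A point with c*w + d == 0 maps to the north
    pole; this sketch does not handle that case and raises ZeroDivisionError.
    """
    if abs(a * d - b * c) < 1e-12:
        raise ValueError("degenerate transformation: a*d - b*c must be nonzero")
    w = sphere_to_plane(lon, lat)
    return plane_to_sphere((a * w + b) / (c * w + d))


# Identity leaves the point unchanged.
same = mobius(0.5, 0.2, 1, 0, 0, 1)
# Scaling by a real factor > 1 keeps longitude and pushes points toward the
# north pole: a zoom-like warp.
zoomed = mobius(0.5, 0.2, 2, 0, 0, 1)
# Multiplying by a unit complex number rotates about the polar axis.
rotated = mobius(0.5, 0.2, cmath.exp(0.3j), 0, 0, 1)
```

To warp an ERP image with such a map, one would typically convert each output pixel to (longitude, latitude), apply the inverse transformation, and sample the source image at the resulting location; that resampling step is omitted here.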

Any360D: Towards 360 Depth Anything with Unlabeled 360 Data and Möbius Spatial Augmentation
