Abstract: Unsupervised Domain Adaptation has been an efficient approach to transferring the semantic segmentation model across data distributions. Meanwhile, the recent Open-vocabulary Semantic Scene understanding based on large-scale vision language models is effective in open-set settings because it can learn diverse concepts and categories. However, these prior methods fail to generalize across different camera views due to the lack of cross-view geometric modeling. At present, there are limited studies analyzing cross-view learning. To address this problem, we introduce a novel Unsupervised Cross-view Adaptation Learning approach to modeling the geometric structural change across views in Semantic Scene Understanding. First, we introduce a novel Cross-view Geometric Constraint on Unpaired Data to model structural changes in images and segmentation masks across cameras. Second, we present a new Geodesic Flow-based Correlation Metric to efficiently measure the geometric structural changes across camera views. Third, we introduce a novel view-condition prompting mechanism to enhance the view-information modeling of the open-vocabulary segmentation network in cross-view adaptation learning. The experiments on different cross-view adaptation benchmarks have shown the effectiveness of our approach in cross-view modeling, demonstrating that we achieve State-of-the-Art (SOTA) performance compared to prior unsupervised domain adaptation and open-vocabulary semantic segmentation methods.

What problem does this paper attempt to address?

This paper addresses **the modeling of geometric structure changes in cross-view semantic scene understanding**. Existing Unsupervised Domain Adaptation (UDA) methods and open-vocabulary semantic segmentation models generalize poorly across camera views, especially from the car view to the drone view, because they lack cross-view geometric modeling.

### Core of the problem

1. **Cross-view generalization**: Existing UDA methods and open-vocabulary semantic segmentation models do not handle the structural differences between camera views well, especially the change from a ground view (such as the car view) to a high-altitude view (such as the drone view).
2. **Geometric structure change**: The geometric structures of images and segmentation masks change significantly between views, and existing methods fail to model these changes effectively.
3. **High labeling cost**: Labeling drone-view data is expensive and time-consuming, so a method that can perform cross-view adaptation under unsupervised or weakly supervised conditions is required.

### Solutions proposed in the paper

The paper proposes **EAGLE (Efficient Adaptive Geometry-based Learning)**, a method for unsupervised cross-view adaptation learning. EAGLE tackles the cross-view generalization problem with three contributions:

1. **Cross-view geometric constraint**:
   - A new cross-view geometric constraint models the structural changes of images and segmentation masks on unpaired data.
   - The constraint is expressed as:
     \[ D_x(x_s,\bar{x}_t)=\alpha D_y(y_s,\bar{y}_t) \]
     where \(D_x\) and \(D_y\) are metrics of the cross-view structural change of images and segmentation masks, respectively, and \(\alpha\) is a scale factor.
2. **Geodesic flow correlation metric**:
   - A new correlation metric based on the geodesic flow path efficiently measures the geometric structure changes between views.
   - The geodesic flow path captures the structural changes of images and segmentation masks between views. The metric is:
     \[ g(x_s,x_t)=\int_0^1 x_s^{\top}\Pi(\nu)\Pi(\nu)^{\top}x_t \, d\nu \]
3. **View-condition prompting mechanism**:
   - A new view-condition prompting mechanism enhances the view-information modeling of the open-vocabulary segmentation network in cross-view adaptation learning.
   - Adding view information (such as "captured from the [domain] view") to the prompt improves visual context learning.

### Experimental results

The paper reports experiments on multiple cross-view adaptation benchmarks (SYNTHIA → UAVID, GTA → UAVID, BDD → UAVID). The results show that EAGLE reaches state-of-the-art (SOTA) performance in cross-view modeling, and that the view-condition prompting mechanism further improves mIoU.

In summary, by introducing a cross-view geometric constraint, a geodesic flow correlation metric, and a view-condition prompting mechanism, the paper addresses the modeling of geometric structure changes in cross-view semantic scene understanding and improves the generalization of the model across views.
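The three components summarized above can be illustrated with a small, self-contained sketch. It is not the paper's implementation. All function names (`view_prompt`, `geodesic_basis`, `geodesic_correlation`, `constraint_residual`) are hypothetical, and the sketch rests on simplifying assumptions that the summary does not confirm: \(\Pi(\nu)\) is taken to be a single unit vector moving along the geodesic between two one-dimensional subspaces, the integral is approximated with the midpoint rule, and the constraint is checked as a plain absolute residual.

```python
import math

def view_prompt(base_prompt, domain):
    """Append the view condition quoted in the summary to a given prompt."""
    return f"{base_prompt} captured from the {domain} view"

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [c / n for c in v]

def geodesic_basis(u, v, nu):
    """Unit vector at position nu in [0, 1] on the geodesic from span(u) to span(v).

    Assumption: each view is summarized by a 1-D subspace, so Pi(nu) is a vector.
    """
    u, v = unit(u), unit(v)
    c = dot(u, v)
    if c < 0:  # span(v) equals span(-v); choose the sign giving the shorter path
        v, c = [-x for x in v], -c
    theta = math.acos(min(1.0, c))
    if theta < 1e-12:  # the two subspaces coincide
        return u
    s = math.sin(theta)
    a = math.sin((1.0 - nu) * theta) / s
    b = math.sin(nu * theta) / s
    return [a * p + b * q for p, q in zip(u, v)]

def geodesic_correlation(x_s, x_t, u, v, steps=1000):
    """Midpoint-rule estimate of the integral of (x_s . Pi)(Pi . x_t) over nu in [0, 1]."""
    total = 0.0
    for k in range(steps):
        pi = geodesic_basis(u, v, (k + 0.5) / steps)
        total += dot(x_s, pi) * dot(pi, x_t)
    return total / steps

def constraint_residual(d_x, d_y, alpha):
    """Absolute violation of the constraint D_x = alpha * D_y."""
    return abs(d_x - alpha * d_y)
```

With orthogonal source and target directions `u = [1.0, 0.0]` and `v = [0.0, 1.0]`, `geodesic_correlation([1.0, 0.0], [1.0, 0.0], u, v)` is approximately 0.5 and `geodesic_correlation([1.0, 0.0], [0.0, 1.0], u, v)` is approximately 1/π, which matches the closed-form integrals of cos² and cos·sin along a quarter-turn rotation. A higher-dimensional subspace version would need a linear algebra library to compute the principal angles between the two bases.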

EAGLE: Efficient Adaptive Geometry-based Learning in Cross-view Understanding
