Abstract: Scene understanding using multi-modal data is necessary in many applications, e.g., autonomous navigation. To achieve this in a variety of situations, existing models must be able to adapt to shifting data distributions without arduous data annotation. Current approaches assume that the source data is available during adaptation and that the source consists of paired multi-modal data. Both these assumptions may be problematic for many applications. Source data may not be available due to privacy, security, or economic concerns. Assuming the existence of paired multi-modal data for training also entails significant data collection costs and fails to take advantage of widely available freely distributed pre-trained uni-modal models. In this work, we relax both of these assumptions by addressing the problem of adapting a set of models trained independently on uni-modal data to a target domain consisting of unlabeled multi-modal data, without having access to the original source dataset. Our proposed approach solves this problem through a switching framework which automatically chooses between two complementary methods of cross-modal pseudo-label fusion -- agreement filtering and entropy weighting -- based on the estimated domain gap. We demonstrate our work on the semantic segmentation problem. Experiments across seven challenging adaptation scenarios verify the efficacy of our approach, achieving results comparable to, and in some cases outperforming, methods which assume access to source data. Our method achieves an improvement in mIoU of up to 12% over competing baselines. Our code is publicly available at https://github.com/csimo005/SUMMIT.

What problem does this paper attempt to address?

The paper addresses the problem of adapting a set of independently trained unimodal models to a target domain containing unlabelled multimodal data, without access to the original source datasets. It relaxes two assumptions made by prior work:

1. **Source data unavailable**: In many practical applications, the source datasets used for training are inaccessible due to privacy, security, or economic reasons.
2. **No paired multimodal source data**: Existing methods assume that the source datasets contain paired multimodal data. Collecting such data is costly, and the assumption prevents the use of widely available pre-trained unimodal models.

### Background and Motivation

- **Multimodal scene understanding**: In many applications, such as autonomous navigation, multimodal data (e.g., RGB images and point clouds) is needed to improve the performance and robustness of scene understanding.
- **Domain adaptation problem**: When the input data distribution differs from the training distribution, the model's performance degrades. This is the domain shift that domain adaptation aims to counter. It is particularly severe in autonomous navigation because of variations in lighting, weather, and geography.
- **Limitations of existing methods**: Existing cross-modal unsupervised domain adaptation methods (e.g., xMUDA) assume that the source datasets contain paired multimodal data and that these data can be accessed while adapting to the target domain. These assumptions may be difficult to meet in practice.

### Main Contributions of the Paper

1. **Problem definition**: The paper defines a new problem setting: adapting a set of independently trained unimodal models to a target domain containing unlabelled multimodal data, without access to the original source datasets.
2. **Proposed new framework**: The paper proposes a new cross-modal unsupervised domain adaptation framework (SUMMIT) that addresses this problem through pseudo-label fusion. The framework includes the following steps:
   - **Generate pseudo-labels**: Use the trained unimodal models to generate pseudo-labels for the target data.
   - **Pseudo-label fusion**: Automatically switch between two complementary fusion methods (agreement filtering and entropy weighting) to reduce noisy predictions.
   - **Supervised learning**: Use the fused pseudo-labels to supervise training of the models, thereby achieving cross-modal learning.
3. **Experimental validation**: The paper evaluates the method on seven challenging adaptation scenarios. The results are comparable to, and in some cases better than, methods that assume access to source data, with an improvement in mIoU of up to 12% over competing baselines.

### Method Overview

1. **Problem setting**: The paper considers a set of independently trained unimodal models, each trained with supervision on a single modality. After training, the source data is discarded. The target domain contains unlabelled paired multimodal data, and the goal is to adapt the source models to the new domain by leveraging the semantic relationships between modalities.
2. **Framework overview**: The paper adapts the source models through pseudo-label generation and fusion. The framework has two streams, one for 2D inputs and one for 3D inputs. Each modality is processed by a separate feature encoder, and 2D features are sampled by projecting 3D points onto the corresponding RGB images. The network produces four segmentation outputs, consisting of main predictions and modality-conversion predictions. The main predictions are fused into pseudo-labels, which are then used to supervise training.
3. **Pseudo-label fusion**: The paper proposes two complementary pseudo-label fusion methods:
   - **Agreement filtering**: Compare the pseudo-labels from the different modalities and retain only those on which the modalities agree.
   - **Entropy weighting**: Combine pseudo-labels using an information-theoretic approach, with the entropy of the output probabilities serving as weights.

   To further improve the pseudo-labels, a refinement step based on hypothesis testing uses statistics of the target dataset to recover rejected pseudo-labels.
4. **Automatic switching**: The fusion method is selected automatically based on the agreement rate between modalities. When the domain gap is small, entropy weighting performs better; when the domain gap is large, agreement filtering is more effective.

### Conclusion

The paper proposes a new cross-modal unsupervised domain adaptation framework (SUMMIT) that adapts unimodal models to a target domain containing unlabelled multimodal data without access to the original source datasets. Experimental results show that the proposed method achieves significant performance improvements across multiple adaptation scenarios.
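The pseudo-label fusion and automatic switching steps summarized above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `IGNORE` label, the `threshold` value, the particular weighting (maximum entropy minus entropy), and the rule that a high agreement rate indicates a small domain gap are all assumptions made for illustration. The hypothesis-testing refinement is omitted. Inputs are per-point class-probability lists from the 2D and 3D models, and plain Python is used for readability.

```python
import math

IGNORE = -1  # hypothetical label for points excluded from supervision


def entropy(p):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def argmax(p):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(p)), key=p.__getitem__)


def agreement_rate(probs_2d, probs_3d):
    """Fraction of points where both modalities predict the same class."""
    matches = sum(argmax(a) == argmax(b) for a, b in zip(probs_2d, probs_3d))
    return matches / len(probs_2d)


def agreement_filter(probs_2d, probs_3d):
    """Keep a pseudo-label only where the two modalities agree."""
    labels = []
    for p2, p3 in zip(probs_2d, probs_3d):
        c2, c3 = argmax(p2), argmax(p3)
        labels.append(c2 if c2 == c3 else IGNORE)
    return labels


def entropy_weighted(probs_2d, probs_3d):
    """Fuse distributions, trusting the lower-entropy modality more.

    Assumed weighting: weight = log(num_classes) - entropy, so a uniform
    (maximally uncertain) prediction receives weight zero.
    """
    labels = []
    for p2, p3 in zip(probs_2d, probs_3d):
        max_h = math.log(len(p2))
        w2 = max(0.0, max_h - entropy(p2))
        w3 = max(0.0, max_h - entropy(p3))
        if w2 + w3 <= 1e-12:  # both predictions uniform: weight equally
            w2 = w3 = 1.0
        fused = [w2 * a + w3 * b for a, b in zip(p2, p3)]
        labels.append(argmax(fused))
    return labels


def fuse(probs_2d, probs_3d, threshold=0.5):
    """Switch between fusion methods using the agreement rate.

    Assumption: high agreement is treated as a small domain gap, where
    entropy weighting is used; otherwise agreement filtering is used.
    """
    if agreement_rate(probs_2d, probs_3d) >= threshold:
        return entropy_weighted(probs_2d, probs_3d)
    return agreement_filter(probs_2d, probs_3d)
```

In a training loop, points labelled `IGNORE` would be left out of the loss, so agreement filtering trades label coverage for label precision, while entropy weighting assigns a label to every point.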

SUMMIT: Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets

When Masked Image Modeling Meets Source-free Unsupervised Domain Adaptation: Dual-Level Masked Network for Semantic Segmentation

ADeLA: Automatic Dense Labeling with Attention for Viewpoint Shift in Semantic Segmentation

Self-Supervised Model Adaptation for Multimodal Semantic Segmentation

Multi-Target Unsupervised Domain Adaptation for Semantic Segmentation without External Data

Multi-Source Domain Adaptation with Collaborative Learning for Semantic Segmentation

U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation

Multi-Modal Unsupervised Domain Adaptation for Semantic Image Segmentation

Multi-source Domain Adaptation for Semantic Segmentation

Cross-Modal Learning for Domain Adaptation in 3D Semantic Segmentation

MultiSenseSeg: A Cost-Effective Unified Multimodal Semantic Segmentation Model for Remote Sensing

Unsupervised Domain Adaptation Multi-Level Adversarial Network for Semantic Segmentation Based on Multi-Modal Features

MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation

Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation

Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance

Informative Data Mining for One-Shot Cross-Domain Semantic Segmentation

Transferring Multi-Modal Domain Knowledge to Uni-Modal Domain for Urban Scene Segmentation

Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation

A Multi-Grained Unsupervised Domain Adaptation Approach for Semantic Segmentation

Adversarial unsupervised domain adaptation for 3D semantic segmentation with multi-modal learning

Source-Free Domain Adaptation for RGB-D Semantic Segmentation with Vision Transformers