Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time Adaptation

Haozhi Cao, Yuecong Xu, Jianfei Yang, Pengyu Yin, Xingyu Ji, Shenghai Yuan, Lihua Xie
2024-07-25
Abstract: Multi-modal test-time adaptation (MM-TTA) is proposed to adapt models to an unlabeled target domain by leveraging complementary multi-modal inputs in an online manner. Previous MM-TTA methods for 3D segmentation rely on predictions of cross-modal information in each input frame, but they ignore the fact that predictions of geometric neighborhoods within consecutive frames are highly correlated, which leads to unstable predictions across time. To fill this gap, we propose ReLiable Spatial-temporal Voxels (Latte), an MM-TTA method that leverages reliable cross-modal spatial-temporal correspondences for multi-modal 3D segmentation. Motivated by the fact that reliable predictions should be consistent with their spatial-temporal correspondences, Latte aggregates consecutive frames in a sliding-window manner and constructs Spatial-Temporal (ST) voxels to capture temporally local prediction consistency for each modality. After filtering out ST voxels with high ST entropy, Latte conducts cross-modal learning for each point and pixel by attending to those with reliable and consistent predictions among both spatial and temporal neighborhoods. Experimental results show that Latte achieves state-of-the-art performance on three different MM-TTA benchmarks compared to previous MM-TTA or TTA methods. Project site: https://sites.google.com/view/eccv24-latte
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses unstable predictions in 3D semantic segmentation during Multi-Modal Test-Time Adaptation (MM-TTA). Specifically:

- **Background and Problem**: Existing MM-TTA methods rely on the predictions made within each input frame and ignore the fact that predictions for geometric neighborhoods across consecutive frames are highly correlated. As a result, their predictions are unstable over time, which can severely affect downstream tasks such as semantic-based retrieval and obstacle recognition.
- **Solution**: The authors propose Reliable Spatial-temporal Voxels (Latte), which leverages the spatiotemporal correspondence between consecutive frames to make predictions more stable and consistent. Latte aggregates consecutive frames within a sliding window and constructs spatiotemporal voxels (ST voxels) to evaluate the prediction reliability of each modality. This reliability estimate then guides cross-modal learning.
- **Contribution**: Latte is the first method to introduce spatiotemporal correlation into MM-TTA. It uses ST voxels and their entropy to assess prediction reliability, and combines this with an adaptive cross-modal attention mechanism to reduce the impact of noisy modalities. Experimental results show that Latte outperforms existing TTA and MM-TTA methods on three different MM-TTA benchmarks.

In summary, the paper addresses the instability of single-frame predictions in multi-modal test-time adaptation for 3D segmentation by introducing spatiotemporal information to improve the model's online adaptation performance.
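
The voxel-and-entropy step described above can be illustrated with a small sketch. It is not the authors' implementation, and the exact definition of ST entropy in the paper may differ. It assumes that the frames in a window are already registered into one shared coordinate frame, that each point carries a hard predicted label from a single modality, and that a voxel counts as reliable when the entropy of its label histogram falls below a threshold. The names `st_voxel_entropy` and `reliable_voxels`, the voxel size, and the threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def st_voxel_entropy(frames, voxel_size=0.5):
    """Group points from a window of frames into voxels and score each voxel.

    frames: list of frames; each frame is a list of ((x, y, z), label) pairs.
            Coordinates are assumed to be in one shared world frame already
            (an assumption of this sketch, not something the source specifies).
    Returns a dict {voxel_index: entropy_in_nats} over the labels per voxel.
    """
    votes = defaultdict(Counter)
    for frame in frames:
        for (x, y, z), label in frame:
            # Integer voxel index of the point at the chosen resolution.
            key = (math.floor(x / voxel_size),
                   math.floor(y / voxel_size),
                   math.floor(z / voxel_size))
            votes[key][label] += 1

    entropies = {}
    for key, counts in votes.items():
        total = sum(counts.values())
        # Shannon entropy of the label histogram inside this voxel.
        entropies[key] = sum((c / total) * math.log(total / c)
                             for c in counts.values())
    return entropies


def reliable_voxels(entropies, max_entropy=0.5):
    """Keep voxels whose label entropy is at or below the threshold."""
    return {key for key, h in entropies.items() if h <= max_entropy}


# Two consecutive frames: one voxel is labeled consistently, one is not.
window = [
    [((0.1, 0.1, 0.1), "car"), ((1.2, 0.1, 0.1), "road")],
    [((0.2, 0.2, 0.1), "car"), ((1.3, 0.2, 0.1), "sidewalk")],
]
scores = st_voxel_entropy(window, voxel_size=0.5)
kept = reliable_voxels(scores, max_entropy=0.5)
```

In this toy window, the voxel whose points are labeled "car" in both frames has zero entropy and is kept, while the voxel that flips between "road" and "sidewalk" has entropy ln 2 and is filtered out. In a multi-modal setting the same scoring would be run separately on the 2D and 3D predictions, so that each modality gets its own per-voxel reliability estimate.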