MaskMentor: Unlocking the Potential of Masked Self-Teaching for Missing Modality RGB-D Semantic Segmentation

Zhida Zhao, Jia Li, Lijun Wang, Yifan Wang, Huchuan Lu
DOI: https://doi.org/10.1145/3664647.3681698
2024-01-01
Abstract: Existing RGB-D semantic segmentation methods struggle with modality-missing input, where only RGB images or only depth maps are available, which leads to degraded segmentation performance. We tackle this issue with MaskMentor, a new pre-training framework for modality-missing segmentation that improves on its counterparts through two novel designs: Masked Modality and Image Modeling (M2IM), and Self-Teaching via Token-Pixel Joint reconstruction (STTP). M2IM simulates modality-missing scenarios by combining modality-level and patch-level random masking. STTP, in turn, offers an effective self-teaching strategy in which the network being trained plays a dual role, acting simultaneously as teacher and student. The student, given modality-missing input, is supervised by the teacher, given complete-modality input, through both token-wise and pixel-wise masked modeling, which closes the gap between missing and complete input modalities. By integrating M2IM and STTP, MaskMentor significantly improves the generalization ability of the trained model across diverse input conditions and outperforms state-of-the-art methods on two popular benchmarks by a considerable margin. Extensive ablation studies further verify the effectiveness of both designs.
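The combination of modality-level and patch-level random masking described for M2IM can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_masks`, the parameters `patch_mask_ratio` and `modality_drop_prob`, and their default values are all hypothetical, chosen only to show how the two masking levels might be composed.

```python
import random


def sample_masks(num_patches, patch_mask_ratio=0.5, modality_drop_prob=0.5, rng=None):
    """Return (rgb_mask, depth_mask); True marks a hidden patch.

    Hypothetical sketch: names and default ratios are assumptions,
    not values taken from the paper.
    """
    if rng is None:
        rng = random.Random()

    masks = {}
    # Patch-level masking: hide a random subset of patches in each modality.
    for name in ("rgb", "depth"):
        num_hidden = int(num_patches * patch_mask_ratio)
        hidden = set(rng.sample(range(num_patches), num_hidden))
        masks[name] = [i in hidden for i in range(num_patches)]

    # Modality-level masking: sometimes hide one whole modality,
    # imitating RGB-only or depth-only input.
    if rng.random() < modality_drop_prob:
        dropped = rng.choice(("rgb", "depth"))
        masks[dropped] = [True] * num_patches

    return masks["rgb"], masks["depth"]
```

Under these assumptions, calling `sample_masks(196)` would yield two boolean lists of length 196. In a self-teaching setup of the kind the abstract describes, such masks would apply to the student's input, while the teacher would see both modalities complete.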