Fine-grained Prototypical Voting with Heterogeneous Mixup for Semi-supervised 2D-3D Cross-modal Retrieval

Fan Zhang, Xian-Sheng Hua, Chong Chen, Xiao Luo
DOI: https://doi.org/10.1109/cvpr52733.2024.01610
2024-01-01
Abstract: This paper studies the problem of semi-supervised 2D-3D retrieval, which aims to align both labeled and unlabeled 2D and 3D data into the same embedding space. The problem is challenging due to the complicated heterogeneous relationships between 2D and 3D data. Moreover, label scarcity in real-world applications hinders the generation of discriminative representations. In this paper, we propose a semi-supervised approach named Fine-grained Prototypical Voting with Heterogeneous Mixup (FIVE), which maps both 2D and 3D data into a common embedding space for cross-modal retrieval. Specifically, we generate fine-grained prototypes to model intra-class variation for both 2D and 3D data. Then, considering each unlabeled sample as a query, we retrieve relevant prototypes to vote for reliable and robust pseudo-labels, which serve as guidance for discriminative learning under label scarcity. Furthermore, to bridge the semantic gap between the two modalities, we mix cross-modal pairs with similar semantics in the embedding space and then perform similarity learning for cross-modal discrepancy reduction in a soft manner. FIVE as a whole is optimized with consideration of sharpness to mitigate the impact of potential label noise. Extensive experiments on benchmark datasets validate the superiority of FIVE compared with a range of baselines in different settings. On average, FIVE outperforms the second-best approach by 4.74% on 3D MNIST, 12.94% on ModelNet10, and 22.10% on ModelNet40.
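The two ideas named in the abstract, prototype voting for pseudo-labels and mixing cross-modal embeddings, can be illustrated with a small sketch. This is not the authors' implementation: the cosine similarity, the top-k similarity-weighted vote, the fixed mixing coefficient `lam`, and all function and variable names (`vote_pseudo_label`, `mixup`, the toy prototypes) are assumptions chosen for illustration, since the abstract does not specify these details.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vote_pseudo_label(query, prototypes, k=3):
    """Assumed voting rule: the k most similar prototypes vote for their
    class, weighted by (non-negative) cosine similarity.

    prototypes: list of (embedding, class_label) pairs, with several
    prototypes per class to capture intra-class variation.
    Returns (label, confidence), where confidence is the winner's vote share.
    """
    scored = [(cosine(query, emb), label) for emb, label in prototypes]
    top_k = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    votes = defaultdict(float)
    for sim, label in top_k:
        votes[label] += max(sim, 0.0)
    total = sum(votes.values())
    if total == 0.0:
        return None, 0.0
    winner = max(votes, key=votes.get)
    return winner, votes[winner] / total


def mixup(emb_2d, emb_3d, lam=0.5):
    """Assumed mixing rule: convex combination of a 2D and a 3D embedding."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(emb_2d, emb_3d)]


# Toy example: two hypothetical prototypes per class in a 2-d embedding space.
prototypes = [
    ([1.0, 0.0], "chair"),
    ([0.9, 0.1], "chair"),
    ([0.0, 1.0], "table"),
    ([0.1, 0.9], "table"),
]
label, confidence = vote_pseudo_label([0.8, 0.2], prototypes, k=3)
mixed = mixup([1.0, 0.0], [0.0, 1.0], lam=0.25)
```

In this toy setup the query lies close to both "chair" prototypes, so they dominate the vote; the vote share could serve as a filter for discarding unreliable pseudo-labels, though the paper's actual criterion may differ.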