Multi-View Token Clustering and Fusion for 3D Object Recognition and Retrieval

Linlong Fan, Yanqi Ge, Wen Li, Lixin Duan
DOI: https://doi.org/10.1109/ICME55011.2023.00200
2023-01-01
Abstract: 3D object recognition has received extensive attention in recent years. Many existing methods tackle the task by rendering 3D objects from multiple views. However, most multi-view recognition methods do not utilize fine-grained information from the different views, which is found to be crucial for improving 3D object representation in the multi-view setting. In this paper, we propose a transformer-based method, referred to as MVCFormer, for multi-view feature clustering and fusion. MVCFormer clusters semantically similar tokens within the same stage and selects representative fine-grained features, which helps to eliminate feature redundancy, remove cluttered backgrounds, and make the selected features more diverse. In addition, our model integrates the features selected at all stages through a cross-attention fusion method to obtain a discriminative 3D object representation. Extensive experiments on benchmark datasets (e.g., ModelNet40, ModelNet10, ShapeNetCore55, and RGBD) demonstrate the effectiveness of the proposed MVCFormer over existing baselines.