Multimodal Generalized Category Discovery

Yuchang Su, Renping Zhou, Siyu Huang, Xingjian Li, Tianyang Wang, Ziyue Wang, Min Xu
2024-09-18
Abstract: Generalized Category Discovery (GCD) aims to classify inputs into both known and novel categories, a task crucial for open-world scientific discoveries. However, current GCD methods are limited to unimodal data, overlooking the inherently multimodal nature of most real-world data. In this work, we extend GCD to a multimodal setting, where inputs from different modalities provide richer and complementary information. Through theoretical analysis and empirical validation, we identify that the key challenge in multimodal GCD lies in effectively aligning heterogeneous information across modalities. To address this, we propose MM-GCD, a novel framework that aligns both the feature and output spaces of different modalities using contrastive learning and distillation techniques. MM-GCD achieves new state-of-the-art performance on the UPMC-Food101 and N24News datasets, surpassing previous methods by 11.5% and 4.7%, respectively.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses Generalized Category Discovery (GCD) in a multimodal setting. The goal is to classify open-world inputs into both known and novel categories. This task is particularly important for scientific discovery, such as identifying novel variants related to rare diseases in genetic data analysis. However, existing GCD methods mainly focus on unimodal data, neglecting the multimodal nature of real-world data. For example, radiology images (such as X-rays and CT scans) are often paired with clinical reports, with each modality providing unique information. Using such multimodal data can improve decision accuracy, much as radiologists use both images and clinical reports to reach more accurate diagnoses.

### Main Contributions

1. **Introduction of the Multimodal Generalized Category Discovery Setting**: This setting is closer to real-world scenarios where data naturally exists in multiple modalities.
2. **Theoretical and Empirical Analysis**: Demonstrates that modality alignment is key to successfully addressing the multimodal generalized category discovery problem.
3. **Proposed New Alignment Methods**: Achieves effective alignment of feature space and output space through contrastive learning and distillation techniques, significantly improving classification performance.

### Experimental Results

The paper validates the effectiveness of MM-GCD on two benchmark datasets (UPMC-Food101 and N24News) and conducts ablation studies. Experimental results show that MM-GCD achieves new state-of-the-art performance on these datasets, improving by 11.5% and 4.7% over previous methods, respectively. Utilizing multimodal data improves performance by 6.8% and 3.4% over unimodal methods, indicating that different modalities provide complementary and rich information beneficial for the GCD task.

### Method Overview

1. **Feature Space Alignment**: Aligns features from different modalities through multimodal contrastive learning, so that similar concepts have similar features across modalities.
2. **Output Space Alignment**: Enforces consistent classification results across modalities through distillation. The predictions of one modality serve as the target for another modality, and this is combined with entropy minimization to keep decisions consistent between modalities.

### Conclusion

Overall, this paper addresses the classification problem of multimodal data in an open world by introducing the Multimodal Generalized Category Discovery framework (MM-GCD). Through theoretical analysis and experiments, it demonstrates that modality alignment is key, and the proposed alignment methods significantly improve classification performance.
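To make the two alignment steps from the Method Overview more concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the summary does not give the exact loss formulas, so the symmetric InfoNCE form of the contrastive loss, the KL direction used for distillation, the temperatures, and all function names (`cross_modal_contrastive_loss`, `distillation_loss`, `mean_entropy`) are assumptions chosen for illustration. A real system would use a tensor library and learned encoders rather than lists of floats.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_modal_contrastive_loss(image_feats, text_feats, temperature=0.1):
    """Feature-space alignment (assumed symmetric InfoNCE form).

    The i-th image and i-th text are treated as a positive pair; every
    other pairing in the batch acts as a negative.
    """
    img = [normalize(v) for v in image_feats]
    txt = [normalize(v) for v in text_feats]
    n = len(img)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text direction: softmax over each row.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image direction: softmax over each column.
    t2i = -sum(log_softmax([sim[i][j] for i in range(n)])[j]
               for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Output-space alignment: mean KL(teacher || student) over the batch.

    One modality's predictions (teacher) serve as the target for the
    other modality (student). Which modality plays which role is an
    assumption left to the caller.
    """
    total = 0.0
    for t, s in zip(teacher_logits, student_logits):
        log_p = log_softmax([z / temperature for z in t])
        log_q = log_softmax([z / temperature for z in s])
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


def mean_entropy(logits_batch):
    """Average prediction entropy; minimizing it sharpens the predictions."""
    total = 0.0
    for logits in logits_batch:
        log_p = log_softmax(logits)
        total -= sum(math.exp(lp) * lp for lp in log_p)
    return total / len(logits_batch)


# Toy usage with made-up values: two samples, 2-D features, 2 classes.
image_feats = [[1.0, 0.1], [0.2, 1.0]]
text_feats = [[0.9, 0.0], [0.0, 1.1]]
image_logits = [[2.0, -1.0], [-0.5, 1.5]]
text_logits = [[1.0, 0.0], [0.0, 1.0]]

total_loss = (cross_modal_contrastive_loss(image_feats, text_feats)
              + distillation_loss(image_logits, text_logits)
              + mean_entropy(text_logits))
print(f"combined alignment loss: {total_loss:.4f}")
```

The equal weighting of the three terms in `total_loss` is a placeholder. In practice the relative weights, the temperatures, and the choice of teacher modality would be tuned, and these terms would be added to whatever classification objective the GCD method already uses.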