Abstract: Generalized Category Discovery (GCD) aims to classify inputs into both known and novel categories, a task crucial for open-world scientific discoveries. However, current GCD methods are limited to unimodal data, overlooking the inherently multimodal nature of most real-world data. In this work, we extend GCD to a multimodal setting, where inputs from different modalities provide richer and complementary information. Through theoretical analysis and empirical validation, we identify that the key challenge in multimodal GCD lies in effectively aligning heterogeneous information across modalities. To address this, we propose MM-GCD, a novel framework that aligns both the feature and output spaces of different modalities using contrastive learning and distillation techniques. MM-GCD achieves new state-of-the-art performance on the UPMC-Food101 and N24News datasets, surpassing previous methods by 11.5% and 4.7%, respectively.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses multimodal Generalized Category Discovery: classifying inputs in an open-world setting into both known and novel categories when the inputs come from more than one modality. This task is particularly important for scientific discovery, such as identifying novel variants related to rare diseases in genetic data analysis. However, existing Generalized Category Discovery (GCD) methods mainly focus on unimodal data, neglecting the multimodal nature of real-world data. For example, radiology images (such as X-rays and CT scans) are often paired with clinical reports, with each modality providing unique information. Using both modalities can improve decision accuracy, much as radiologists rely on both images and clinical reports for more accurate diagnoses.

### Main Contributions

1. **Introduction of the Multimodal Generalized Category Discovery Setting**: This setting is closer to real-world scenarios, where data naturally exists in multiple modalities.
2. **Theoretical and Empirical Analysis**: The paper demonstrates that modality alignment is key to solving the multimodal generalized category discovery problem.
3. **Proposed New Alignment Methods**: The paper aligns the feature space and the output space through contrastive learning and distillation techniques, significantly improving classification performance.

### Experimental Results

The paper validates the effectiveness of MM-GCD on two benchmark datasets (UPMC-Food101 and N24News) and conducts ablation studies. Experimental results show that MM-GCD achieves new state-of-the-art performance on these datasets, improving by 11.5% and 4.7% over previous methods, respectively. Utilizing multimodal data improves performance by 6.8% and 3.4% over unimodal methods, indicating that different modalities provide complementary and rich information beneficial for the GCD task.

### Method Overview

1. **Feature Space Alignment**: Multimodal contrastive learning aligns the features of different modalities, so that similar concepts have similar features across modalities.
2. **Output Space Alignment**: Distillation keeps classification results consistent across modalities. The predictions of one modality serve as the target for another modality, and entropy minimization is added to encourage decision consistency between modalities.

### Conclusion

Overall, this paper addresses the classification problem of multimodal data in an open world by introducing the Multimodal Generalized Category Discovery framework (MM-GCD). Through theoretical analysis and experiments, it demonstrates that modality alignment is key, and the proposed alignment methods significantly improve classification performance.
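The two alignment steps named in the Method Overview can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the summary does not give the exact loss formulas, so the symmetric InfoNCE form, the KL-based distillation term, and all names and defaults here (`feature_alignment_loss`, `output_alignment_loss`, `temperature`, `entropy_weight`) are assumptions chosen for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def feature_alignment_loss(img_feats, txt_feats, temperature=0.1):
    """Symmetric InfoNCE (assumed form): row i of each modality is a positive pair."""
    img = l2_normalize(img_feats)
    txt = l2_normalize(txt_feats)
    logits = img @ txt.T / temperature  # (batch, batch) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits)[idx, idx].mean()    # image -> text direction
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)


def output_alignment_loss(student_logits, teacher_logits, entropy_weight=1.0):
    """Cross-modal distillation plus entropy minimization (assumed form).

    The teacher modality's predictions are the target for the student modality.
    In a real training loop the teacher would be treated as a constant
    (no gradient); NumPy has no gradients, so that is only noted here.
    """
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    p_student = np.exp(log_p_student)
    p_teacher = np.exp(log_p_teacher)
    # KL(teacher || student), averaged over the batch
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1).mean()
    # Mean prediction entropy of the student; minimizing it sharpens decisions
    entropy = -(p_student * log_p_student).sum(axis=-1).mean()
    return kl + entropy_weight * entropy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_features = rng.normal(size=(8, 16))  # hypothetical image embeddings
    text_features = rng.normal(size=(8, 16))   # hypothetical text embeddings
    image_logits = rng.normal(size=(8, 5))     # hypothetical classifier outputs
    text_logits = rng.normal(size=(8, 5))
    print("feature alignment:", feature_alignment_loss(image_features, text_features))
    print("output alignment:", output_alignment_loss(image_logits, text_logits))
```

In this sketch the feature-space term is small when paired image and text embeddings point in the same direction, and the output-space term is small when both modalities make the same confident prediction. How the two terms are weighted against each other and against the supervised loss is not stated in the summary.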

Multimodal Generalized Category Discovery
