Visual-Language Collaborative Representation Network for Broad-Domain Few-Shot Image Classification

Qianyu Guo, Jieji Ren, Haofen Wang, Tianxing Wu, Weifeng Ge, Wenqiang Zhang
DOI: https://doi.org/10.1145/3664647.3680668
2024-01-01
Abstract: Visual-language models based on CLIP have shown remarkable abilities in general few-shot image classification. However, their performance drops in specialized fields such as healthcare or agriculture, because CLIP's pre-training data does not cover all categories. Existing methods depend excessively on the multi-modal representation and alignment capabilities acquired from CLIP pre-training, which hinders accurate generalization to unfamiliar domains. To address this issue, this paper introduces a novel visual-language collaborative representation network (MCRNet), which aims to acquire a generalized capability for collaborative fusion and representation of multi-modal information. Specifically, MCRNet learns to generate relational matrices from an information fusion perspective to obtain aligned multi-modal features. This relationship generation strategy is category-agnostic, so it can generalize to new domains. A class-adaptive fine-tuning inference technique is also introduced to help MCRNet efficiently learn alignment knowledge for new categories from limited data. Additionally, the paper establishes a new broad-domain few-shot image classification benchmark containing seven evaluation datasets from five domains. Comparative experiments demonstrate that MCRNet outperforms current state-of-the-art models, achieving average improvements of 13.06% and 13.73% in the 1-shot and 5-shot settings, respectively, which highlights its performance and applicability across various domains.