Abstract: Real-world data consistently exhibits a long-tailed distribution, often spanning multiple categories. This complexity underscores the challenge of content comprehension, particularly in scenarios requiring Long-Tailed Multi-Label Image Classification (LTMLC). In such contexts, imbalanced data distribution and multi-object recognition pose significant hurdles. To address this issue, we propose a novel and effective approach for LTMLC, termed Category-Prompt Refined Feature Learning (CPRFL), utilizing semantic correlations between different categories and decoupling category-specific visual representations for each category. Specifically, CPRFL initializes category-prompts from the pretrained CLIP's embeddings and decouples category-specific visual representations through interaction with visual features, thereby facilitating the establishment of semantic correlations between the head and tail classes. To mitigate the visual-semantic domain bias, we design a progressive Dual-Path Back-Propagation mechanism to refine the prompts by progressively incorporating context-related visual information into prompts. Simultaneously, the refinement process facilitates the progressive purification of the category-specific visual representations under the guidance of the refined prompts. Furthermore, taking into account the negative-positive sample imbalance, we adopt the Asymmetric Loss as our optimization objective to suppress negative samples across all classes and potentially enhance the head-to-tail recognition performance. We validate the effectiveness of our method on two LTMLC benchmarks and extensive experiments demonstrate the superiority of our work over baselines. The code is available at https://github.com/jiexuanyan/CPRFL.
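
The abstract names the Asymmetric Loss but does not state its formula. As a rough illustration only, the following NumPy sketch follows the commonly published form of that loss: positives and negatives get separate focusing exponents, and negative probabilities are shifted down by a margin so that easy negatives contribute nothing. The function name `asymmetric_loss` and the default values of `gamma_pos`, `gamma_neg`, and `margin` are assumptions chosen for illustration, not the settings used by the CPRFL authors.

```python
import numpy as np

def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Illustrative asymmetric multi-label loss (hyperparameters are assumed).

    logits, targets: arrays of the same shape; targets hold 0/1 labels.
    """
    p = 1.0 / (1.0 + np.exp(-logits))        # per-class sigmoid probability
    p_m = np.clip(p - margin, 0.0, None)     # shifted probability for negatives
    # Positive term: (1 - p)^gamma_pos * log(p)
    loss_pos = targets * (1.0 - p) ** gamma_pos * np.log(np.clip(p, eps, None))
    # Negative term: p_m^gamma_neg * log(1 - p_m); vanishes when p <= margin
    loss_neg = (1.0 - targets) * p_m ** gamma_neg * np.log(np.clip(1.0 - p_m, eps, None))
    return -(loss_pos + loss_neg).sum()

# An easy negative (probability below the margin) contributes no loss,
# while a confidently wrong negative is penalized.
easy_negative = asymmetric_loss(np.array([-5.0]), np.array([0.0]))
hard_negative = asymmetric_loss(np.array([2.0]), np.array([0.0]))
```

With `gamma_pos=0`, `gamma_neg=0`, and `margin=0`, this sketch reduces to ordinary binary cross-entropy, which is one way to sanity-check it.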

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper attempts to solve two main problems in Long-Tailed Multi-Label Image Classification (LTMLC):

1. **Class imbalance problem**: Data in the real world usually presents a long-tailed distribution, that is, the number of samples in some classes is very small (tail classes), while the number of samples in some classes is very large (head classes). This unbalanced distribution leads to poor performance of deep networks on tail classes.
2. **Multi-object recognition problem**: Different from traditional single-label classification tasks, images with multiple labels are often involved in practical application scenarios. This increases the complexity and challenge of the task, especially when multiple objects need to be recognized simultaneously.

### Solutions

To address the above problems, the authors propose a new method, called Category-Prompt Refined Feature Learning (CPRFL). The main contributions of CPRFL are as follows:

1. **Utilizing class semantic relevance**: By using the text encoder of the pre-trained CLIP model to extract class semantics, the semantic relevance between head and tail classes is established. These class semantics are used as class prompts to decouple class-specific visual representations.
2. **Progressive dual-path back-propagation mechanism**: A progressive dual-path back-propagation mechanism is designed to gradually integrate context-related visual information into the prompts, thereby gradually purifying class-specific visual representations and improving their relevance and accuracy.
3. **Asymmetric loss function**: The Asymmetric Loss (ASL) is adopted as the optimization objective to effectively suppress negative samples and improve the recognition performance of head and tail classes.

### Method overview

1. **Feature extraction**: Use a backbone network (such as ResNet-101) to extract local image features and project the features into the visual-semantic joint space through a linear layer.
2. **Semantic extraction**: Use the text encoder of the pre-trained CLIP model to extract class semantics and generate class prompts.
3. **Class-prompt initialization**: Design a prompt initialization network (PI network) to map class semantics into initial class prompts through non-linear transformation.
4. **Visual-semantic information interaction**: Design a visual-semantic interaction network (VSI network) to use the Transformer encoder for visual-semantic information interaction and decouple class-specific visual representations.
5. **Class-prompt refined feature learning**: Gradually refine the prompts through the progressive dual-path back-propagation mechanism to gradually purify class-specific visual representations.
6. **Optimization**: Adopt the Asymmetric Loss Function (ASL) as the optimization objective to deal with the imbalance problem of positive and negative samples.

### Experimental results

The authors conducted experiments on two LTMLC benchmark datasets (VOC-LT and COCO-LT) to verify the effectiveness of the method. The experimental results show that CPRFL outperforms existing methods in overall performance as well as in head, middle, and tail classes.

### Formulas

- **Class-prompt initialization**:
  \[ P=\text{GELU}(W W_1 + b_1)W_2 + b_2 \]
  where \(W\) is the class semantic embedding, and \(W_1, W_2, b_1, b_2\) are the weight matrices and bias vectors of the linear layer.
- **Attention weight calculation**:
  \[ \alpha_{ij}=\text{softmax}\left(\frac{(W_q p_i)^T (W_k z_j)}{\sqrt{d}}\right) \]
  \[ \bar{p}_i=\sum_{j = 1}^{v + c}(\alpha_{ij}W_v z_j) \]
  \[ p'_i=\text{GELU}(\bar{p}_iW_r + b_3)W_o + b_4 \]
- **Classification probability calculation**: \[ s_
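
The class-prompt initialization and attention formulas listed above can be traced with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names (`init_prompts`, `attend`, `refine`), the toy dimensions, the random weights, the tanh approximation of GELU, and the choice to normalize the softmax over the token index \(j\) are all assumptions. The token matrix `Z` stacks `v` visual tokens and `c` prompts, matching the sum over \(j = 1, \dots, v + c\).

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumed variant)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_prompts(W, W1, b1, W2, b2):
    """P = GELU(W W1 + b1) W2 + b2, with W of shape (c, m)."""
    return gelu(W @ W1 + b1) @ W2 + b2

def attend(P, Z, Wq, Wk, Wv):
    """alpha_ij = softmax_j((Wq p_i)^T (Wk z_j) / sqrt(d)); p_bar_i = sum_j alpha_ij Wv z_j."""
    d = P.shape[1]
    Q, K, V = P @ Wq.T, Z @ Wk.T, Z @ Wv.T    # rows are Wq p_i, Wk z_j, Wv z_j
    alpha = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    return alpha @ V, alpha

def refine(P_bar, Wr, b3, Wo, b4):
    """p'_i = GELU(p_bar_i Wr + b3) Wo + b4."""
    return gelu(P_bar @ Wr + b3) @ Wo + b4

# Toy sizes (hypothetical): c classes, m-dim class embeddings, d-dim joint space,
# h hidden units, v visual tokens.
rng = np.random.default_rng(0)
c, m, d, h, v = 3, 8, 4, 6, 5
W = rng.normal(size=(c, m))                   # stand-in for CLIP class embeddings
P = init_prompts(W, rng.normal(size=(m, h)), np.zeros(h),
                 rng.normal(size=(h, d)), np.zeros(d))
F = rng.normal(size=(v, d))                   # stand-in for projected visual features
Z = np.concatenate([F, P], axis=0)            # v + c tokens
P_bar, alpha = attend(P, Z, *(rng.normal(size=(d, d)) for _ in range(3)))
P_new = refine(P_bar, rng.normal(size=(d, h)), np.zeros(h),
               rng.normal(size=(h, d)), np.zeros(d))
```

Each row of `alpha` is a distribution over the `v + c` tokens, so each output row in `P_bar` is a weighted average of the value-projected tokens for one class prompt.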

Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification

LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition

Cross-modal Learning Using Privileged Information for Long-Tailed Image Classification

Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning

Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels

Category-Adaptive Label Discovery and Noise Rejection for Multi-label Recognition with Partial Positive Labels

NCL++: Nested Collaborative Learning for Long-Tailed Visual Recognition

PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition

Balanced Contrastive Learning for Long-Tailed Visual Recognition

Learning Enhanced Features and Inferring Twice for Fine-Grained Image Classification

LAMM: Label Alignment for Multi-Modal Prompt Learning

Enhanced multi-branch learning for long-tailed image recognition

Bt-Vmf Contrastive and Collaborative Learning for Long-Tailed Visual Recognition

Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition

The Solution for Language-Enhanced Image New Category Discovery

Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels

Semantic-Aware Dual Contrastive Learning for Multi-label Image Classification

Balanced Classification: A Unified Framework for Long-Tailed Object Detection

Text-Guided Mixup Towards Long-Tailed Image Categorization

Label prompt for multi-label text classification