Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification

Jiexuan Yan, Sheng Huang, Nankun Mu, Luwen Huangfu, Bo Liu
2024-08-15
Abstract: Real-world data consistently exhibits a long-tailed distribution, often spanning multiple categories. This complexity underscores the challenge of content comprehension, particularly in scenarios requiring Long-Tailed Multi-Label Image Classification (LTMLC). In such contexts, imbalanced data distribution and multi-object recognition pose significant hurdles. To address this issue, we propose a novel and effective approach for LTMLC, termed Category-Prompt Refined Feature Learning (CPRFL), utilizing semantic correlations between different categories and decoupling category-specific visual representations for each category. Specifically, CPRFL initializes category-prompts from the pretrained CLIP's embeddings and decouples category-specific visual representations through interaction with visual features, thereby facilitating the establishment of semantic correlations between the head and tail classes. To mitigate the visual-semantic domain bias, we design a progressive Dual-Path Back-Propagation mechanism to refine the prompts by progressively incorporating context-related visual information into prompts. Simultaneously, the refinement process facilitates the progressive purification of the category-specific visual representations under the guidance of the refined prompts. Furthermore, taking into account the negative-positive sample imbalance, we adopt the Asymmetric Loss as our optimization objective to suppress negative samples across all classes and potentially enhance the head-to-tail recognition performance. We validate the effectiveness of our method on two LTMLC benchmarks and extensive experiments demonstrate the superiority of our work over baselines. The code is available at https://github.com/jiexuanyan/CPRFL.
Computer Vision and Pattern Recognition
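The abstract names the Asymmetric Loss as the optimization objective for handling the negative-positive sample imbalance. As a rough illustration only, the sketch below follows the commonly cited ASL formulation: separate focusing exponents for positive and negative labels, plus a probability margin that zeroes out easy negatives. The function name `asymmetric_loss` and the defaults `gamma_pos=0.0`, `gamma_neg=4.0`, and `margin=0.05` are assumptions chosen for illustration; they are not settings reported in the abstract, and this is not the authors' code.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Sum an asymmetric loss over the classes of one image.

    probs   -- predicted per-class probabilities in [0, 1]
    targets -- 0/1 ground-truth labels, same length as probs
    All hyperparameter defaults are illustrative assumptions.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: focal-style term with exponent gamma_pos.
            total -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative label: shift the probability down by the margin so
            # that easy negatives (p <= margin) contribute exactly zero.
            p_m = max(p - margin, 0.0)
            total -= p_m ** gamma_neg * math.log(max(1.0 - p_m, eps))
    return total


if __name__ == "__main__":
    # One positive class and two negatives (one easy, one hard).
    print(asymmetric_loss([0.7, 0.03, 0.9], [1, 0, 0]))
```

With a larger exponent on negatives than on positives, confident-but-wrong negatives still contribute while the many easy negatives are down-weighted, which is the behavior the abstract appeals to when it speaks of suppressing negative samples across all classes.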
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses two main problems in Long-Tailed Multi-Label Image Classification (LTMLC):

1. **Class imbalance problem**: Real-world data usually follows a long-tailed distribution: a few classes (head classes) have very many samples, while many classes (tail classes) have very few. This imbalance leads to poor performance of deep networks on tail classes.
2. **Multi-object recognition problem**: Unlike traditional single-label classification, practical applications often involve images that carry multiple labels. This makes the task more complex, especially when multiple objects must be recognized simultaneously.

### Solutions

To address these problems, the authors propose a new method called Category-Prompt Refined Feature Learning (CPRFL). Its main contributions are as follows:

1. **Utilizing class semantic relevance**: The text encoder of the pre-trained CLIP model extracts class semantics, which establishes the semantic relevance between head and tail classes. These class semantics serve as class prompts that decouple class-specific visual representations.
2. **Progressive dual-path back-propagation mechanism**: A progressive dual-path back-propagation mechanism gradually integrates context-related visual information into the prompts, which in turn gradually purifies the class-specific visual representations and improves their relevance and accuracy.
3. **Asymmetric loss function**: The Asymmetric Loss (ASL) is adopted as the optimization objective to suppress negative samples and improve recognition performance on head and tail classes.

### Method overview

1. **Feature extraction**: Use a backbone network (such as ResNet-101) to extract local image features, then project them into the visual-semantic joint space through a linear layer.
2. **Semantic extraction**: Use the text encoder of the pre-trained CLIP model to extract class semantics and generate class prompts.
3. **Class-prompt initialization**: A prompt initialization network (PI network) maps class semantics to initial class prompts through a non-linear transformation.
4. **Visual-semantic information interaction**: A visual-semantic interaction network (VSI network) uses a Transformer encoder for visual-semantic information interaction and decouples class-specific visual representations.
5. **Class-prompt refined feature learning**: The progressive dual-path back-propagation mechanism gradually refines the prompts, which gradually purifies the class-specific visual representations.
6. **Optimization**: Adopt the Asymmetric Loss (ASL) as the optimization objective to handle the imbalance between positive and negative samples.

### Experimental results

The authors conducted experiments on two LTMLC benchmark datasets (VOC-LT and COCO-LT) to verify the effectiveness of the method. The results show that CPRFL outperforms existing methods in overall performance as well as on head, middle, and tail classes.

### Formulas

- **Class-prompt initialization**:
  \[ P = \text{GELU}(W W_1 + b_1) W_2 + b_2 \]
  where \(W\) is the class semantic embedding, and \(W_1, W_2, b_1, b_2\) are the weight matrices and bias vectors of the linear layers.
- **Attention weight calculation**:
  \[ \alpha_{ij} = \text{softmax}\left(\frac{(W_q p_i)^T (W_k z_j)}{\sqrt{d}}\right) \]
  \[ \bar{p}_i = \sum_{j=1}^{v+c} \alpha_{ij} W_v z_j \]
  \[ p'_i = \text{GELU}(\bar{p}_i W_r + b_3) W_o + b_4 \]
- **Classification probability calculation**: \[ s_